basho / riak_repl

Riak DC Replication

Negative unacked being reported in realtime queue with high number of dropped objects #458

Open jcapricebasho opened 10 years ago

jcapricebasho commented 10 years ago

In the riak-repl status below, the number of unacked messages from CA to XV is negative. This was observed on Riak Enterprise 1.4.2 with the patched beams from https://github.com/basho/internal_wiki/wiki/Riak-1.4.2-release---beams-for-openx, which include the {CSeq - ASeq - Skips} change that addressed another cause of unacked going negative.
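
For reference, that change computes a consumer's unacked figure roughly as (a sketch of the intent, not the exact code):

    unacked = cseq - aseq - skips

so a negative value means the consumer's recorded skips exceed cseq - aseq; in the dump below, the riak-tq.prod.xv consumer reports unacked = -105177.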

realtime_enabled: "riak-tq.prod.xf, riak-tq.prod.xv"
realtime_started: "riak-tq.prod.xf, riak-tq.prod.xv"
fullsync_enabled: "riak-tq.prod.xf, riak-tq.prod.xv"
fullsync_running: []
proxy_get_enabled: []
riak_repl_stat_ts: 1385153033
server_bytes_sent: 3277772528
server_bytes_recv: 0
server_connects: 0
server_connect_errors: 0
server_fullsyncs: 0
client_bytes_sent: 6243505060
client_bytes_recv: 0
client_connects: 0
client_connect_errors: 0
client_redirect: 0
objects_dropped_no_clients: 0
objects_dropped_no_leader: 0
objects_sent: 2023726
objects_forwarded: 0
elections_elected: 0
elections_leader_changed: 0
client_rx_kbps: [0,0,0,0,0,0,0,0]
client_tx_kbps: [0,0,0,0,0,0,0,0]
server_rx_kbps: [0,0,0,0,0,0,0,0]
server_tx_kbps: [0,0,0,0,0,0,0,0]
rt_source_errors: 458
rt_sink_errors: 0
rt_dirty: 128
leader: undefined
local_leader_message_queue_len: 0
local_leader_heap_size: 233
client_stats: []
sinks: [{sink_stats,
            [{pid,"<0.3230.5531>"},
             {message_queue_len,375},
             {rt_sink_connected_to,too_busy}]},
        {sink_stats,
            [{pid,"<0.15195.7809>"},
             {message_queue_len,0},
             {rt_sink_connected_to,
                 [{source,"riak-tq.prod.xv"},
                  {pid,"<0.15195.7809>"},
                  {connected,true},
                  {transport,ranch_tcp},
                  {socket,
                      [{peername,"10.1.30.36:34764"},
                       {sockname,"10.5.6.22:9085"},
                       {recv_avg,"[456, 456, 456, 456, 456, 456, 456]"},
                       {recv_cnt,"[0, 1, 1, 0, 1, 1]"},
                       {recv_dvi,"[6, 6, 6, 6, 6, 6, 6]"},
                       {recv_kbps,"[0, 0, 0, 0, 0, 0]"},
                       {recv_max,"[1460, 1460, 1460, 1460, 1460, 1460, 1460]"},
                       {send_cnt,"[0, 1, 1, 0, 1, 1]"},
                       {send_kbps,"[0, 0, 0, 0, 0, 0]"},
                       {send_pend,"[0, 0, 0, 0, 0, 0, 0]"}]},
                  {hb_last,{1385,153038,912589}},
                  {helper,"<0.15197.7809>"},
                  {active,true},
                  {deactivated,5042},
                  {source_drops,924},
                  {expect_seq,68422954},
                  {acked_seq,68422953},
                  {pending,0}]}]}]
sources: [{source_stats,
              [{pid,"<0.31044.5995>"},
               {message_queue_len,0},
               {rt_source_connected_to,
                   [{source,"riak-tq.prod.xf"},
                    {pid,"<0.31044.5995>"},
                    {connected,true},
                    {transport,ranch_tcp},
                    {socket,
                        [{peername,"10.33.5.31:9085"},
                         {sockname,"10.5.6.22:64139"},
                         {recv_avg,"[18, 18, 18, 18, 18, 18, 18]"},
                         {recv_cnt,"[59, 58, 59, 58, 59, 63]"},
                         {recv_dvi,"[0, 0, 0, 0, 0, 0, 0]"},
                         {recv_kbps,"[0, 0, 0, 0, 0, 0]"},
                         {recv_max,
                             "[1460, 1460, 1460, 1460, 1460, 1460, 1460]"},
                         {send_cnt,"[59, 58, 59, 59, 58, 63]"},
                         {send_kbps,"[20, 20, 20, 20, 20, 22]"},
                         {send_pend,"[0, 0, 0, 0, 0, 0, 0]"}]},
                    {helper_pid,"<0.31526.5995>"},
                    {hb_rtt,101},
                    {sent_seq,9052585},
                    {objects,183137}]}]},
          {source_stats,
              [{pid,"<0.20719.369>"},
               {message_queue_len,0},
               {rt_source_connected_to,
                   [{source,"riak-tq.prod.xv"},
                    {pid,"<0.20719.369>"},
                    {connected,true},
                    {transport,ranch_tcp},
                    {socket,
                        [{peername,"10.1.30.36:9085"},
                         {sockname,"10.5.6.22:34304"},
                         {recv_avg,"[5, 5, 5, 5, 5, 5, 5]"},
                         {recv_cnt,"[1, 1, 0, 1, 1, 0]"},
                         {recv_dvi,"[0, 0, 0, 0, 0, 0, 0]"},
                         {recv_kbps,"[0, 0, 0, 0, 0, 0]"},
                         {recv_max,"[95, 95, 95, 95, 95, 95, 95]"},
                         {send_cnt,"[1, 1, 0, 1, 1, 0]"},
                         {send_kbps,"[0, 0, 0, 0, 0, 0]"},
                         {send_pend,"[0, 0, 0, 0, 0, 0, 0]"}]},
                    {helper_pid,"<0.21179.369>"},
                    {hb_rtt,62},
                    {sent_seq,674752},
                    {objects,26}]}]}]
fullsync_coordinator: [{"riak-tq.prod.xf",
                        [{cluster,"riak-tq.prod.xf"},
                         {queued,0},
                         {in_progress,0},
                         {starting,0},
                         {successful_exits,256},
                         {error_exits,0},
                         {busy_nodes,0},
                         {running_stats,[]},
                         {socket,
                             [{peername,"10.33.8.40:9085"},
                              {sockname,"10.5.11.39:63704"},
                              {recv_avg,"[65, 65, 65, 65, 65, 65, 65]"},
                              {recv_cnt,"[0, 0, 0, 0, 0, 0]"},
                              {recv_dvi,"[0, 0, 0, 0, 0, 0, 0]"},
                              {recv_kbps,"[0, 0, 0, 0, 0, 0]"},
                              {recv_max,"[95, 95, 95, 95, 95, 95, 95]"},
                              {send_cnt,"[0, 0, 0, 0, 0, 0]"},
                              {send_kbps,"[0, 0, 0, 0, 0, 0]"},
                              {send_pend,"[0, 0, 0, 0, 0, 0, 0]"}]},
                         {fullsyncs_completed,1},
                         {last_fullsync_started,undefined},
                         {last_fullsync_duration,16540},
                         {fullsync_suggested,
                             "riak@10.5.11.39,riak@10.5.12.46,riak@10.5.18.23,riak@10.5.19.48,riak@10.5.33.21,riak@10.5.34.38,riak@10.5.39.30,riak@10.5.6.22,riak@10.5.7.15,riak@10.5.8.36"},
                         {fullsync_suggested_during_fs,[]}]},
                       {"riak-tq.prod.xv",
                        [{cluster,"riak-tq.prod.xv"},
                         {queued,0},
                         {in_progress,0},
                         {starting,0},
                         {successful_exits,256},
                         {error_exits,0},
                         {busy_nodes,0},
                         {running_stats,[]},
                         {socket,
                             [{peername,"10.1.29.32:9085"},
                              {sockname,"10.5.11.39:20801"},
                              {recv_avg,"[70, 70, 70, 70, 70, 70, 70]"},
                              {recv_cnt,"[0, 0, 0, 0, 0, 0]"},
                              {recv_dvi,"[0, 0, 0, 0, 0, 0, 0]"},
                              {recv_kbps,"[0, 0, 0, 0, 0, 0]"},
                              {recv_max,"[95, 95, 95, 95, 95, 95, 95]"},
                              {send_cnt,"[0, 0, 0, 0, 0, 0]"},
                              {send_kbps,"[0, 0, 0, 0, 0, 0]"},
                              {send_pend,"[0, 0, 0, 0, 0, 0, 0]"}]},
                         {fullsyncs_completed,3},
                         {last_fullsync_started,undefined},
                         {last_fullsync_duration,3477},
                         {fullsync_suggested,
                             "riak@10.5.11.39,riak@10.5.12.46,riak@10.5.18.23,riak@10.5.19.48,riak@10.5.33.21,riak@10.5.34.38,riak@10.5.39.30,riak@10.5.6.22,riak@10.5.7.15,riak@10.5.8.36"},
                         {fullsync_suggested_during_fs,[]}]}]
fullsync_coordinator_srv: []
cluster_name: <<"riak-tq.prod.ca">>
cluster_leader: 'riak@10.5.11.39'
connected_clusters: [<<"riak-tq.prod.xf">>,<<"riak-tq.prod.xv">>]
realtime_queue_stats: [{bytes,94866313},
                       {max_bytes,104857600},
                       {consumers,[{"riak-tq.prod.xf",
                                    [{pending,112633},
                                     {unacked,2},
                                     {drops,851806},
                                     {errs,1}]},
                                   {"riak-tq.prod.xv",
                                    [{pending,119044},
                                     {unacked,-105177},
                                     {drops,5101841},
                                     {errs,1}]}]},
                       {overload_drops,1642373}]
proxy_get: [{requester,[]},{provider,[]}]
realtime_send_kbps: 20
realtime_recv_kgbps: 0
fullsync_send_kbps: 0
fullsync_recv_kbps: 0
lordnull commented 10 years ago

The most likely cause of negative unacked counts is the skips counter getting out of sync with a consumer's cseq (last sent sequence number). Under normal circumstances that can't happen, but there is one place where the cseq and the aseq (last acked sequence number) are changed while skips is not: https://github.com/basho/riak_repl/blob/develop/src/riak_repl2_rtq.erl#L665

That is in the queue trim function. If the queue is trimmed past a consumer's cseq, that consumer has its cseq and aseq pulled up to the queue's smallest sequence number, so the calculation above collapses to 0 - 0 - skips, i.e. a negative number the size of the skip count. It looks like there were a ton of skips here.
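
To make that concrete, here is a minimal sketch (hypothetical module and record names, not the actual riak_repl2_rtq code) of a consumer whose cseq and aseq get pulled forward by a trim while skips stays put; the numbers are illustrative, chosen only so the result matches the -105177 reported above.

    %% rtq_trim_sketch.erl -- illustrative only; not the riak_repl2_rtq implementation
    -module(rtq_trim_sketch).
    -export([demo/0]).

    -record(consumer, {cseq  = 0,   %% last sequence number delivered to this consumer
                       aseq  = 0,   %% last sequence number acked by the sink
                       skips = 0}). %% objects skipped instead of delivered

    %% unacked as reported in realtime_queue_stats (per the patched beams):
    unacked(#consumer{cseq = C, aseq = A, skips = S}) ->
        C - A - S.

    %% Trimming the queue past a consumer's position pulls its cseq and aseq up
    %% to the queue's new smallest sequence number but leaves skips untouched.
    trim_to(MinSeq, C = #consumer{cseq = CSeq, aseq = ASeq}) ->
        C#consumer{cseq = max(CSeq, MinSeq), aseq = max(ASeq, MinSeq)}.

    demo() ->
        %% A consumer that has accumulated many skips (illustrative figures):
        Before = #consumer{cseq = 68422953, aseq = 68317776, skips = 105177},
        0 = unacked(Before),        %% healthy: sent - acked - skipped = 0
        %% The queue is trimmed well past this consumer's position:
        After = trim_to(70000000, Before),
        -105177 = unacked(After),   %% cseq - aseq = 0, so unacked = -skips
        ok.

Either way, the -105177 in the report is consistent with skips simply outrunning cseq - aseq after a trim.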

Resolution of this issue should probably include two things:

jcapricebasho commented 10 years ago

https://github.com/basho/riak_ee-issues/issues/12

cmeiklejohn commented 10 years ago

Moving to 2.0.1.

cmeiklejohn commented 10 years ago

Moving to 2.1.