**martinsumner** opened this issue 1 year ago
One thing of note: in this particular environment, it had recently been possible to seed an empty cluster using aae full-sync. Why is this possible when the delta is 20bn, but not when the delta is 100m?

The reason is that when the sink is empty, the query to fetch the Keys/Hashes in response to each segment request returns nothing, and does so in << 1ms. The response is then empty, and so takes almost no buffer space.

Likewise, if the delta is small but the key counts are large, there will be relatively few segments to send, and the process can be completed before the TCP send_timeout (30s) is breached on the spawned process at the `riak_repl_aae_sink`.

This problem will only be seen if there are a large number of keys in the sink, and also a large delta with the source.
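To put rough numbers on that, using figures from the issue description below: with a ~100m delta, each partition repair involves roughly 400K dirty segments, and each segment response is o(10)KB, so the sink has on the order of gigabytes of response data to push back while the source is still busy sending requests. With an empty sink, those same responses are essentially empty, so they never put meaningful pressure on the buffers or the send_timeout.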
The simplest resolution to this problem will be to move the call in the `riak_repl_aae_sink` to `riak_kv_index_hashtree:exchange_segment/3`:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L142

Instead of making this call within the receive loop, the call should be made by the spawned process (sketched below). This will mean that:

- the `riak_repl_aae_sink` will be able to receive the segment request messages much faster;
- the `riak_repl_aae_source` can be switched much faster into receiving the responses.

This only requires that all the segment request messages can be received within 30s (there is actually some leeway, as it will require a number of response messages to fill the buffer). Even if all 1M segments are requested, this only requires a 10Mbps send rate.

The unprocessed messages on the `riak_repl_aae_sink` will now sit in the message queue of the spawned process, rather than in the network buffers. At < 100 bytes per message, this should be acceptable, even if the backlog is large.
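A minimal sketch of the shape of this change, for illustration only - the message formats, helper names and loop structure here are hypothetical and simplified, not the actual `riak_repl_aae_sink` code or the applied patch:

```erlang
-module(aae_sink_sketch).
-export([loop_before/3, loop_after/3, sender_loop/3]).

%% Sketch only: message formats and helper names are hypothetical, not the
%% real riak_repl_aae_sink API or the applied patch.

%% Current shape: the receive loop performs the hashtree read itself, so the
%% socket is only drained at the pace of the exchange_segment call (~1ms per
%% segment request).
loop_before(Socket, Tree, Sender) ->
    receive
        {tcp, Socket, Data} ->
            {segment_request, Segment} = binary_to_term(Data),
            KeysHashes = fetch_keys_hashes(Tree, Segment),
            Sender ! {send, {keys_hashes, Segment, KeysHashes}},
            ok = inet:setopts(Socket, [{active, once}]),
            loop_before(Socket, Tree, Sender)
    end.

%% Proposed shape: the receive loop only forwards the segment reference, so
%% it can drain the socket as fast as requests arrive; the (slow) hashtree
%% read moves into the spawned sender, and the backlog sits in that process's
%% message queue rather than in the TCP receive buffer.
loop_after(Socket, Tree, Sender) ->
    receive
        {tcp, Socket, Data} ->
            {segment_request, Segment} = binary_to_term(Data),
            Sender ! {fetch_and_send, Segment},
            ok = inet:setopts(Socket, [{active, once}]),
            loop_after(Socket, Tree, Sender)
    end.

%% The spawned sender: sends pre-built responses (before), or does the
%% hashtree fetch as well (after).
sender_loop(Transport, Socket, Tree) ->
    receive
        {send, Response} ->
            ok = Transport:send(Socket, term_to_binary(Response)),
            sender_loop(Transport, Socket, Tree);
        {fetch_and_send, Segment} ->
            KeysHashes = fetch_keys_hashes(Tree, Segment),
            ok = Transport:send(Socket, term_to_binary({keys_hashes, Segment, KeysHashes})),
            sender_loop(Transport, Socket, Tree)
    end.

%% Placeholder standing in for riak_kv_index_hashtree:exchange_segment/3.
fetch_keys_hashes(_Tree, _Segment) ->
    [].
```

The trade-off is that the backlog moves from the (bounded) OS receive buffer into the sender's (unbounded) Erlang message queue - which is why the < 100 bytes per segment-request message estimate above matters.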
This has now been replicated in a test environment, and the fix demonstrated.

- Loaded 600m objects into Cluster A (4 node).
- Loaded 80m different objects into Cluster B (4 node).
- On Cluster B, enabled full-sync B -> A: this fails after 37s with `{error, closed}` on Cluster B (source), and `{error, timeout}` in Cluster A (sink).
- After applying this branch to A only, and re-running: there is no failure, and keys begin to replicate.
When running full-sync between a source and a sink, problems can occur where both clusters have a large number of keys, and there is also a large delta between the clusters.
Scenario:

- Total keys: 20bn
- Ring size: 256
- Delta: 100m difference
If an attempt to repair a partition is made, of the 1M segments in the AAE tree, about 400K segments will be in a delta state.
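As a rough, hedged sanity check on that figure (my own back-of-envelope arithmetic, not from the issue; it assumes the 100m delta is spread evenly across the ring and uniformly at random across the 2^20 segments of each partition's tree):

```erlang
-module(dirty_segments).
-export([estimate/2]).

%% Back-of-envelope estimate of how many AAE segments in one partition's
%% hashtree are "dirty", given Delta differing keys and RingSize partitions.
%% Assumes deltas land uniformly at random over the 2^20 segments, so the
%% expected number of non-empty segments follows the occupancy formula
%% Segments * (1 - exp(-PerPartition/Segments)).
estimate(Delta, RingSize) ->
    Segments = 1 bsl 20,                  % ~1M segments per hashtree
    PerPartition = Delta / RingSize,      % differing keys per partition
    round(Segments * (1 - math:exp(-PerPartition / Segments))).

%% dirty_segments:estimate(100000000, 256) returns ~326K - the same order of
%% magnitude as the ~400K quoted above (higher if replica copies per
%% partition are counted).
```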
The aae_exchange will have to use this code to repair:

https://github.com/basho/riak_core/blob/riak_kv-3.0.10/src/hashtree.erl#L1139-L1153

The segments in a delta state must be sent to the sink using the Remote fun (L1139 in the linked code). In this case there are 400K segments to be sent.
The Remote fun is within the `riak_repl_aae_source` module:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L354-L378
For each of the 400K segments, it will call `async_get_segment/3`:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L420-L421
This will push the data to the `gen_tcp` socket to go over the wire to the peer on the sink node's cluster:
https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L739-L750
What should be noted here is that each of these segment request messages is small on the wire (36 bytes); however, this still leads to a lot of data (13MB) being sent very quickly, as there is no other work to pause this send process.
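(As a rough cross-check of my own, using the 36-byte figure above: 400K requests x 36 bytes is roughly 14MB, i.e. about 13MiB, which matches the figure quoted and can be written to the socket in a matter of seconds on any reasonable link.)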
At the `riak_repl_aae_sink` side, the TCP socket is set to `{active, once}`. That is to say, it will pull one message at a time off the wire, leaving any other messages in the buffer. Each segment request will then be received by this function:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L141-L143
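For readers less familiar with `{active, once}`, a minimal, generic illustration of the pattern (not the riak_repl code): the process receives exactly one `{tcp, Socket, Data}` message, and nothing more is delivered until it re-arms the socket with `inet:setopts/2`, so any further data queues in the kernel's receive buffer.

```erlang
-module(active_once_sketch).
-export([recv_loop/2]).

%% Generic {active, once} receive pattern (illustration only, not riak_repl).
recv_loop(Socket, Handler) ->
    receive
        {tcp, Socket, Data} ->
            %% Anything done here delays the next read: while Handler runs,
            %% further data stays in the OS receive buffer, and once that
            %% buffer is full the sender is told to stop (TCP ZeroWindow).
            Handler(Data),
            ok = inet:setopts(Socket, [{active, once}]),
            recv_loop(Socket, Handler);
        {tcp_closed, Socket} ->
            closed
    end.
```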
The `exchange_segment` function will need to fetch a list of Keys/Hashes under that segment. In this case this will amount to about 90-100 keys. This should be a quick operation of o(1) ms - although this is relatively slow when compared to the speed at which the segment requests are being sent over the wire from the `riak_repl_aae_source`. Once the list has been fetched, it will be cast to a spawned process to be sent back to the `riak_repl_aae_source`. Note that each list will be o(10)KB in size - much larger than the 36 byte request.

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L209-L213
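(As a rough cross-check, with the per-entry size being my own assumption rather than a figure from the issue: ~90-100 bucket/key/hash entries at something like 100 bytes each comes to roughly 10KB per response, hence the o(10)KB figure.)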
Once the list of Keys/Hashes has been sent to the spawned process, the next segment request can be processed by the `riak_repl_aae_sink`.

This process though has a significant flaw. It is hinted at in the comments, but not resolved by the use of a second process for sending:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L181-L201
As the `riak_repl_aae_source` is sending the segment requests faster than they are being processed, eventually, with a large number of segment requests, the sending will be paused at an OS level: the TCP receive buffer fills up on the sink, and the sink node reports TCP ZeroWindow, signalling the sending socket to halt until the receive buffer has space. From that point, at best, the sending process will in effect slow to the pace of processing by the `riak_repl_aae_sink` process - further segment requests may only be sent once there is space in the buffer, and space is only created by the `riak_repl_aae_sink` reading from that buffer.

Note though, at this stage, the `riak_repl_aae_sink` has begun to send large 10KB responses to the `riak_repl_aae_source` - and the `riak_repl_aae_source` is not receiving those messages. Only when all the segment requests are sent will the hashtree process begin to request from the Remote fun those `keys_hashes` which have been sent:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L367-L368
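A minimal, generic sketch of why the responses back up (this is the shape of the exchange as described above, with hypothetical framing - not the actual `hashtree.erl` or `riak_repl_aae_source` code): all segment requests are written first, and only afterwards are the much larger responses read, so while the send phase is running, the responses can only accumulate in the source side's receive buffer.

```erlang
-module(two_phase_sketch).
-export([exchange/2]).

%% Illustrative two-phase exchange over a passive, binary, {packet, 4} socket
%% (hypothetical framing - not the actual hashtree/riak_repl wire format).
exchange(Socket, Segments) ->
    %% Phase 1: push every small (tens of bytes) segment request without
    %% reading anything back.
    lists:foreach(
        fun(Segment) ->
            ok = gen_tcp:send(Socket, term_to_binary({segment_request, Segment}))
        end,
        Segments),
    %% Phase 2: only now start pulling the ~10KB responses; until this point
    %% they can only pile up in (and eventually overflow) the local TCP
    %% receive buffer.
    collect(Socket, length(Segments), []).

collect(_Socket, 0, Acc) ->
    lists:reverse(Acc);
collect(Socket, N, Acc) ->
    {ok, Packet} = gen_tcp:recv(Socket, 0),
    collect(Socket, N - 1, [binary_to_term(Packet) | Acc]).
```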
Until the `riak_repl_aae_source` has sent all the requests, any responses the `riak_repl_aae_sink` sender has sent will have to sit in the network buffer. These can be large responses, so soon that buffer will be full, and the source node will report back to the sink node TCP ZeroWindow.

After the TCP ZeroWindow has been sent, until all the segment requests have been sent by the `riak_repl_aae_source`, the sender on the sink side cannot successfully complete a `Transport:send(Socket, Msg)`. With 400K segments being sent, the network buffers full, and 1ms required to handle each message in the `riak_repl_aae_sink` - this could take 5 to 10 minutes. If an attempt to write to the socket is paused for 30s due to a lack of TCP Window space at the receiving node, without a Window update from the opposing side to indicate space, the connection is reset (assumed by the sink node to be broken). This inevitably happens. Both the `riak_repl_aae_source` and the `riak_repl_aae_sink` fail due to the TCP connection being closed.

No actual repairs have happened at this stage, as no Keys/Hashes have been read at the `riak_repl_aae_source`. So the delta is not reduced, and any attempts to re-run the sync also fail with the same issue, at the same point. Full-sync is permanently broken between those clusters unless the delta can be resolved through other means.
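To make the 30s figure concrete, here is a small, generic illustration of how a send timeout surfaces in Erlang (this is standard `gen_tcp`/`inet` behaviour, not the riak_repl connection code; the option values are assumptions chosen to mirror the 30s discussed above): if the peer cannot grant window space for 30s, the blocked `send/2` returns `{error, timeout}` on the writing side, and with `{send_timeout_close, true}` the socket is then closed, which the other side later observes as `{error, closed}` - matching the errors reported in the test above.

```erlang
-module(send_timeout_sketch).
-export([send_with_timeout/3]).

%% Generic gen_tcp send_timeout illustration (not the riak_repl connection
%% code; option values chosen to mirror the 30s figure discussed above).
send_with_timeout(Host, Port, Payload) ->
    {ok, Socket} = gen_tcp:connect(Host, Port,
                                   [binary,
                                    {active, false},
                                    {send_timeout, 30000},
                                    {send_timeout_close, true}]),
    case gen_tcp:send(Socket, Payload) of
        ok ->
            ok;
        {error, timeout} ->
            %% The peer's receive window stayed at zero for 30s; the socket
            %% is closed locally, and the peer will subsequently see the
            %% connection as closed.
            {error, timeout}
    end.
```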