basho / riak_repl

Riak DC Replication

riak_repl_aae full-sync deadlock #817

Open martinsumner opened 1 year ago

martinsumner commented 1 year ago

When running full-sync between a source and a sink cluster, problems can occur where both clusters have a large number of keys but there is also a large delta between them.

Scenario:

- Total keys: 20bn
- Ring size: 256
- Delta: 100m difference between the clusters

If an attempt to repair a partition is made, of the 1M segments in the AAE tree, about 400K segments will be in a delta state.

The aae_exchange will have to use this code to repair:

https://github.com/basho/riak_core/blob/riak_kv-3.0.10/src/hashtree.erl#L1139-L1153

These delta segments must be sent to the sink using the Remote fun (the call at L1139); in this case that means 400K segment requests.
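For orientation, the shape of that exchange (as described here, not the literal hashtree.erl code) is two-phased: every delta segment is first pushed out through the Remote fun, and only afterwards are the key/hash lists pulled back for comparison. A minimal sketch of that pattern, with illustrative names:

```erlang
%% Simplified paraphrase of the two-phase exchange described above; function
%% and message names are illustrative, not the real hashtree.erl code.
exchange_segments(Segments, Remote, LocalFetchFun) ->
    %% Phase 1: every delta segment is pushed to the remote side up front.
    %% With ~400K delta segments this completes long before any replies
    %% have been read back.
    _ = Remote(start_exchange_segments, Segments),
    %% Phase 2: only once all the requests have gone out are the key/hash
    %% lists pulled back, one segment at a time, and compared with the
    %% locally fetched key/hash lists.
    [{Segment, Remote(key_hashes, Segment), LocalFetchFun(Segment)}
     || Segment <- Segments].
```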

The Remote fun is within the riak_repl_aae_source module:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L354-L378

For each of the 400K segments, it will call async_get_segment/3:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L420-L421

This pushes the request onto the gen_tcp socket, to go over the wire to the peer riak_repl_aae_sink on the sink cluster:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L739-L750

What should be noted here is that each of these segment request messages is small on the wire (at 36 bytes); however, this still leads to a lot of data (around 13MB) being sent very quickly, as there is no other work to pause this send process.
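As a back-of-the-envelope check on those figures (this is just arithmetic on the numbers quoted above):

```erlang
%% 400K segment requests at ~36 bytes each, written in one burst:
1> 400000 * 36.
14400000
```

i.e. roughly 13-14MB queued onto the socket with nothing on the source side to pace the sends.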

At the riak_repl_aae_sink side, the TCP socket is set to {active, once}. That is to say, it will pull one message at a time off the wire, leaving any other messages in the buffer. Each segment request will then be received by this function:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L141-L143
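That function sits behind the usual {active, once} flow: handle one TCP message, then explicitly re-arm the socket before the next is delivered. A generic sketch of the pattern (illustration only, with a hypothetical handle_one_request/2 helper; this is not the riak_repl_aae_sink code itself):

```erlang
%% Generic {active, once} receive pattern (illustrative, not riak_repl code).
handle_info({tcp, Socket, Data}, State) ->
    ok = handle_one_request(Data, State),          % hypothetical helper
    ok = inet:setopts(Socket, [{active, once}]),   % re-arm for the next message
    {noreply, State};
handle_info({tcp_closed, _Socket}, State) ->
    {stop, normal, State}.
```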

The exchange_segment function will need to fetch a list of Keys/Hashes under that segment; in this case this will amount to about 90-100 keys. This should be a quick operation of o(1) ms - although this is relatively slow when compared to the speed at which the segment requests are being sent over the wire from the riak_repl_aae_source. Once the list has been fetched, it is cast to a spawned process to be sent back to the riak_repl_aae_source. Note that each list will be o(10)KB in size - much larger than the 36 byte request.

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L209-L213

Once the list of Keys/Hashes has been passed to the spawned process, the next segment request can be processed by the riak_repl_aae_sink.

This process, though, has a significant flaw. It is hinted at in the comments, but not resolved by the use of a second process for sending:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L181-L201
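Schematically, the current division of labour at the sink is: the main process does the hashtree lookup inline, and only the final send of the ~10KB reply is handed to a helper process. A hedged sketch of that shape (names invented for illustration):

```erlang
%% Illustrative sketch of the current sink-side split (not the real module):
%% the receive loop builds the reply itself, then hands it to a dedicated
%% sender process whose only job is to write to the socket.
start_sender(Transport, Socket) ->
    spawn_link(fun() -> sender_loop(Transport, Socket) end).

sender_loop(Transport, Socket) ->
    receive
        {send, Reply} ->
            %% blocks if the peer's TCP window is closed, and fails if it
            %% stays blocked beyond the socket's send timeout
            ok = Transport:send(Socket, Reply),
            sender_loop(Transport, Socket)
    end.
```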

As the riak_repl_aae_source is sending the segment requests faster than they are being processed, eventually, with a large number of segment requests, the sending will be paused at the OS level: the TCP receive buffer fills up on the sink, and the sink node reports TCP ZeroWindow, signalling the sending socket to halt until the receive buffer has space. From that point, at best the sending process will in effect slow to the pace of processing by the riak_repl_aae_sink process - further segment requests may only be sent once there is space in the buffer, and space is only created by the riak_repl_aae_sink reading from that buffer.

Note, though, that at this stage the riak_repl_aae_sink has begun to send large 10KB responses to the riak_repl_aae_source - and the riak_repl_aae_source is not receiving those messages. Only when all the segment requests have been sent will the hashtree process begin to request from the Remote fun those keys_hashes which have been sent:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_source.erl#L367-L368

Until the riak_repl_aae_source has sent all the requests, any responses the riak_repl_aae_sink sender has sent will have to sit in the network buffer. These can be large responses, so the buffer soon fills, and the source node reports TCP ZeroWindow back to the sink node.

After the TCP ZeroWindow has been sent, and until all the segment requests have been sent by the riak_repl_aae_source, the sender on the sink side cannot successfully complete a Transport:send(Socket, Msg). With 400K segments being sent, the network buffers full, and around 1ms required to handle each message in the riak_repl_aae_sink, this could take 5 to 10 minutes.

If an attempt to write to the socket is stalled for 30s due to a lack of TCP window space at the receiving node, without a window update from the opposing side to indicate space, the connection is reset (assumed by the sink node to be broken). This inevitably happens: both the riak_repl_aae_source and riak_repl_aae_sink fail due to the TCP connection being closed.
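The 30s figure corresponds to a socket send timeout. As a generic illustration of the mechanism (not a quote of riak_repl's actual option list), these are the gen_tcp/inet options that produce exactly this behaviour:

```erlang
%% With these options, a send that cannot make progress for 30s (e.g. the
%% peer keeps advertising a zero TCP window) returns {error, timeout} and,
%% because of send_timeout_close, the socket is then closed.
[binary,
 {active, once},
 {send_timeout, 30000},
 {send_timeout_close, true}].
```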

No actual repairs have happened at this stage, as no Keys/Hashes have been read at the riak_repl_aae_source. So the delta is not reduced, and any attempt to re-run the sync fails with the same issue, at the same point. Full-sync is permanently broken between those clusters unless the delta can be resolved through other means.

martinsumner commented 1 year ago

One thing of note: in this particular environment, it had recently been possible to seed an empty cluster using AAE full-sync. Why was this possible when the delta was 20bn, but not when the delta is 100m?

The reason is that when the sink is empty, the query to fetch the Keys/Hashes in response to each segment request returns nothing, and does so in << 1ms. The response is empty, and so takes almost no buffer space.

Likewise, if the delta is small but the key counts are large, there will be relatively few segments to send, and the process can complete before the TCP send_timeout (30s) is breached on the spawned process at the riak_repl_aae_sink.

This problem will only be seen if there are a large number of keys in the sink, and also a large delta with the source.

martinsumner commented 1 year ago

The simplest resolution to this problem will be to move the call in the riak_repl_aae_sink to riak_kv_index_hashtree:exchange_segment/3:

https://github.com/basho/riak_repl/blob/riak_kv-3.0.11/src/riak_repl_aae_sink.erl#L142

Instead of making this call within the receive loop, the call should be made by the spawned process (see the sketch after this list). This will mean that:

- It is only required that all the segment request messages can be received within 30s (there is actually some leeway, as it will take a number of response messages to fill the buffer). Even if all 1M segments are requested, at 36 bytes per request that is ~36MB, which only requires a send rate of roughly 10Mbps.
- The unprocessed messages on the riak_repl_aae_sink will now sit in the message queue of the spawned process, rather than in the network buffers. At < 100 bytes per message, this should be acceptable, even if the backlog is large.
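A rough sketch of the proposed re-ordering at the sink (function names, argument order and the wire format are assumptions for illustration; the substantive change is simply which process calls riak_kv_index_hashtree:exchange_segment/3):

```erlang
%% Receive loop: now cheap, so the OS receive buffer drains quickly and the
%% source can finish sending all its segment requests well inside 30s.
handle_segment_request(IndexN, Segment, SenderPid) ->
    SenderPid ! {exchange_segment, IndexN, Segment},
    ok.

%% Spawned sender: the hashtree lookup and the potentially blocking send of
%% the ~10KB reply both happen here; any backlog is this process's mailbox.
sender_loop(Transport, Socket, TreePid) ->
    receive
        {exchange_segment, IndexN, Segment} ->
            KeyHashes =
                riak_kv_index_hashtree:exchange_segment(IndexN, Segment, TreePid),
            %% wire format here is a placeholder
            ok = Transport:send(Socket, term_to_binary({Segment, KeyHashes})),
            sender_loop(Transport, Socket, TreePid)
    end.
```

The effect is that backpressure moves out of the TCP buffers (where it stalls both directions at once) and into a single Erlang mailbox holding small request terms.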

martinsumner commented 1 year ago

This has now been replicated in a test environment, and the fix demonstrated.

- Loaded 600m objects into Cluster A (4 nodes)
- Loaded 80m different objects into Cluster B (4 nodes)

On Cluster B, full-sync B -> A was enabled. This fails after 37s with {error, closed} on Cluster B (source), and {error, timeout} on Cluster A (sink).

After applying this branch to Cluster A only and re-running, there is no failure, and keys begin to replicate.