basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0
1.23k stars 392 forks source link

Handoff and TCP RECV timeout #994

Closed martinsumner closed 1 year ago

martinsumner commented 1 year ago

Summary of Confusion/Problem:

Handoffs on busy riak_kv clusters often fail, with the reported issue being TCP RECV timeouts.

The code of handoff includes references to support very old versions:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L129-L133

Also there is reference to a re-write to resolve confusion happening soon:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L198-L202

Handoffs timing out are disruptive, as the handoff must start from the beginning (in terms of the fold over the vnode), and re-send all of the data at the next attempt.

When leaving/joining this can be controlled through handoff concurrency, but this isn't perfect, and still can lead to continuous failures.

Hinted handoffs can be even more problematic, particular when receiving vnodes are already subject to high loads. There is very little tuning/testing that can be done to predict what a safe transfer limit might be. There is also a vicious circle that can form - as if hinted handoffs fail, the receiving cluster may be subject to increasing amounts of anti-entropy and read-repair load that will cause hinted handoffs to continue to fail.

This issue is exacerbated that the timeout's reported as TCP RECV timeouts are not necessarily low-level network timeouts, but are more likely as a result of the receiver failing to respond to a OSI L7 SYNC message. There is also a confusion of timeout settings:

Calling the function get_handoff_receive_timeout/0 returns the handoff_timeout, which defaults to the TCP_TIMEOUT:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L514-L515

There is also a different handoff_receive timeout on the sender, based on a receiver setting:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L293_L298

Where the default is to be a third of the receiver setting, if set, but by default it is not set, and does not default to a third of the default chosen when unset on the receiver side:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_receiver.erl#L78

How are these timeouts to be tuned to prevent timeouts? A lower handoff_receive_timeout on the sender side, will lead to more frequent SYNCs - and hence make a backlog leading to a timeout less likely; but have different consequences within the receiver.

Proposal:

There are some broad changes proposed:

The aim must be to make this changes without adding another generation of forward/backward compatibility checking. Avoid making any change that requires version checking.

Related:

There was some debate, some time ago, about the wisdom of performing read-repairs to fallbacks. There was intention at some stage to add a configuration option to read repair to primaries only. This is related to hinted handoff performance - as the volume of changes which are handed off during a hinted handoff is adversely impacted by read-repairs. i.e. is a node recovers with data in tact, it will receive in handoff not just the data missed since it went down, but all the data read since it went down.

There is some similarity between this issue and https://github.com/basho/riak_repl/issues/817. There may be factor here in changes to gen_tcp within OTP, especially with regards to timeout defaults. Using the {active, once} setting and allowing for large backlogs to form can lead to undexpected and hard-to-troubleshoot failures. It is better to avoid the consequences of backlogs buffering in TCP receive windows.

martinsumner commented 1 year ago

After an initial edit of the code, realised there already existed the helpful handoff_acksync_threshold:

https://github.com/basho/riak_core/blob/riak_kv-3.0.12/src/riak_core_handoff_sender.erl#L141

This appears to be more useful than setting the timer. If we want to ensure that the receiver is up to date with the sender, then setting this to a much lower value may be easier.

Using a handoff_acksync_threshold of 1 will cause a sync of every batch (as discussed in the issue), but the sync will hold up the fold producing the next batch. It would be better to wait until we're at the point of sending a new batch to test to see if the sync is working.

That way folds can continue in parallel to results being processed by the receiver, but never get ahead.

Whatever we do should support those that have already configured handoff_acksync_threshold to a lower value. That is to say, we may change the default value, but should still respect an existing value, and give expected behaviour (reducing the threshold increases the frequency of sync checks).

martinsumner commented 1 year ago

After further review, the meaning of handoff_receive_timeout is clearer. The receiver must distinguish between an exited handoff (where no update has been received) and a handoff but is ongoing but the fold is slow (perhaps may objects are being filtered).

To achieve this the riak_core_handoff_receiver has a handoff_receive_timeout and will exit should there be inactivity to this length since the last inbound message was processed. In order to prevent the receiver timeout from triggering, the sender must send a keepalive when the fold is ongoing but not otherwise generating messages.

This is will required in any change.

martinsumner commented 1 year ago

https://github.com/basho/riak_kv/pull/1844

martinsumner commented 1 year ago

https://github.com/basho/riak_core/pull/995

martinsumner commented 1 year ago

https://github.com/basho/riak_test/pull/1369