basho / riak_repl

Riak DC Replication

Failure to complete full-sync on hitting soft retry limit #799

Closed by martinsumner 4 years ago

martinsumner commented 5 years ago

Potentially related to https://github.com/basho/riak_repl/issues/772.

The test https://github.com/basho/riak_test/blob/develop-2.9/tests/repl_aae_fullsync_blocked.erl fails intermittently.

The failing part of the test uses an intercept to stop full-sync from working on some vnodes, then checks that the right number of vnode sync failures has been recorded on completion.

When the test fails, it fails because the full-sync is never considered complete. The difference between success and failure comes down to the order in which the full-sync tries the vnodes. If the last vnode to be sync'd syncs OK (because it is not one with an intercepted function), the test passes and the correct number of vnode failures is reported. If the last vnode to be sync'd is one of those that fails to sync, the full-sync is never recorded as complete, even though exactly the same work has completed or failed.

The cause appears to be that, on hitting the soft retry limit (https://github.com/basho/riak_repl/blob/24c6e8f408450cdf48b7259f9f0ef0778f94180d/src/riak_repl2_fscoordinator.erl#L561-L571), the function maybe_complete_fullsync/2 isn't called. By contrast, it is called on a hard failure (https://github.com/basho/riak_repl/blob/24c6e8f408450cdf48b7259f9f0ef0778f94180d/src/riak_repl2_fscoordinator.erl#L590) and on a success (https://github.com/basho/riak_repl/blob/24c6e8f408450cdf48b7259f9f0ef0778f94180d/src/riak_repl2_fscoordinator.erl#L447).
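
To illustrate the ordering dependence, here is a toy model in Python. It is not the riak_repl code: the function run_fullsync, its arguments, and the call_on_soft_limit flag are all hypothetical, and each vnode is attempted only once to keep things short. It only sketches why skipping the completion check on the soft-retry path matters when a blocked vnode happens to be last.

```python
def run_fullsync(order, blocked, call_on_soft_limit):
    """Toy coordinator: returns (complete, failures) after trying each vnode once.

    order: sequence of vnode ids in the order they are attempted (assumption).
    blocked: set of vnode ids whose sync is intercepted and always fails.
    call_on_soft_limit: hypothetical switch, True models running the
        completion check on the soft-retry-limit path as well.
    """
    remaining = set(order)
    failures = 0
    complete = False

    def maybe_complete():
        # Stand-in for maybe_complete_fullsync/2: the full-sync is done
        # once no vnodes are outstanding.
        nonlocal complete
        if not remaining:
            complete = True

    for vnode in order:
        remaining.discard(vnode)
        if vnode in blocked:
            # Soft retry limit hit: count the failure and give up on this vnode.
            failures += 1
            if call_on_soft_limit:
                maybe_complete()
        else:
            # Success path always runs the completion check.
            maybe_complete()
    return complete, failures


blocked = {2, 3}
print(run_fullsync([1, 2, 3, 4], blocked, False))  # (True, 2): last vnode syncs OK
print(run_fullsync([1, 4, 2, 3], blocked, False))  # (False, 2): last vnode is blocked
print(run_fullsync([1, 4, 2, 3], blocked, True))   # (True, 2): check also run on soft limit
```

In this model the failure count is the same in every case; only the completion flag depends on whether the final vnode goes through a path that runs the completion check.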

martinsumner commented 5 years ago

https://github.com/basho/riak_repl/pull/800