basho / riak_core

Distributed systems infrastructure used by Riak.

riak_core_claimant tick can occur after close #943

Closed: martinsumner closed this issue 4 years ago

martinsumner commented 4 years ago

When a node is leaving a cluster, the final trigger to exit is a call to riak_core_ring_manager:refresh_my_ring/0:

https://github.com/basho/riak_core/blob/9f7d0538e01e248dc4f5e0efff328737f23d8fb9/src/riak_core_ring_manager.erl#L377-L388

The refresh_my_ring call will be prompted from riak_core_ring_handler:ensure_vnodes_started/1:

https://github.com/basho/riak_core/blob/9f7d0538e01e248dc4f5e0efff328737f23d8fb9/src/riak_core_ring_handler.erl#L61-L99

This will in turn be prompted by the management tick in the riak_core_vnode_manager.

The refresh_my_ring logic generates a fresh ring and persists it, before prompting riak_core to close.

However, this close starts a structured shutdown of the application, and that shutdown may not be immediate. Before the shutdown is complete, the next tick event may occur on the riak_core_claimant:

https://github.com/basho/riak_core/blob/9f7d0538e01e248dc4f5e0efff328737f23d8fb9/src/riak_core_claimant.erl#L632-L644
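
For context, the claimant's tick is self-rescheduling, so nothing in the tick machinery itself prevents a tick firing mid-shutdown. A minimal sketch of that pattern (illustrative only; the interval and names are assumed, not the actual riak_core code):

```erlang
%% Minimal sketch of a self-rescheduling tick (illustrative; the
%% 10s interval and names are assumed, not the riak_core code).
-module(tick_loop_sketch).
-export([start/0]).

start() ->
    spawn(fun loop/0).

loop() ->
    erlang:send_after(10000, self(), tick),
    receive
        tick ->
            %% Tick work happens here. During a slow structured
            %% shutdown this loop keeps running, so a tick can still
            %% fire after refresh_my_ring/0 has installed a fresh ring.
            loop()
    end.
```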

This can create an issue for riak_ensemble. riak_ensemble depends on the claimant (a singleton within the cluster) prompting certain changes to the root ensemble.

https://github.com/basho/riak_core/blob/9f7d0538e01e248dc4f5e0efff328737f23d8fb9/src/riak_core_claimant.erl#L830-L863

If the singleton claimant is a node which has been set to leave, then we need that claimant node to prompt the removal changes to the root ensemble. However, if the scheduled tick required to make this happen occurs after the ring has been refreshed, this may lead to a bad situation. In this case the refreshed ring will not have the local node() removed; instead, the local node() will be the only member of the ring.

Now, when bootstrapping the root_ensemble as a result of the tick, the code will see all members of the root ensemble other than the local node as having been removed, and will spawn a process to prompt their removal.

https://github.com/basho/riak_core/blob/9f7d0538e01e248dc4f5e0efff328737f23d8fb9/src/riak_core_claimant.erl#L861

This is erroneous; in fact, the local node is the only one which should be removed.
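
To illustrate, the membership diff that goes wrong has roughly this shape (a sketch only; the module, function and variable names are assumed, not the actual claimant code):

```erlang
%% Sketch of the membership diff that goes wrong (illustrative; names
%% are assumed, not the actual riak_core_claimant code).
-module(diff_sketch).
-export([apparently_removed/2]).

%% After refresh_my_ring/0, riak_core_ring:all_members/1 on the fresh
%% ring returns [node()] only, so every other root ensemble member
%% looks as though it has left the cluster.
apparently_removed(FreshRing, EnsembleMembers) ->
    RingMembers = riak_core_ring:all_members(FreshRing),
    [M || M <- EnsembleMembers, not lists:member(M, RingMembers)].
```

Against the fresh single-member ring this returns every peer except node(), which is the inverse of the correct answer.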

This bug leads to an intermittent failure of the riak_test ensemble_remove_node (which can be made to fail fairly reliably by increasing timeouts). It fails because riak_ensemble_manager:cluster/0 has no members.

https://github.com/basho/riak_test/blob/8fa9c836a98871d0982f9286cab8d94445129ce5/tests/ensemble_remove_node.erl#L70

The precise implications of this are unclear, but it does not appear to heal naturally; the cluster state remains incorrect.

martinsumner commented 4 years ago

I think the resolution is to make a call to the riak_core_claimant during the handle_call of refresh_my_ring, prior to the ring refresh. This call should stop any future tick from being accepted, and should prompt maybe_bootstrap_root_ensemble/1.

This way we can be sure that maybe_bootstrap_root_ensemble/1 is called prior to the ring being refreshed, and never after.
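
A sketch of the shape this could take (hypothetical throughout: the pending_close API, the state field, the tick interval and the stub maybe_bootstrap_root_ensemble/1 are illustrative stand-ins, not the current riak_core_claimant code):

```erlang
%% Sketch of the proposed claimant behaviour (illustrative only; the
%% pending_close API and state field are hypothetical, and
%% maybe_bootstrap_root_ensemble/1 is a stand-in for the real logic).
-module(claimant_sketch).
-behaviour(gen_server).
-export([start_link/0, pending_close/0, is_pending_close/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {pending_close = false :: boolean()}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% To be called from refresh_my_ring's handle_call, before the ring is
%% refreshed: bootstrap the root ensemble one final time, then refuse
%% all further ticks.
pending_close() ->
    gen_server:call(?MODULE, pending_close, infinity).

is_pending_close() ->
    gen_server:call(?MODULE, is_pending_close).

init([]) ->
    schedule_tick(),
    {ok, #state{}}.

handle_call(pending_close, _From, State) ->
    State1 = maybe_bootstrap_root_ensemble(State),
    {reply, ok, State1#state{pending_close = true}};
handle_call(is_pending_close, _From, State) ->
    {reply, State#state.pending_close, State}.

handle_info(tick, State = #state{pending_close = true}) ->
    %% Close is pending: ignore the tick and stop rescheduling.
    {noreply, State};
handle_info(tick, State) ->
    State1 = maybe_bootstrap_root_ensemble(State),
    schedule_tick(),
    {noreply, State1};
handle_info(_Msg, State) ->
    {noreply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

schedule_tick() ->
    erlang:send_after(10000, self(), tick).

maybe_bootstrap_root_ensemble(State) ->
    %% Stand-in for the real riak_core_claimant logic.
    State.
```

riak_core_ring_manager:refresh_my_ring/0 would then call pending_close() in its handle_call, before installing the fresh ring, which gives the ordering described above.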

martinsumner commented 4 years ago

It may be useful in a couple of other places to have riak_core_claimant go into a pending_close state which ignores ticks and responds to an is_pending_close/0 call.

Currently riak_repl's riak_repl_ring_handler may crash on leave -> exit when it reacts to the ring change to a flushed ring (and hits a set_env timeout, perhaps due to the shutdown). There are also issues during slow shutdowns, with riak_core starting all the vnodes required by the flushed ring whilst it is waiting for the shutdown to complete.

martinsumner commented 4 years ago

Perhaps a closing ring should have some other metadata (similar to setting a tainted ring), and that metadata be checked by processes reacting to a ring change. That might be better than discovering the pending close via riak_core_claimant.
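
A sketch of what that could look like, assuming riak_core_ring's update_meta/3 and get_meta/2 helpers (the cluster_closing key is a hypothetical name; the actual branch may take a different approach):

```erlang
%% Sketch of marking a ring as closing via ring metadata (illustrative;
%% the cluster_closing key is a hypothetical name).
-module(closing_ring_sketch).
-export([mark_closing/1, is_closing/1]).

%% Stamp the ring before refreshing it, in the same spirit as the
%% tainted-ring marker.
mark_closing(Ring) ->
    riak_core_ring:update_meta(cluster_closing, true, Ring).

%% Ring event handlers (e.g. riak_repl_ring_handler) can check the
%% marker and skip reacting to a flushed ring during shutdown.
is_closing(Ring) ->
    case riak_core_ring:get_meta(cluster_closing, Ring) of
        {ok, true} -> true;
        _ -> false
    end.
```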

martinsumner commented 4 years ago

There is now a branch, develop-3.0-lastgaspring, that passes the test: https://github.com/basho/riak_core/tree/develop-3.0-lastgaspring.

There are also related branches under the same name for riak_kv and riak_repl that address messy shutdown-on-exit issues using the new ring metadata made available in riak_core.

martinsumner commented 4 years ago

https://github.com/basho/riak_core/pull/945