basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Node failure - whether or not to "let it crash" #1714

Closed martinsumner closed 4 years ago

martinsumner commented 5 years ago

In order to correctly recover from a failure, riak_core_node_watcher must detect that the service is down on a given node. Without such an event, fallback nodes will not be started, and the vnodes used in requests will not change to reflect those fallback vnodes.

My understanding is that there are two triggers for discovering a failed node in riak_core_node_watcher:

1. a node_down event, received when the failed node's Erlang VM stops or becomes unreachable;
2. a health check registered for the service, run periodically by the node_watcher.

The supervision tree for riak_kv will normally attempt to restart individual services that fail, but if they continue to fail with sufficient intensity the failure will ripple up the supervision tree and bring the node down. This will trigger a node_down event for the riak_core_node_watcher on other nodes to detect.
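
For reference, this escalation comes from standard OTP supervisor restart intensity. A minimal sketch (hypothetical module and child, illustrative intensity/period values - not the real riak_kv supervisor spec):

```erlang
-module(example_sup).                 %% hypothetical module
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% If children fail more than 10 times within 10 seconds, this
    %% supervisor gives up and exits; the failure then ripples up to
    %% its parent, and at the top of the tree it takes the whole
    %% application (and ultimately the node) down.
    SupFlags = #{strategy => one_for_one, intensity => 10, period => 10},
    Child = #{id => example_worker,                    %% hypothetical child
              start => {example_worker, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```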

A long time ago, the use of health checks (the second method) was disabled by default in riak_kv: https://github.com/basho/riak_kv/commit/5860d68a8ac7eed3eb33e9346d90b946864a488f. The actual health check used appears to overlap with overload handling, and it is easy to see how it could cause nodes to flap when under load.
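
For context, a node_watcher health check is just a callback that riak_core runs periodically against the registered service, marking the service down if the check keeps failing. Below is a rough sketch of the kind of check that would have caught the case described later in this issue - a hypothetical function, not the check riak_kv actually shipped, and the registration step (the health-check variant of riak_core_node_watcher:service_up) is elided:

```erlang
%% Hypothetical health check: succeed only if a scratch file can be
%% written under the platform data dir. This is NOT the check riak_kv
%% actually used; it is here only to illustrate the mechanism.
-spec data_dir_writable(pid()) -> boolean().
data_dir_writable(_ServicePid) ->
    DataDir = app_helper:get_env(riak_core, platform_data_dir, "./data"),
    Probe = filename:join(DataDir, "health_check_probe.tmp"),
    case file:write_file(Probe, <<"ok">>) of
        ok ->
            _ = file:delete(Probe),
            true;
        {error, _Reason} ->           %% e.g. erofs on a read-only filesystem
            false
    end.
```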

This then leads to the question: if a node is still "up" but unable to do its job, should it be left running and returning errors, or should it be allowed to crash so that other nodes can take over?

There appears to have been a case of this with a Riak customer, where the file system was switched into read-only mode on at least one node. The vnode backend was eleveldb, and the impacted vnodes were responding to PUTs with a db_write error - but this was returned as an error; it did not cause a crash of either the backend or the vnode.
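
For reference, the backend behaviour's put returns either {ok, State} or {error, Reason, State}, and the vnode just passes the error back in its reply. A condensed paraphrase of how the eleveldb backend behaves today (index handling elided; see riak_kv_eleveldb_backend for the real code):

```erlang
%% Condensed paraphrase of riak_kv_eleveldb_backend:put/5 as it stands:
%% a failed eleveldb write becomes an {error, ...} return, not a crash,
%% so neither the backend nor the vnode process terminates.
put(Bucket, Key, _IndexSpecs, Val, #state{ref=Ref, write_opts=WriteOpts}=State) ->
    Updates = [{put, to_object_key(Bucket, Key), Val}],   %% 2i updates elided
    case eleveldb:write(Ref, Updates, WriteOpts) of
        ok ->
            {ok, State};
        {error, Reason} ->
            %% e.g. a db_write error when the filesystem has gone read-only
            {error, Reason, State}
    end.
```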

As the node was still "up" (there had been no crash), the vnodes on the broken node were still being selected to coordinate PUTs. Any PUT coordinated by a vnode on the failed node would then fail: a failure on the coordinator is immediately returned to the application as a failure, with no attempt to try the other n - 1 vnodes. So the application saw an ongoing series of intermittent failures.
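
From the application side this looks roughly as follows (sketch using the riakc Erlang client; host, port, bucket and error terms are all illustrative):

```erlang
%% Illustrative client view: roughly 1 in N puts is coordinated by a
%% vnode on the broken node and fails outright; the rest succeed -
%% hence an ongoing stream of intermittent failures.
{ok, Pid} = riakc_pb_socket:start_link("10.0.0.1", 8087),
Obj = riakc_obj:new(<<"bucket">>, <<"key">>, <<"value">>),
case riakc_pb_socket:put(Pid, Obj) of
    ok ->
        ok;
    {error, Reason} ->
        %% no retry against the other vnodes happens server-side; the
        %% coordinator's failure comes straight back to the client
        {put_failed, Reason}
end.
```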

Eventually some impacted vnodes ran out of their lease counters, and attempting to renew the lease crashed the vnode_status_manager, which couldn't write to the file system. The riak_kv_vnode monitors the vnode_status_manager, but responds by restarting it (which doesn't crash), then making an async request to lease another counter (which does crash - and prompts another exit message for the riak_kv_vnode to handle) - https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p5/src/riak_kv_vnode.erl#L2212-L2226. As this is manually monitored (i.e. not supervised with an intensity check), the vnode_status_manager entered a perpetual loop of crashes and restarts, without crashing the vnode.
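
To make the failure loop concrete, here is a stripped-down fragment showing the pattern (hypothetical gen_server-style code with made-up module and function names - the real code is at the link above):

```erlang
%% Hypothetical, stripped-down version of the monitor-and-restart
%% pattern. There is no restart intensity here: if the freshly started
%% status manager crashes again (because the filesystem is still
%% read-only), another 'DOWN' message arrives and we go round again
%% forever, while the vnode itself stays alive and keeps coordinating.
handle_info({'DOWN', _MonRef, process, Pid, _Reason},
            State = #state{status_mgr_pid = Pid, idx = Idx}) ->
    {ok, NewPid} = status_mgr:start_link(Idx),        %% made-up module
    _MonRef2 = erlang:monitor(process, NewPid),
    ok = status_mgr:async_lease_counter(NewPid),      %% crashes again on EROFS
    {noreply, State#state{status_mgr_pid = NewPid}}.
```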

There is a similar issue with the hashtree process: it crashes, but the vnode detects the linked process going down and simply restarts it in a perpetual loop, without ever failing itself - https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p5/src/riak_kv_vnode.erl#L282-L286.

The net effect of all of this is the cluster not doing its job: one or more nodes became unusable, without other nodes taking any action to recover. But where does the fault lie here?

martinsumner commented 5 years ago

Replicated this in a riak_test now:

https://github.com/basho/riak_test/blob/mas-i1714-readonlyfs/tests/verify_readonly.erl

martinsumner commented 5 years ago

It should be noted from the test that non-eleveldb persisted backends behave as expected. Given that, the right place to resolve this is in the backend, not in the riak_kv_vnode or further up the stack.

Given that eleveldb is a component reused across other projects, it would be preferable to catch the db_write error in riak_kv_eleveldb_backend rather than changing eleveldb itself.
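
A minimal sketch of what that could look like, assuming the escalation is done in riak_kv_eleveldb_backend's put (the is_fatal_write_error/1 classifier is hypothetical, and logging is omitted):

```erlang
%% Sketch only: turn an unrecoverable write error into a crash of the
%% backend (and therefore the vnode), so that the failure escalates up
%% the supervision tree and other nodes can eventually start fallbacks.
put(Bucket, Key, _IndexSpecs, Val, #state{ref=Ref, write_opts=WriteOpts}=State) ->
    case eleveldb:write(Ref, [{put, to_object_key(Bucket, Key), Val}], WriteOpts) of
        ok ->
            {ok, State};
        {error, Reason} ->
            case is_fatal_write_error(Reason) of      %% hypothetical classifier
                true ->
                    %% e.g. db_write on a read-only filesystem
                    exit({backend_write_failure, Reason});
                false ->
                    {error, Reason, State}
            end
    end.
```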

martinsumner commented 5 years ago

Actually - a correction to the above: the behaviour of bitcask wasn't consistent in the test. Bitcask too will keep running despite a file system failure, repeatedly returning an error rather than crashing so as to cause the node to fail.