basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Node failure - whether or not to "let it crash" #1714

Closed martinsumner closed 4 years ago

martinsumner commented 5 years ago

In order to correctly recover from a failure, riak_core_node_watcher must detect that the service is down on a given node. Without such an event, fallback nodes will not be started, and the vnodes used in requests will not change to reflect those fallback vnodes.

My understanding is that there are two triggers for discovering a failed node in riak_core_node_watcher:

1. a node_down event, received when the failed node's Erlang VM stops or becomes unreachable;
2. a health check registered for the service, run periodically by the node_watcher.

The supervision tree for riak_kv will normally attempt to restart individual services that fail, but if they continue to fail with sufficient intensity the failure will ripple up the supervision tree and bring the node down. This will trigger a node_down event for the riak_core_node_watcher on other nodes to detect.
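
For reference, this escalation comes from standard OTP supervisor restart intensity. A minimal sketch (hypothetical module and child, illustrative intensity/period values - not the real riak_kv supervisor spec):

```erlang
-module(example_sup).                 %% hypothetical module
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% If children fail more than 10 times within 10 seconds, this
    %% supervisor gives up and exits; the failure then ripples up to
    %% its parent, and at the top of the tree it takes the whole
    %% application (and ultimately the node) down.
    SupFlags = #{strategy => one_for_one, intensity => 10, period => 10},
    Child = #{id => example_worker,                    %% hypothetical child
              start => {example_worker, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```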

A long time ago, the use of health checks (the second method) was disabled by default in riak_kv: https://github.com/basho/riak_kv/commit/5860d68a8ac7eed3eb33e9346d90b946864a488f. The actual health check used appears to overlap with overload handling, and it is easy to see how it could cause nodes to flap when under load.
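
For context, a node_watcher health check is just a callback that riak_core runs periodically against the registered service, marking the service down if the check keeps failing. Below is a rough sketch of the kind of check that would have caught the case described later in this issue - a hypothetical function, not the check riak_kv actually shipped, and the registration step (the health-check variant of riak_core_node_watcher:service_up) is elided:

```erlang
%% Hypothetical health check: succeed only if a scratch file can be
%% written under the platform data dir. This is NOT the check riak_kv
%% actually used; it is here only to illustrate the mechanism.
-spec data_dir_writable(pid()) -> boolean().
data_dir_writable(_ServicePid) ->
    DataDir = app_helper:get_env(riak_core, platform_data_dir, "./data"),
    Probe = filename:join(DataDir, "health_check_probe.tmp"),
    case file:write_file(Probe, <<"ok">>) of
        ok ->
            _ = file:delete(Probe),
            true;
        {error, _Reason} ->           %% e.g. erofs on a read-only filesystem
            false
    end.
```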

This then leads to the question: if a node is still "up" but unable to do its job, should it be left running and returning errors, or should it be allowed to crash so that other nodes can take over?

There appears to have been a case of this with a Riak customer, where the file system was switched into read-only mode on at least one node. The vnode backend was eleveldb, and the impacted vnodes were responding to PUTs with a db_write error - but this was returned as an error; it did not cause a crash of either the backend or the vnode.
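
For reference, the backend behaviour's put returns either {ok, State} or {error, Reason, State}, and the vnode just passes the error back in its reply. A condensed paraphrase of how the eleveldb backend behaves today (index handling elided; see riak_kv_eleveldb_backend for the real code):

```erlang
%% Condensed paraphrase of riak_kv_eleveldb_backend:put/5 as it stands:
%% a failed eleveldb write becomes an {error, ...} return, not a crash,
%% so neither the backend nor the vnode process terminates.
put(Bucket, Key, _IndexSpecs, Val, #state{ref=Ref, write_opts=WriteOpts}=State) ->
    Updates = [{put, to_object_key(Bucket, Key), Val}],   %% 2i updates elided
    case eleveldb:write(Ref, Updates, WriteOpts) of
        ok ->
            {ok, State};
        {error, Reason} ->
            %% e.g. a db_write error when the filesystem has gone read-only
            {error, Reason, State}
    end.
```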

As the node was still "up" (there had been no crash), the vnodes on the broken node were still being selected to coordinate PUTs. Any PUT coordinated by a vnode on the failed node would then fail: a failure on the coordinator is immediately returned to the application as a failure, with no attempt to try the other n - 1 vnodes. So the application saw an ongoing series of intermittent failures.
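
From the application side this looks roughly as follows (sketch using the riakc Erlang client; host, port, bucket and error terms are all illustrative):

```erlang
%% Illustrative client view: roughly 1 in N puts is coordinated by a
%% vnode on the broken node and fails outright; the rest succeed -
%% hence an ongoing stream of intermittent failures.
{ok, Pid} = riakc_pb_socket:start_link("10.0.0.1", 8087),
Obj = riakc_obj:new(<<"bucket">>, <<"key">>, <<"value">>),
case riakc_pb_socket:put(Pid, Obj) of
    ok ->
        ok;
    {error, Reason} ->
        %% no retry against the other vnodes happens server-side; the
        %% coordinator's failure comes straight back to the client
        {put_failed, Reason}
end.
```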

Eventually some impacted vnodes ran out of their lease counters, and attempting to renew the lease crashed the vnode_status_manager, which couldn't write to the file system. The riak_kv_vnode monitors the vnode_status_manager, but responds by restarting it (which doesn't crash), then making an async request to lease another counter (which does crash - and prompts another exit message for the riak_kv_vnode to handle) - https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p5/src/riak_kv_vnode.erl#L2212-L2226. As this is manually monitored (i.e. not supervised with an intensity check), the vnode_status_manager entered a perpetual loop of crashes and restarts, without crashing the vnode.
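
To make the failure loop concrete, here is a stripped-down fragment showing the pattern (hypothetical gen_server-style code with made-up module and function names - the real code is at the link above):

```erlang
%% Hypothetical, stripped-down version of the monitor-and-restart
%% pattern. There is no restart intensity here: if the freshly started
%% status manager crashes again (because the filesystem is still
%% read-only), another 'DOWN' message arrives and we go round again
%% forever, while the vnode itself stays alive and keeps coordinating.
handle_info({'DOWN', _MonRef, process, Pid, _Reason},
            State = #state{status_mgr_pid = Pid, idx = Idx}) ->
    {ok, NewPid} = status_mgr:start_link(Idx),        %% made-up module
    _MonRef2 = erlang:monitor(process, NewPid),
    ok = status_mgr:async_lease_counter(NewPid),      %% crashes again on EROFS
    {noreply, State#state{status_mgr_pid = NewPid}}.
```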

There is a similar issue with the hashtree process: it crashes, but the vnode detects the linked process going down and simply restarts it in a perpetual loop, without ever failing itself - https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p5/src/riak_kv_vnode.erl#L282-L286.

The net effect of all of this is the cluster not doing its job: one or more nodes became unusable, without other nodes taking any action to recover. But where does the fault lie here?

martinsumner commented 5 years ago

Replicated this in a riak_test now:

https://github.com/basho/riak_test/blob/mas-i1714-readonlyfs/tests/verify_readonly.erl

martinsumner commented 5 years ago

It should be noted from the test that non-eleveldb persisted backends behave as expected. Given that, the right place to resolve this is in the backend, not in the riak_kv_vnode or further up the stack.

Given that eleveldb is a component reused across other projects, it would be preferable to catch the db_write error in riak_kv_eleveldb_backend rather than changing eleveldb itself.
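
A minimal sketch of what that could look like, assuming the escalation is done in riak_kv_eleveldb_backend's put (the is_fatal_write_error/1 classifier is hypothetical, and logging is omitted):

```erlang
%% Sketch only: turn an unrecoverable write error into a crash of the
%% backend (and therefore the vnode), so that the failure escalates up
%% the supervision tree and other nodes can eventually start fallbacks.
put(Bucket, Key, _IndexSpecs, Val, #state{ref=Ref, write_opts=WriteOpts}=State) ->
    case eleveldb:write(Ref, [{put, to_object_key(Bucket, Key), Val}], WriteOpts) of
        ok ->
            {ok, State};
        {error, Reason} ->
            case is_fatal_write_error(Reason) of      %% hypothetical classifier
                true ->
                    %% e.g. db_write on a read-only filesystem
                    exit({backend_write_failure, Reason});
                false ->
                    {error, Reason, State}
            end
    end.
```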

martinsumner commented 5 years ago

Actually - a correction to the above: the behaviour of bitcask wasn't consistent in the test. Bitcask too will keep running despite a file system failure, repeatedly returning an error rather than crashing so as to cause the node to fail.