Closed: martinsumner closed this issue 4 years ago
Replicated this in a riak_test now:
https://github.com/basho/riak_test/blob/mas-i1714-readonlyfs/tests/verify_readonly.erl
It should be noted from the test that non-eleveldb persisted backends behave as expected. Given that, the right place to resolve this is in the backend, not in `riak_kv_vnode` or above in the stack.
Given that eleveldb is a reused component across other projects, it would be preferable to catch the db_write error in `riak_kv_eleveldb_backend`.
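A minimal sketch of what catching the error in the backend might look like. This is hypothetical, not the real `riak_kv_eleveldb_backend` code: the module name and signature are invented, `WriteFun` stands in for `eleveldb:write/3` (which the real backend calls with its own key encoding and write options), and the key encoding here is deliberately simplified.

```erlang
-module(backend_write_guard).
-export([put/5]).

%% Hypothetical sketch: WriteFun stands in for eleveldb:write/3
%% (pass fun eleveldb:write/3 in a real backend); the key encoding
%% is simplified relative to riak_kv_eleveldb_backend.
put(WriteFun, DbRef, Bucket, Key, Value) ->
    ObjKey = term_to_binary({Bucket, Key}),
    case WriteFun(DbRef, [{put, ObjKey, Value}], []) of
        ok ->
            ok;
        {error, Reason} ->
            %% Instead of returning {error, Reason} to the vnode
            %% (the current behaviour), exit: the crash ripples up the
            %% supervision tree, and sustained failure intensity will
            %% eventually take the riak_kv service down on this node,
            %% allowing fallback vnodes to be promoted.
            exit({backend_write_failure, Reason})
    end.
```

The trade-off is that a transient write error would also crash the vnode, so in practice the match on `Reason` would probably need to distinguish fatal conditions (read-only file system, disk full) from recoverable ones.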
Actually, a correction to the above: the behaviour of bitcask wasn't consistent in test, and bitcask too will keep running despite a file system failure, repeatedly returning an error rather than crashing and causing the node to fail.
In order to correctly recover from a failure, `riak_core_node_watcher` must detect that the service is down on a given node. Without such an event, fallback vnodes will not be started, and the vnodes used in requests will not change to reflect those fallbacks.

My understanding is that there are two triggers for discovering a failed node in `riak_core_node_watcher`:

1. The supervision tree for riak_kv will normally attempt to restart individual services that fail, but if they continue to fail with sufficient intensity the failure will ripple up the supervision tree and bring the node down. This triggers a `node_down` message for the `riak_core_node_watcher` on other nodes to detect.
2. Health checks run by the node watcher.

As of a long time ago, the use of health checks (the second method) was disabled by default in riak_kv: https://github.com/basho/riak_kv/commit/5860d68a8ac7eed3eb33e9346d90b946864a488f. The actual health check used seems to overlap with overload handling, and it is easy to see how it could have had a negative impact, with nodes flapping under load.
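The supervision-tree escalation described above relies on plain OTP restart intensity. A minimal sketch, with a made-up module name, child, and limits (not riak_kv's actual supervision settings):

```erlang
-module(kv_service_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Hypothetical supervisor: if the child crashes more than 10 times
%% within 10 seconds, the supervisor itself exits. Repeated at each
%% level of the tree, this is what lets a persistent fault ripple up
%% until the node dies and peers see a node_down event.
init([]) ->
    SupFlags = #{strategy => one_for_one,
                 intensity => 10,   % max restarts ...
                 period => 10},     % ... per 10 seconds
    Child = #{id => kv_service,
              start => {kv_service, start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.
```

The failure modes in this issue are precisely the ones that never trip this mechanism: the faulty processes either keep running while returning errors, or are restarted outside supervision, so the intensity counters never accumulate.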
This then leads to the question of where the fault lies in the following scenario.
There appears to have been a case of this with a Riak customer, where the file system was switched into read-only mode on at least one node. The vnode backend was eleveldb, and the impacted vnodes were responding to PUTs with a db_write error; this was returned as an error, and did not cause a crash of either the backend or the vnode.
As the node was still "up" (there had been no crash), the vnodes on the broken node were still being selected to coordinate PUTs. Any PUT coordinated by a vnode on the failed node would then fail: a failure on a coordinator is immediately returned to the application as a failure, with no attempt to try the other n - 1 vnodes. So the application saw an ongoing series of intermittent failures.
Eventually some impacted vnodes ran out of their lease counters, and attempting to renew leases crashed the `vnode_status_manager`, which couldn't write to the file system. The `riak_kv_vnode` monitors the `vnode_status_manager`, but responds by restarting it (which doesn't crash), then making an async request to lease another counter (which does crash, prompting another exit message for the `riak_kv_vnode` to handle) - https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p5/src/riak_kv_vnode.erl#L2212-L2226. As this is manually monitored (i.e. not supervised with an intensity check), the `vnode_status_manager` entered a perpetual loop of crashes and restarts, without ever crashing the vnode.

There is a similar issue with the hashtree process, which crashed, but entered a perpetual loop of restarting on detection of the linked process going down, without ever failing the vnode - https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p5/src/riak_kv_vnode.erl#L282-L286.
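One way to give these manually monitored processes an intensity check of their own would be to track restart timestamps in a sliding window, and have the vnode exit once a limit is exceeded. A hedged sketch (hypothetical helper, not existing riak_kv code):

```erlang
-module(restart_intensity).
-export([new/2, note_restart/1]).

%% Hypothetical helper: track restart timestamps of a manually
%% monitored process (e.g. the vnode_status_manager) and decide
%% whether the owning vnode should itself crash.
new(MaxRestarts, PeriodMs) ->
    {MaxRestarts, PeriodMs, []}.

note_restart({Max, Period, Times0}) ->
    Now = erlang:monotonic_time(millisecond),
    %% Keep only restarts that fall inside the sliding window.
    Times = [Now | [T || T <- Times0, Now - T =< Period]],
    case length(Times) > Max of
        true ->
            %% Caller (the vnode) should exit at this point, so that
            %% real supervision can escalate instead of the process
            %% looping forever.
            {exceeded, {Max, Period, Times}};
        false ->
            {ok, {Max, Period, Times}}
    end.
```

A vnode holding one such state per monitored process would call `note_restart/1` from its `'DOWN'`/`'EXIT'` handler before restarting the child, and exit on `exceeded`, letting the normal supervision tree take over.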
The net effect of all of this is the cluster not doing its job: one or more nodes became unusable, without other nodes taking action to recover. But where is the fault here:
- Should eleveldb have crashed on the db_write error, rather than returning `{error, Reason}`?
- Should `riak_kv_eleveldb_backend` detect this as a bad case and crash?
- Should the `riak_kv_vnode` crash on receiving an error from a co-ordinated PUT attempt?
- Should the PUT_FSM select another coordinator if the original one returns an error?
- Should there be some pseudo intensity monitoring on restarts of the hashtree and `vnode_status_manager` processes, so that repeated restarts crash the vnode?
- Should a better riak_kv health check be used (one which performs a more genuine check of vnode health), rather than simply disabling the health checks?
- Is my understanding incomplete? Are there other mechanisms for detecting node failure (other than outright crash or the disabled health check) that should have prompted recovery from this failure scenario?