Currently, both etcd/raft and CRDB itself will tend to exit with a fatal error if Raft data has been lost or corrupted. This is often seen when running without fsync, either at the OS level or via kv.raft_log.synchronization.unsafe.disabled. The typical failure is Raft fataling with "state.commit 267 is out of range [523, 728]" or CRDB fatal application failures (#75944), but there are many other fatal errors in both Raft and CRDB, as well as other failure modes such as append loops (#113053). In principle, the number of failure modes here is unbounded, since there is an unbounded number of ways data can become lost or corrupted.
We should try to handle these failures more gracefully. Ideally, we would isolate the failure to a single replica instead of failing the entire node, cordoning the replica off and having the allocator upreplicate elsewhere (assuming the entire range isn't faulty). This is a generalization of the planned handling of application failures (#75944).
This is a prerequisite for allowing users to safely disable fsync (#88442) -- at the very least, we need to detect when a node has lost data and provide a user-friendly error message.
Jira issue: CRDB-32773