cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: Raft processing should handle data loss/corruption gracefully #113147

Open erikgrinaker opened 11 months ago

erikgrinaker commented 11 months ago

Currently, both etcd/raft and CRDB itself tend to exit with a fatal error if Raft data has been lost or corrupted. This is often seen when running without fsync, either at the OS level or via kv.raft_log.synchronization.unsafe.disabled. Typical failures are Raft fataling with "state.commit 267 is out of range [523, 728]" or fatal application failures in CRDB (#75944), but there are many other fatal errors in both Raft and CRDB, as well as other failure modes such as append loops (#113053). In principle, the number of failure modes is unbounded, since there is an unbounded number of ways data can become lost or corrupted.

We should try to handle these failures more gracefully. Ideally, we would isolate the failure to a single replica instead of failing the entire node, cordoning off that replica and having the allocator upreplicate elsewhere (assuming the entire range isn't faulty). This generalizes the planned handling of application failures (#75944).
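
The isolation idea could be sketched roughly as follows. This is a toy model, not CRDB code: `Store`, `Process`, and `IsQuarantined` are hypothetical names, and a real implementation would also have to discard the replica's in-memory state and notify the allocator, neither of which is modeled here.

```go
package main

import (
	"fmt"
	"sync"
)

// RangeID identifies a range; the type is a hypothetical stand-in.
type RangeID int64

// Store is a hypothetical node-level container that tracks which replicas
// have been cordoned off because their Raft processing failed.
type Store struct {
	mu          sync.Mutex
	quarantined map[RangeID]error
}

func NewStore() *Store {
	return &Store{quarantined: make(map[RangeID]error)}
}

// Process runs one processing step for a replica. An error or panic
// quarantines only that replica; other replicas on the node keep running.
func (s *Store) Process(id RangeID, step func() error) (err error) {
	s.mu.Lock()
	prev, bad := s.quarantined[id]
	s.mu.Unlock()
	if bad {
		return fmt.Errorf("r%d is quarantined: %w", id, prev)
	}
	defer func() {
		// Convert a panic into an error so that it stays local to the replica.
		if r := recover(); r != nil {
			err = fmt.Errorf("r%d panicked: %v", id, r)
		}
		if err != nil {
			s.mu.Lock()
			s.quarantined[id] = err
			s.mu.Unlock()
		}
	}()
	return step()
}

// IsQuarantined reports whether a replica has been cordoned off.
func (s *Store) IsQuarantined(id RangeID) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.quarantined[id]
	return ok
}

func main() {
	s := NewStore()
	_ = s.Process(1, func() error { panic("state.commit 267 is out of range [523, 728]") })
	err := s.Process(2, func() error { return nil })
	fmt.Println(s.IsQuarantined(1), s.IsQuarantined(2), err)
}
```

Recovering from a panic is only reasonable if nothing else trusts the failed replica's state afterwards, so a design along these lines would probably favor returning errors from the Raft code paths over catching panics.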

This is a prerequisite for allowing users to safely disable fsync (#88442): at the very least, we need to detect when a node has lost data and provide a user-friendly error message.

Jira issue: CRDB-32773

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/replication