cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage,kv: tolerate corruption of sideloaded sstables #91029

Open jbowens opened 1 year ago

jbowens commented 1 year ago

If an AddSSTable's sstable is sideloaded and becomes corrupted (e.g., due to a bad disk), the operator has no recourse other than to replace the node.

This issue is intended to track isolation of corruption of the Raft log / sideloaded sstables, in contrast to #67568, which tracks recovery from corruption of already-applied state.

See #90834 for an example.

Jira issue: CRDB-21080

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

erikgrinaker commented 1 year ago

This is related to #75903, in that any failure to apply a Raft command will crash the node anyway. The proposed solution there is to cordon the replica (and then discard the faulty replica and upreplicate elsewhere, unless all replicas are faulty), which is likely the preferable approach here as well. That said, if the SST is corrupt then the disks are likely faulty, so we may not want to keep the node running and risk further corruption.

github-actions[bot] commented 5 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!