kvserver: verify roachpb.Value checksums when building snapshots #110572

Closed: erikgrinaker closed this issue 6 months ago

erikgrinaker commented 1 year ago

With storage corruption, it's possible for the corruption to spread via Raft snapshots before the consistency checker detects it. We saw this possibly happen in https://github.com/cockroachlabs/support/issues/2583. To avoid this, we should check roachpb.Value checksums when sending snapshots.
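
For context on the mechanism: roachpb.Value embeds a checksum over the key and value bytes in its header and exposes a Verify method. A minimal sketch of the intended check, assuming we have the decoded user key and the raw value bytes (the exact snapshot code path is discussed further down in this thread):

```go
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// checkRawValue wraps raw value bytes in a roachpb.Value and verifies the
// checksum embedded in the value header against the key and value bytes.
// A mismatch indicates corruption of this KV pair.
func checkRawValue(key roachpb.Key, rawBytes []byte) error {
	v := roachpb.Value{RawBytes: rawBytes}
	return v.Verify(key)
}
```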

Related to #93519.

Jira issue: CRDB-31510

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

sumeerbhola commented 1 year ago

This is a good step, but it will be insufficient once we are sharing disaggregated sstables, which we don't read as part of sending the snapshot (and don't wish to, for the sake of fast rebalancing/up-replication, etc.).

If we can rely on the filesystem not corrupting anything after the file is written (which I believe we can for object storage), one solution is for each compaction/flush to verify the sstable after writing it, before declaring the compaction successful. That is, read back the sstable and check all the checksums, both block-level and per-value. Perhaps for a local filesystem like ext4 with blocks stored in EBS/PD we can make the same assumption (internal notes suggest some equivalency between EBS and RAID 6: https://docs.google.com/document/d/1RcZgfj2ptup6gdwyLyVuwDIl9h1SdSY8ZYaC8PQdjnc/edit#bookmark=id.hx6j69aeem62).
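
A rough sketch of that shape, with a hypothetical iterator interface standing in for the sstable reader (these names are illustrative, not Pebble's actual API); the compaction would only install the output file if this pass returns nil:

```go
// sstIter is a stand-in for an sstable iterator whose Value call surfaces
// block-level or per-value checksum failures as errors.
type sstIter interface {
	Next() bool
	Value() ([]byte, error)
	Close() error
}

// verifyWrittenSST re-reads every KV of a freshly written sstable so that any
// checksum failure is detected before the compaction/flush is declared
// successful.
func verifyWrittenSST(it sstIter) error {
	defer it.Close()
	for it.Next() {
		if _, err := it.Value(); err != nil {
			return err
		}
	}
	return nil
}
```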

@jbowens

erikgrinaker commented 1 year ago

one solution is for each compaction/flush to verify the sstable after it has written it

I think we want CRDB-level verification here rather than Pebble-level verification though (i.e. end-to-end roachpb.Value checksums rather than SST checksums). In the motivating case here, the SST was presumably intact (otherwise Pebble would have errored when opening it), so the corruption likely happened below Raft replication but above Pebble. Are you suggesting some sort of hook where CRDB can verify roachpb.Value checksums during or after the SST is written?

jbowens commented 1 year ago

@erikgrinaker We’re looking at replacing roachpb.Value checksums with per-KV checksums that are end-to-end and fill existing roachpb.Value checksum gaps (they’re not used for MVCC tombstones, Pebble tombstones, timeseries keys, internal key trailers, etc.). Pebble will have direct support for accepting checksums from Cockroach, persisting them, verifying them in persisted data, and surfacing them back out the other end. We’re still exploring designs; some of the thinking is captured in that Google doc and some in #111001.

erikgrinaker commented 1 year ago

Ah, I see, promising! I imagine this will still be some ways off, i.e. not shipping in 24.1? In that case, we'll probably want to add roachpb.Value checksum verification in the interim, as a cheap guard.

lyang24 commented 8 months ago

Hi, I would like to contribute to this issue.

Trying to decipher the information provided: we want to add verification of roachpb.Value checksums when building snapshots, to catch data corruption before the consistency checker does.

From reading the code, I have gathered some context, and I am trying to figure out where roachpb.Value verification fits in when building snapshots.

Would it be in the ballpark of the end goal to take the value bytes from the [visitor](https://github.com/lyang24/cockroach/blob/aa6d150bd6e25be23d3dafb7432f806bc6bb553c/pkg/kv/kvserver/store_snapshot.go#L777-L818), convert the []byte into a roachpb.Value, and call its verify method?

erikgrinaker commented 8 months ago

Yeah, that's exactly it. We'll have to be a bit mindful of the performance impact here, but I expect it to be negligible given all the network and disk IO involved in sending and applying snapshots. It may be better to verify the checksums on the receiver, though, to catch any corruption in transit; that seems unlikely given that TLS checksums the network traffic, but random memory bit flips are still possible.

We have a verifyingMVCCIterator that automatically verifies values, but I think this only works for MVCC iterators -- here, we use an engine iterator. You could consider adapting this to also support engine iterators, if you feel like it, but otherwise it's fine to just do the verification inline during snapshot building too.
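
For illustration, a minimal sketch of that inline check, assuming the visitor hands us the decoded MVCC key and the engine-encoded value bytes (the names and hook point are illustrative, not the actual code in store_snapshot.go):

```go
import "github.com/cockroachdb/cockroach/pkg/storage"

// verifySnapshotValue checks a single KV while the snapshot is being built.
func verifySnapshotValue(mvccKey storage.MVCCKey, rawValue []byte) error {
	// Engine values are MVCCValue-encoded; decode to reach the embedded
	// roachpb.Value and its checksum header.
	v, err := storage.DecodeMVCCValue(rawValue)
	if err != nil {
		return err
	}
	// Verify recomputes the checksum over the key and value bytes and
	// compares it against the one stored in the value header.
	return v.Value.Verify(mvccKey.Key)
}
```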

The consistency checker will detect individual replica corruption by comparing replica checksums. However, this only runs periodically (every 24 hours, or even rarer in large clusters), so it's possible for corruption to spread to other replicas before the consistency checker runs, and then it can look like the corrupt replicas are in fact the correct ones (because they are now a majority). We think we have seen this happen in a couple of escalations. So the motivation here is to prevent the corruption from spreading via snapshots.

We should also have an escape hatch for this, via an internal cluster setting. We currently only verify these checksums during Scan requests, so it's possible (though unlikely) that a cluster has latent value corruption. That could prevent snapshots from being sent and destabilize the cluster, so there should be a way for operators to disable the check until they can repair the corrupted values.
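
For the escape hatch, a minimal sketch of what such a setting could look like (the setting name and default are hypothetical):

```go
import "github.com/cockroachdb/cockroach/pkg/settings"

// snapshotChecksumVerification would let operators turn off value checksum
// verification during snapshots if latent corruption ever blocks rebalancing.
var snapshotChecksumVerification = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.snapshot.value_checksum_verification.enabled",
	"if true, roachpb.Value checksums are verified when building Raft snapshots",
	true,
)
```

The snapshot path would then consult the setting (via its Get method against the cluster settings) and skip verification when it is disabled.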