Closed: erikgrinaker closed this issue 6 months ago
cc @cockroachdb/replication
This is a good step, but will be insufficient when we are sharing disaggregated sstables, which we don't read as part of sending the snapshot (and don't wish to, for fast rebalancing/up-replication etc.).
If we can rely on the filesystem not corrupting anything after the file is written (which I believe we can rely on for object storage), one solution is for each compaction/flush to verify the sstable after it has written it, before declaring the compaction successful. That is, read back the sstable and check all the checksums, block level and per-value. Perhaps for a local filesystem like ext4 with blocks stored in EBS/PD, we can also make such an assumption (internal notes suggest some equivalency between EBS and RAID 6 https://docs.google.com/document/d/1RcZgfj2ptup6gdwyLyVuwDIl9h1SdSY8ZYaC8PQdjnc/edit#bookmark=id.hx6j69aeem62).
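To illustrate the read-back idea, here is a toy sketch (a made-up block format with per-block CRC32s, not Pebble's actual sstable layout): write the blocks, then re-read the file and check every checksum before declaring the write successful.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"os"
	"path/filepath"
)

var crcTable = crc32.MakeTable(crc32.Castagnoli)

// writeBlocks writes each block prefixed by its length and CRC32,
// loosely mimicking per-block sstable checksums.
func writeBlocks(path string, blocks [][]byte) error {
	var buf bytes.Buffer
	for _, b := range blocks {
		var hdr [8]byte
		binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(b)))
		binary.LittleEndian.PutUint32(hdr[4:8], crc32.Checksum(b, crcTable))
		buf.Write(hdr[:])
		buf.Write(b)
	}
	return os.WriteFile(path, buf.Bytes(), 0o644)
}

// verifyBlocks reads the file back and checks every block checksum,
// i.e. the read-back verification step before a compaction/flush is
// declared successful.
func verifyBlocks(path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	for off := 0; off < len(data); {
		if len(data)-off < 8 {
			return fmt.Errorf("truncated block header at offset %d", off)
		}
		n := int(binary.LittleEndian.Uint32(data[off : off+4]))
		want := binary.LittleEndian.Uint32(data[off+4 : off+8])
		off += 8
		if len(data)-off < n {
			return fmt.Errorf("truncated block at offset %d", off)
		}
		if got := crc32.Checksum(data[off:off+n], crcTable); got != want {
			return fmt.Errorf("checksum mismatch at offset %d: got %x, want %x", off, got, want)
		}
		off += n
	}
	return nil
}

func main() {
	path := filepath.Join(os.TempDir(), "blocks.bin")
	defer os.Remove(path)
	if err := writeBlocks(path, [][]byte{[]byte("hello"), []byte("world")}); err != nil {
		panic(err)
	}
	if err := verifyBlocks(path); err != nil {
		panic(err)
	}
	fmt.Println("read-back verification passed")
}
```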
@jbowens
> one solution is for each compaction/flush to verify the sstable after it has written it
I think we want CRDB-level verification here rather than Pebble-level verification though (i.e. end-to-end roachpb.Value checksums rather than SST checksums). In the motivating case here, the SST was presumably intact (otherwise Pebble would have errored when opening it), so the corruption likely happened below Raft replication but above Pebble. Are you suggesting some sort of hook where CRDB can verify roachpb.Value checksums during or after the SST is written?
@erikgrinaker We’re looking at replacing roachpb.Value checksums with per-KV checksums that are end-to-end and fill existing roachpb.Value checksum gaps (they’re not used for MVCC tombstones, Pebble tombstones, timeseries keys, internal key trailers, etc.). Pebble will have direct support for accepting checksums from Cockroach, persisting them, verifying them in persisted data, and surfacing them back out the other end. We’re still exploring designs and some of the thinking is captured in that google doc and some in #111001.
Ah, I see, promising! I imagine this will still be some ways off, i.e. not shipping in 24.1? In that case, we'll probably want to add roachpb.Value checksum verification in the interim, as a cheap guard.
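As a self-contained sketch of the idea behind such an interim guard (the real encoding in `pkg/roachpb` differs; this is only the concept, with a hypothetical layout): a CRC computed over the key and the value payload, stored in a header, and re-verified before the value is trusted.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var crcTable = crc32.MakeTable(crc32.Castagnoli)

// initChecksum prepends a 4-byte CRC over key+payload to the payload.
// Because the key is included, a value copied under the wrong key also
// fails verification, not just a bit-flipped payload.
func initChecksum(key, payload []byte) []byte {
	out := make([]byte, 4+len(payload))
	copy(out[4:], payload)
	crc := crc32.New(crcTable)
	crc.Write(key)
	crc.Write(payload)
	binary.LittleEndian.PutUint32(out[:4], crc.Sum32())
	return out
}

// verify recomputes the CRC and compares it to the stored header.
func verify(key, raw []byte) error {
	if len(raw) < 4 {
		return fmt.Errorf("value too short for checksum header")
	}
	want := binary.LittleEndian.Uint32(raw[:4])
	crc := crc32.New(crcTable)
	crc.Write(key)
	crc.Write(raw[4:])
	if got := crc.Sum32(); got != want {
		return fmt.Errorf("checksum mismatch for key %q: got %08x, want %08x", key, got, want)
	}
	return nil
}

func main() {
	key := []byte("/Table/50/1/42")
	raw := initChecksum(key, []byte("hello"))
	fmt.Println("intact:", verify(key, raw))

	raw[5] ^= 0xff // flip a payload bit to simulate corruption
	fmt.Println("corrupt detected:", verify(key, raw) != nil)
}
```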
Hi, I would like to contribute to this issue.
Trying to decipher the information provided: we want to add verification of roachpb.Value when building snapshots, to catch data corruption before the consistency checker does.
From reading the code, I am trying to figure out where roachpb.Value verification fits into snapshot building. Would converting the []byte value from the [visitor](https://github.com/lyang24/cockroach/blob/aa6d150bd6e25be23d3dafb7432f806bc6bb553c/pkg/kv/kvserver/store_snapshot.go#L777-L818) into a roachpb.Value and calling its verify method be in the ballpark of the end goal here?
Yeah, that's exactly it. We'll have to be a bit mindful of the performance impact, but I expect it to be negligible given all the network and disk IO involved in sending and applying snapshots. It may be better to verify the checksums on the receiver, to catch corruption in transit, although that seems unlikely since TLS already checksums the network traffic; random memory bit flips are still possible, though.
We have a verifyingMVCCIterator that automatically verifies values, but I think this only works for MVCC iterators -- here, we use an engine iterator. You could consider adapting it to also support engine iterators, if you feel like it, but otherwise it's fine to just do the verification inline during snapshot building.
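Roughly, the inline approach could look like this hypothetical sketch (the `kv` type, `verifyFn`, and `visitKVs` are stand-ins for the real engine iterator and roachpb.Value verification in store_snapshot.go, not actual CRDB code): each KV is verified before being handed to the sender, so a corrupt value fails the snapshot instead of spreading.

```go
package main

import "fmt"

type kv struct{ key, value []byte }

// verifyFn stands in for roachpb.Value checksum verification; the real
// hook would decode the raw value and call its verify method.
type verifyFn func(key, value []byte) error

// visitKVs models the snapshot-building visitor: verify each KV before
// sending it, aborting the snapshot on the first corrupt value.
func visitKVs(kvs []kv, verify verifyFn, send func(kv) error) error {
	for _, e := range kvs {
		if err := verify(e.key, e.value); err != nil {
			return fmt.Errorf("snapshot aborted: %w", err)
		}
		if err := send(e); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Toy verifier: treat an empty value as "corrupt".
	verify := func(key, value []byte) error {
		if len(value) == 0 {
			return fmt.Errorf("corrupt value for key %q", key)
		}
		return nil
	}
	sent := 0
	send := func(kv) error { sent++; return nil }

	ok := []kv{{[]byte("a"), []byte("1")}, {[]byte("b"), []byte("2")}}
	fmt.Println("clean:", visitKVs(ok, verify, send), "sent:", sent)

	bad := []kv{{[]byte("c"), nil}}
	fmt.Println("corrupt detected:", visitKVs(bad, verify, send) != nil)
}
```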
The consistency checker will detect individual replica corruption by comparing replica checksums. However, this only runs periodically (every 24 hours, or even rarer in large clusters), so it's possible for corruption to spread to other replicas before the consistency checker runs, and then it can look like the corrupt replicas are in fact the correct ones (because they are now a majority). We think we have seen this happen in a couple of escalations. So the motivation here is to prevent the corruption from spreading via snapshots.
We should also have an escape hatch for this, via an internal cluster setting. We currently only verify these checksums during Scan requests, so it's possible (though unlikely) that a cluster has latent value corruption. That could prevent snapshots from being sent, which could destabilize the cluster, so operators need a way to disable the verification until they can repair the corrupted value.
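A rough sketch of the escape hatch (the setting name and gating function are hypothetical; the real implementation would go through CRDB's cluster settings machinery rather than a package-level bool):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// verifySnapshotValues models an internal cluster setting, e.g. a
// hypothetical kv.snapshot.value_verification.enabled, that operators
// can flip off if latent corruption would otherwise block snapshots.
var verifySnapshotValues atomic.Bool

// maybeVerify runs the checksum check only when the setting is on.
func maybeVerify(key, value []byte, verify func(key, value []byte) error) error {
	if !verifySnapshotValues.Load() {
		return nil // escape hatch: verification disabled by operator
	}
	return verify(key, value)
}

func main() {
	// A verifier that always fails, simulating a latent corrupt value.
	failing := func(_, _ []byte) error { return fmt.Errorf("checksum mismatch") }

	verifySnapshotValues.Store(true)
	fmt.Println("enabled, corrupt value blocks snapshot:",
		maybeVerify([]byte("k"), []byte("v"), failing) != nil)

	// Operator disables the setting so snapshots can proceed while the
	// corrupted value is repaired.
	verifySnapshotValues.Store(false)
	fmt.Println("disabled, snapshot proceeds:",
		maybeVerify([]byte("k"), []byte("v"), failing) == nil)
}
```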
With storage corruption, it's possible for the corruption to spread via Raft snapshots before the consistency checker detects it. We saw this possibly happen in https://github.com/cockroachlabs/support/issues/2583. To avoid this, we should check roachpb.Value checksums when sending snapshots.
Related to #93519.
Jira issue: CRDB-31510