cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

WAL corruptions are silently ignored #453

Open lni opened 4 years ago

lni commented 4 years ago

When replaying the WAL, ErrInvalidChunk is returned if the recorded CRC doesn't match the data (e.g. due to disk hardware corruption); code is here.

This causes the replayWAL method to stop processing further WAL records without notifying the user of the corruption; code is here.

I'd expect pebble.Open() to return an error for such corruption.

Jira issue: PEBBLE-179
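A minimal sketch of the behavior being described, using hypothetical names rather than Pebble's actual code: the replay loop treats an invalid chunk the same as a clean end of the log and returns success, so the corruption never surfaces to the caller.

```go
// Simplified sketch (not Pebble's actual code) of the reported behavior: a
// CRC failure on a chunk is treated like a clean end of the log, so replay
// returns success and the caller is never told that later records were dropped.
package walsketch

import (
	"errors"
	"io"
)

// errInvalidChunk stands in for the record reader's checksum-mismatch error.
var errInvalidChunk = errors.New("invalid chunk (checksum mismatch)")

// recordReader abstracts a WAL reader: Next returns the next record,
// io.EOF at a clean end of log, or errInvalidChunk on a CRC mismatch.
type recordReader interface {
	Next() ([]byte, error)
}

func replayWAL(r recordReader, apply func([]byte) error) error {
	for {
		rec, err := r.Next()
		if err == io.EOF || errors.Is(err, errInvalidChunk) {
			// The invalid chunk is assumed to be a torn write at the tail and
			// treated as end-of-log: no error is surfaced, even if the bad
			// chunk sits in the middle of acknowledged data.
			return nil
		}
		if err != nil {
			return err
		}
		if err := apply(rec); err != nil {
			return err
		}
	}
}
```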

petermattis commented 4 years ago

@lni You are correct about the Pebble behavior. This is intentional and I believe matches the behavior of RocksDB's kPointInTimeRecovery WALRecoveryMode:

  // Recover to point-in-time consistency (default)
  // We stop the WAL playback on discovering WAL inconsistency
  // Use case : Ideal for systems that have disk controller cache like
  // hard disk, SSD without super capacitor that store related data
  kPointInTimeRecovery = 0x02,

I'd expect pebble.Open() to return an error for such corruption.

Note that this case can occur without actual hardware corruption due to WAL recycling.
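A hedged sketch of how recycling can produce this case, assuming a recyclable chunk format whose header carries the owning log number (field layout simplified; not Pebble's actual format):

```go
// Hedged sketch of why recycling can produce a corrupt-looking tail without
// any hardware fault: a recycled file still contains chunks from the WAL that
// previously lived in it. In a recyclable chunk format the header carries the
// owning log number, so a chunk belonging to an older log, or one whose CRC
// fails because stale bytes were partially overwritten, is treated as the end
// of the current log.
package walsketch

type chunkHeader struct {
	checksum  uint32 // CRC over the payload (simplified here)
	length    uint16
	chunkType byte
	logNum    uint32 // present in the recyclable format
}

// chunkEndsLog reports whether the reader should treat this chunk as the end
// of the log being replayed. It cannot distinguish stale recycled bytes, a
// torn write at the tail, and genuine hardware corruption.
func chunkEndsLog(h chunkHeader, payload []byte, curLogNum uint32, crc func([]byte) uint32) bool {
	if h.logNum != curLogNum {
		return true // leftover chunk from the file's previous life
	}
	if crc(payload) != h.checksum {
		return true
	}
	return false
}
```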

lni commented 4 years ago

@petermattis thanks for the quick response!

Will you use pebble to store Raft logs in CockroachDB? For such a use case, I guess RocksDB's kAbsoluteConsistency mode is required?

petermattis commented 4 years ago

Will you use pebble to store Raft logs in CockroachDB?

Yes.

For such a use case, I guess RocksDB's kAbsoluteConsistency mode is required?

Why? kAbsoluteConsistency requires that the DB was "shut down cleanly". Using it would prevent a DB from recovering if a node crashed. The requirement from Raft is that when a piece of data is reported as synced, it doesn't disappear. That is what kPointInTimeRecovery accomplishes in the absence of hardware corruption. kAbsoluteConsistency provides something much stricter: there are zero errors when recovering the WAL. That means the previous incarnation could not have crashed in the middle of writing to the WAL, which is a very high bar to clear.

lni commented 4 years ago

The requirement from Raft is that when a piece of data is reported as synced, it doesn't disappear. That is what kPointInTimeRecovery accomplishes in the absence of hardware corruption.

Thanks for the detailed explanation. I spent some time reading various papers/articles on this topic; it is more complicated than I previously realized.

My concern is still the hardware corruption part - all records after a corrupted record (say a bit flip) are silently dropped and the corruption itself is never reported. This is going to cause problems for Raft.

ajkr commented 4 years ago

My concern is still the hardware corruption part - all records after a corrupted record (say a bit flip) are silently dropped and the corruption itself is never reported. This is going to cause problems for Raft.

It's a good observation. FWIW, RocksDB's kTolerateCorruptedTailRecords, despite its name, only tolerates some corrupted tail records. In particular, it tolerates those that imply they were unfinished due to a crash, like the WAL ends with a partial record. However it is not flexible. Some errors, like checksum corruption, might be caused by a crash during write or might be caused by a bit flip. In those cases, kTolerateCorruptedTailRecords acts conservatively and returns an error, whereas kPointInTimeRecovery logs a warning and returns success.
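A small sketch of that policy difference, using hypothetical types rather than RocksDB's API: both modes stop at an unfinished tail record, and they differ only on the ambiguous checksum-mismatch case.

```go
// Illustrative sketch of the policy difference described above. A partial
// record at the very end of the log is unambiguously a crash artifact; a
// checksum mismatch is ambiguous, and the two modes differ only in how they
// treat that ambiguous case.
package walsketch

import "errors"

type recoveryMode int

const (
	tolerateCorruptedTailRecords recoveryMode = iota
	pointInTimeRecovery
)

var (
	errPartialTailRecord = errors.New("partial record at end of log") // clearly a torn write
	errChecksumMismatch  = errors.New("checksum mismatch")            // torn write or bit flip
)

// stopReplay reports whether replay stops at this error and, if so, whether
// the condition is surfaced to the caller.
func stopReplay(mode recoveryMode, err error) (stop bool, surfaced error) {
	switch {
	case errors.Is(err, errPartialTailRecord):
		return true, nil // both modes tolerate an unfinished tail record
	case errors.Is(err, errChecksumMismatch):
		if mode == tolerateCorruptedTailRecords {
			return true, err // conservative: report the ambiguous case
		}
		return true, nil // point-in-time: log a warning and succeed
	}
	return false, nil
}
```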

ajkr commented 4 years ago

kPointInTimeRecovery seems ideal, IMO, as long as there's a check to ensure recovery didn't lead to an inconsistent state (not sure exactly what this means for Raft), and there's a way to replicate the lost data. Lacking either of those, kTolerateCorruptedTailRecords with WAL recycling disabled seems safer as DBs with potentially inconsistent/missing state should be thrown away. I'm not that familiar with Raft to know whether the conditions are met.

lni commented 4 years ago

@ajkr Thanks for the info.

I think kPointInTimeRecovery is not ideal for Raft. According to SSD vendor specs, uncorrectable bit errors happen at a certain advertised rate (e.g. 1 sector per 10^17 bits read for Intel enterprise SSDs). This can cause some Raft log entries reported as persisted to disappear after a reboot. As mentioned by @petermattis above, that directly violates a core requirement of Raft.
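As a rough illustration of that rate (assuming the advertised figure of one uncorrectable sector per 10^17 bits read), the expected number of bad sectors is just the bits read divided by 10^17, i.e. roughly one per 12.5 PB read:

```go
// Back-of-the-envelope arithmetic for the rate cited above (assumed figure:
// one uncorrectable sector per 10^17 bits read). 10^17 bits is 1.25 * 10^16
// bytes, so at this rate roughly one sector is lost per ~12.5 PB read.
package main

import "fmt"

func expectedUncorrectableSectors(bytesRead float64) float64 {
	const bitsPerError = 1e17
	return bytesRead * 8 / bitsPerError
}

func main() {
	// Reading 100 TB at this rate gives an expectation of ~0.008 bad sectors.
	fmt.Printf("%.3f\n", expectedUncorrectableSectors(100e12))
}
```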

Any plan to support something similar to kTolerateCorruptedTailRecords as an option?

petermattis commented 4 years ago

@lni Raft lives in a blissful world where data that has been written durably to disk is never later corrupted. CockroachDB adds additional verification of the consistency of Raft replicas on top of Raft, performing these consistency checks once every 24 hours. The story of what happens when there is an inconsistency is still a bit lacking, though an attempt is made to evict the replica in the minority and consider the majority to be correct. This isn't a Raft-level concept, but something CRDB built on top of Raft.

While kTolerateCorruptedTailRecords can allow recovery of more data in the RocksDB/Pebble WAL, it doesn't necessarily improve the store for Raft. Raft expects any committed log entry to not disappear. If an entry becomes corrupted and is no longer readable, it isn't clear at all what should be done with the subsequent entries. In CRDB, the Raft log contains KV operations. Missing one of those operations would lead to corruption of the KV state.

Recovery of the RocksDB/Pebble WAL is also only one aspect of Raft replica state. For the most part, the RocksDB/Pebble WAL is never read, as records written to it are also written to sstables. Corruption of a Raft log record written to an sstable is the more likely scenario, given that the sstable data is rewritten many times during compactions and there is only a small window of time during which a record exists solely in the WAL. The Raft library we're using (etcd/raft) doesn't know how to handle a single corrupted Raft log entry, so instead we indicate that the Raft log is truncated at that point. Could something better be done here? Possibly, though it makes the recovery mechanism of the consensus algorithm more complex than it already is.

Any plan to support something similar to kTolerateCorruptedTailRecords as an option?

Not at this time. We wouldn't want to add support for it until we had a plan to use it for CRDB. I don't currently see how we'd use kTolerateCorruptedTailRecords for CRDB, though I will pass along the idea that perhaps there is something better to be done with recovery from disk-level corruption.

sumeerbhola commented 4 years ago

@petermattis Regarding just the WAL, where we can't distinguish between a harmless corruption and a harmful one, I wonder if the story becomes slightly simpler when we separate the (Pebble/Rocks)DB instance that stores the state machine, with no WAL, from the one that stores the raft state, which will have a WAL.

The only raft state in the state-machine-db will be the RaftAppliedIndex, which lags the raft entries that this replica knows about. The raft-db may have some part of its WAL corrupted and then the node crashes -- when the node recovers, kPointInTimeRecovery will make it appear to have a shorter raft log. I briefly discussed this with @ajwerner: (a) the simpler part here is to talk to the raft leader, and if it thinks this replica should have more entries, it gets those new entries from the leader; (b) the difficult (impossible?) part is that these missing entries at this replica were needed to make those entries committed -- but this should only be an issue if the leader is new and needs the state of this recovering replica, and so unless we have correlated failures we should be ok?

petermattis commented 4 years ago

so unless we have correlated failures we should be ok?

I'm not seeing any problems with this logic. I'll think about this more, though.

lni commented 4 years ago

@petermattis Thanks a lot for the detailed reply above. I spent a little more time digging deeper into this.

I just read this FAST2018 paper on how to handle & recover from raft log corruptions. It covers quite a few issues we discussed here in this thread.

It also has a taxonomy of possible responses to corruption. kPointInTimeRecovery is basically the 'truncate' approach described in the paper, and the paper provides an example (figure 2) of how such an approach can cause committed entries to disappear from the entire cluster. 'Truncate' is thus marked as unsafe.

I won't worry too much about how to recover from a hardware-corrupted log. When there is a checksum mismatch that can be proven to be actual hardware data corruption (e.g. when sync write = true and the next record is complete and valid), I just want to be notified so I can at least crash to ensure safety.

We wouldn't want to add support for it until we had a plan to use it for CRDB.

For PRs related to new features that won't be used by CRDB, will they be considered and accepted?

petermattis commented 4 years ago

The FAST paper is a good read and I think it highlights that the work involved traverses both the storage layer and the replication layer. Naively skipping over corrupted log records at the storage layer (which is what kTolerateCorruptedTailRecords does) is extremely problematic and can cause problems for the replication layer. Requiring absolute consistency in the log is also problematic. Incorporating protocol-aware recovery is on CockroachDB's radar, but not something that has been prioritized yet.

I won't worry too much about how to recover from a hardware-corrupted log. When there is a checksum mismatch that can be proven to be actual hardware data corruption (e.g. when sync write = true and the next record is complete and valid), I just want to be notified so I can at least crash to ensure safety.

Do you have a proposal for how to do this? The FAST paper uses knowledge at the replication layer to make this determination.

For PRs related to new features that won't be used by CRDB, will they be considered and accepted?

From the README: Pebble intentionally does not aspire to include every feature in RocksDB and is specifically targeting the use case and feature set needed by CockroachDB. I think this can be generalized: there is a reluctance to add functionality that won't be used by CockroachDB. Why? Each additional feature and bit of functionality adds maintenance burden. The cross-section of all of the features in RocksDB has grown to be enormous and there are often surprising incompatibilities and gotchas.

lni commented 4 years ago

Naively skipping over corrupted log records at the storage layer (which is what kTolerateCorruptedTailRecords does) is extremely problematic

My understanding is that kPointInTimeRecovery silently truncates the WAL at any hardware-corrupted record, while kTolerateCorruptedTailRecords returns an error on such a record. As explained by @ajkr above, kTolerateCorruptedTailRecords only skips a partially written tail record at the end of the WAL caused by a crash. Is there anything I missed here?

The FAST paper uses knowledge at the replication layer to make this determination.

Section 3.3.3 "Disentangling Crashes and Corruption in Log" of the paper describes how hardware corruption is detected in the storage layer. Could you please shed some light on which part of the paper uses replication-layer info to determine hardware corruption?

I think this can be generalized: there is a reluctance to add functionality that won't be used by CockroachDB.

Fair enough. Thanks for the info and this wonderful project!

bra-fsn commented 4 years ago

Slightly related to this conversation: @petermattis, what happens if we use CRDB (be it on RocksDB or Pebble) on a checksumming file system (like ZFS), which will return an error if something tries to read a bad block (instead of returning corrupted data)?

github-actions[bot] commented 2 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!

sumeerbhola commented 2 years ago

TODO: try to summarize this discussion here, or in another issue. It isn't clear we want to do anything here, but I've swapped out most state on this discussion and there is enough here to (re)consume that we shouldn't close this issue yet.

nicktrav commented 2 years ago

the story becomes slightly simpler when we separate the (Pebble/Rocks)DB instance that stores the state machine, with no WAL, from the one that stores the raft state, which will have a WAL.

@sumeerbhola - given your recent work / investigation in this space, I'm wondering if you have an inclination one way or the other?

cc: @tbg @erikgrinaker for visibility, in case there are strong feelings here from a CRDB replication perspective.

jbowens commented 1 year ago

There's been one development since this discussion was active: since #871, Pebble has required closed WALs to have clean tails. This guards against corruption in WALs other than the last active one, but does nothing for the most recent WAL. In the most recent WAL, Pebble continues to assume an invalid entry is the end of the log.
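A minimal sketch of that replay policy, with hypothetical names (not Pebble's implementation): strict for WALs known to be closed, lenient only for the newest WAL.

```go
// Hedged sketch of the policy described above: an invalid chunk in a WAL
// known to be closed -- because a newer WAL exists -- is reported as
// corruption, while in the most recent WAL it is still assumed to mark the
// end of the log.
package walsketch

import (
	"errors"
	"fmt"
	"io"
)

var errInvalidChunk = errors.New("invalid chunk (checksum mismatch)")

func replayOneWAL(walNum int, isMostRecent bool, next func() ([]byte, error), apply func([]byte)) error {
	for {
		rec, err := next()
		switch {
		case err == io.EOF:
			return nil // clean end of log
		case errors.Is(err, errInvalidChunk):
			if isMostRecent {
				return nil // assume a torn write at the tail of the active WAL
			}
			return fmt.Errorf("wal %d: corruption before clean tail: %w", walNum, err)
		case err != nil:
			return err
		}
		apply(rec)
	}
}
```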

As I understand it, there are two broad categories of solutions discussed here:

  1. Replication independent: The relevant part of the linked FAST2018 paper suggests introducing a second write() to the WAL-write code path. If I understand correctly, the new write saves a small record to a secondary, parallel log file recording the expectation that the log entry will be committed in the primary log file. This secondary log file can be used on WAL replay to detect corruption in all log entries except the most recently written entry (see the sketch after this list).
  2. Replication dependent: After replay, we can detect whether we're missing Raft log entries because the WAL was truncated at a corruption.
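A hedged sketch of option 1, assuming the marker carries the entry's log number, offset, and checksum (hypothetical names; this is one reading of the paper's idea, not a Pebble design):

```go
// Hedged sketch of the "secondary, parallel log" idea in option 1 above:
// before an entry in the primary WAL is acknowledged, a small marker naming
// it is appended to a side log. On replay, a marker whose entry cannot be
// decoded from the primary WAL proves the entry was lost to corruption rather
// than to an unfinished tail write (the very last entry can remain ambiguous
// if its marker write raced the crash).
package walsketch

import "fmt"

type commitMarker struct {
	logNum   uint64
	offset   int64  // offset of the committed entry in the primary WAL
	checksum uint32 // checksum of the entry payload, for cross-checking
}

// checkAgainstMarkers compares how far replay of the primary WAL got against
// the markers recorded in the secondary log.
func checkAgainstMarkers(markers []commitMarker, replayedThrough int64) error {
	for _, m := range markers {
		if m.offset >= replayedThrough {
			return fmt.Errorf("wal %d: committed entry at offset %d was not replayed: corruption",
				m.logNum, m.offset)
		}
	}
	return nil
}
```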
jbowens commented 1 year ago

In Cockroach, we do expect missing raft entries to be detected promptly by the Raft implementation, which makes this not a priority from Cockroach's perspective.

tbg commented 1 year ago

In Cockroach, we do expect missing raft entries to be detected promptly by the Raft implementation, which makes this not a priority from Cockroach's perspective.

How so? The replication layer relies on durability of the storage engine. Silently discarded data would not necessarily be detected and could lead to basically any split brain scenario if we're exceptionally unlucky.

jbowens commented 1 year ago

This was derived from Ben’s slack comment. I don’t personally have any context on how likely it is for dropped raft log entries to go undetected.

sumeerbhola commented 1 year ago

@tbg split brain because of raft HardState loss? If we only lost raft log entries, am I correct in assuming that we would quickly detect this, so it would not be silent corruption? And for the state machine we can afford to lose a suffix, since we can reapply from the raft log.

tbg commented 1 year ago

Maybe I'm misunderstanding the question, but if a replica loses a tail of raft log entries, this is a classic violation of durability and wouldn't necessarily be detectable. For the most unlikely but most clear example, if all three replicas crash and lose the last log entry 100, then who should detect that? The entry is gone even though it was acked. But even the loss of a vote on a single node can cause issues. If that node forgets its vote via an ill-timed crash, it can vote again. It can then help elect two leaders for the same term. In an academic sense, loss of durability is not generally detectable and pretty much anything follows from it if you allow for unfortunate circumstances.

Note that the ideas in https://github.com/cockroachdb/cockroach/issues/88442 are applicable here. We could say that if there's WAL tail corruption, we put all voters on this node into the mode described in the issue until they're known to have all indexes that they could've possibly acked in their previous life.

jbowens commented 1 year ago

We could say that if there's WAL tail corruption, we put all voters on this node into the mode described in the issue until they're known to have all indexes that they could've possibly acked in their previous life.

Just to clarify a bit, we don't know when there's WAL tail corruption. We can't distinguish between WAL tail corruption and an in-progress but incomplete write. Maybe that's exactly what you're saying? After every crash, on boot all nodes' voters enter the described mode? Clean exits that closed the WAL by emitting an EOF trailer could skip the process on boot, because it's unambiguous where the true end of the WAL is.

tbg commented 1 year ago

See also:

[image: slide from the presentation linked below]

https://www.snia.org/sites/default/files/SDC/2018/presentations/DPCO/Alagappan_Ramnatthan_Protocol_Aware_Recovery_Consensus_Based_Storage.pdf

Judging from what was written above, it sounds as though pebble would stop replaying the WAL if there's a corrupted block somewhere in the middle of the committed part of the WAL. That could have bad consequences; are we sure we don't want to at least fail in such cases?

jbowens commented 1 year ago

With WAL reuse, there's no way to know what the committed part of the WAL is. There are ways we can reduce the number of scenarios where corruption goes undetected, but given that they can't eliminate them, I think we should leverage replication.

jbowens commented 1 year ago

If we began padding WAL writes to block boundaries, like we would do if we began using direct I/O, we could read ahead to see if any future blocks' entries successfully decode to a log entry encoding the same log number. If they do, we've detected corruption in the middle of the WAL. This still does nothing for the case where a section of the tail of the WAL is corrupted, in which case we can't distinguish between an in-progress, unacknowledged write and a committed, acknowledged, but corrupted write.

To detect that without internode communication would require a second synchronous fsync, I think.

jbowens commented 5 months ago

we could read ahead to see if any future blocks' entries successfully decode to a log entry encoding the same log number. If they do, we've detected corruption in the middle of the WAL.

I no longer think this is possible in the cloud environment as written. My understanding is that on at least one cloud provider, writes to blocks N and N+1 issued in parallel will not necessarily complete in order; block N+1 may be persisted while block N is not. We'd need to include in each block some record of the offset to which we believe we've already durably synced. Then, if we read ahead and observed an intact block that claimed we had durably synced up to a higher offset than we successfully replayed, we'd know corruption occurred.
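A hedged sketch of that refinement, assuming a hypothetical block-header layout that records the writer's durably-synced offset:

```go
// Hedged sketch of the refinement described above (hypothetical on-disk
// layout, not a Pebble design): with writes padded to block boundaries, each
// block header records the WAL offset the writer believed was already durably
// synced when the block was written. On replay, an intact later block that
// claims durability beyond the point where replay stopped proves that an
// acknowledged prefix was lost, i.e. corruption.
package walsketch

import "fmt"

type blockHeader struct {
	logNum        uint64
	blockOffset   int64 // offset of this block within the WAL
	syncedThrough int64 // writer's view of the durably synced offset at write time
}

func detectLostPrefix(replayedThrough int64, laterIntactBlocks []blockHeader) error {
	for _, b := range laterIntactBlocks {
		if b.syncedThrough > replayedThrough {
			return fmt.Errorf("wal %d: block at offset %d claims durability through %d, but replay stopped at %d",
				b.logNum, b.blockOffset, b.syncedThrough, replayedThrough)
		}
	}
	return nil
}
```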