facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

non-latest WAL log corruption after EBS snapshot restore in AWS #12467

Open tonyxuqqi opened 7 months ago

tonyxuqqi commented 7 months ago

Version: RocksDB 6.29

Expected behavior

With kPointInTimeRecovery, after an EBS restore, non-latest WAL log files should not be corrupted.

Actual behavior

A non-latest WAL file is corrupted after the EBS restore. We also noticed that after enabling track_and_verify_wals_in_manifest, some WAL files' sizes in the manifest do not match the actual ones; the actual size is bigger than the size recorded in the manifest.

"[2024/03/08 14:14:16.534 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:818] /var/lib/wal/rocksdb/15557239.log: dropping 32768 bytes; Corruption: checksum mismatch [2024/03/08 14:14:16.534 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:818] /var/lib/wal/rocksdb/15557239.log: dropping 64430 bytes; Corruption: error in middle of record [2024/03/08 14:14:16.534 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:1140] Point in time recovered to log #15557239 seq #296119395244 [2024/03/08 14:14:16.534 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:890] Recovering log #15557242 mode 2 [2024/03/08 14:14:16.535 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:896] /var/lib/wal/rocksdb/15557242.log: dropping 36031876 bytes [2024/03/08 14:14:16.535 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:890] Recovering log #15557245 mode 2 [2024/03/08 14:14:16.536 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:896] /var/lib/wal/rocksdb/15557245.log: dropping 35957673 bytes "

Steps to reproduce the behavior

Run a heavy write workload, take an EBS snapshot, restore it on another machine, and then open the RocksDB instance.

ajkr commented 7 months ago

Could you please provide the setting you have for recycle_log_file_num? Additionally, if you have the full contents of the OPTIONS file used to write the DB, that could potentially give even more clues.

nkg- commented 7 months ago

TiKV uses kPointInTimeRecovery and kAbsoluteConsistency, so it has WAL recycling disabled, i.e. recycle_log_file_num=0.

As for WriteOptions, I only see 3 options ever being set.
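
For reference, the relevant settings look roughly like this (a minimal sketch of the configuration being described, not TiKV's actual code; the path and values are illustrative):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Recover to the last consistent point rather than failing on a torn WAL tail.
  options.wal_recovery_mode = rocksdb::WALRecoveryMode::kPointInTimeRecovery;
  // WAL recycling disabled, as stated above.
  options.recycle_log_file_num = 0;
  // WAL size tracking in the MANIFEST (enabled later for debugging).
  options.track_and_verify_wals_in_manifest = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example_db", &db);
  if (!s.ok()) {
    return 1;
  }

  // Most writes use sync=false; some critical writes set sync=true.
  rocksdb::WriteOptions wo;
  wo.sync = false;
  db->Put(wo, "key", "value");

  delete db;
  return 0;
}
```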

ajkr commented 7 months ago

Thanks for the info.

Does TiKV also use an external syncer thread, that is, setting manual_wal_flush==true + a loop of FlushWAL(true /* sync */)?
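
That is, a pattern roughly like the following (a sketch of the setup I mean, not TiKV's actual code; the interval and shutdown handling are illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#include <rocksdb/db.h>

// Sketch only: with DBOptions::manual_wal_flush == true, writes stay in
// RocksDB's WAL buffer until FlushWAL is called; sync=true also fsyncs.
void RunWalSyncer(rocksdb::DB* db, std::atomic<bool>& stop) {
  while (!stop.load()) {
    rocksdb::Status s = db->FlushWAL(/*sync=*/true);
    // (Error handling omitted in this sketch.)
    (void)s;
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}
```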

Did your LOG file contain any occurrences of "WAL Sync error"?

Do you still have 15557239.log? I can't tell from the error message, but it would be interesting to know if this error is around the end of the log file.

nkg- commented 7 months ago

Does TiKV also use an external syncer thread, that is, setting manual_wal_flush==true + a loop of FlushWAL(true /* sync */)?

Great question. I think @tonyxuqqi mentioned TiKV does trigger FlushWAL manually, but I can't find a code reference. @tonyxuqqi, can you confirm?

Did your LOG file contain any occurrences of "WAL Sync error"?

Yeah, we had checked earlier. Unfortunately there is no "WAL Sync error". That would have been the perfect smoking gun.

Do you still have 15557239.log? I can't tell from the error message, but it would be interesting to know if this error is around the end of the log file.

We don't have the log file right now, but we get tens of repros per week. How can you tell if the corruption is in the middle or at the end? FWIW, what I saw was that there were 359 missing records.

ajkr commented 7 months ago

This is good info. I think 359 missing records should mean the corruption was near the tail. It might be the number of records written between when the last FlushWAL(true /* sync */) finished and when the WAL ended.

There is one bug fix I see that could be relevant: #10560

I think you said you hit this problem even before turning on WAL tracking? WAL tracking is a good feature, but important bug fixes are needed that probably are not present in version 6.29. #10185 sounds like a possible fix for the symptom you described.

nkg- commented 7 months ago

I think you said you hit this problem even before turning on WAL tracking?

Yes. We enabled it to debug the corruption, so the corruption was there before.

Both fixes seem very relevant. https://github.com/facebook/rocksdb/pull/10560 in particular seems to match @tonyxuqqi's hypothesis that we were discussing earlier this week. I can try patching this one to fix the corruption.

And https://github.com/facebook/rocksdb/pull/10185 to fix the manifest tracking. We indeed see hundreds of size mismatches per hour, which don't point to corrupt WALs.

mittalrishabh commented 7 months ago

@ajkr Can RocksDB start writing to a new WAL before previous WALs are fsynced?

ajkr commented 7 months ago

Hi, yes, it can. When the next WAL sync is requested, though, we fsync in order from the oldest unsynced WAL up to the active WAL.

mittalrishabh commented 7 months ago

I think it doesn't mean that the latest WAL file is written only after the old WAL files; the filesystem can flush blocks independently. Since the filesystem's flushing behavior may not guarantee sequential writing of the WAL files, is it safe to assume that the tail of a non-latest WAL file is always completely written in the case of an unclean shutdown?

ajkr commented 7 months ago

Yes you are right. As you described it, there is a valid scenario where point-in-time recovery encounters "Corruption: checksum mismatch" and does not proceed to recover the most recent log.

However, this scenario should not be enough to trigger "Corruption: SST file is ahead of WALs in CF lock". There is another fsync step to prevent that: we fsync the WALs whose data was flushed before recording the SST file in the MANIFEST.
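
In other words, the ordering during a memtable flush is roughly the following (an illustrative sketch of the sequence described above, not RocksDB's actual flush code; the helper names are placeholders):

```cpp
#include <iostream>

// Placeholder helpers standing in for the real flush steps.
void WriteSstFromMemtable()        { std::cout << "write SST file\n"; }
void SyncWalsCoveringFlushedData() { std::cout << "fsync WALs holding the flushed data\n"; }
void RecordSstInManifest()         { std::cout << "record SST (and obsolete WALs) in MANIFEST\n"; }

void FlushMemtableSketch() {
  WriteSstFromMemtable();         // 1. persist the memtable contents as an SST
  SyncWalsCoveringFlushedData();  // 2. fsync WALs before the MANIFEST update...
  RecordSstInManifest();          // 3. ...so a crash before this point still replays synced WALs
}
```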

mittalrishabh commented 7 months ago

Thanks, It makes sense.

mittalrishabh commented 7 months ago

Hi Andrew, I have put some comments on the PR https://github.com/facebook/rocksdb/pull/10560.

ajkr commented 7 months ago

Replied there - feel free to follow up there if I missed something.

mittalrishabh commented 7 months ago

@ajkr We already have https://github.com/facebook/rocksdb/pull/10185 in our TiKV branch. It appears that PingCAP cherry-picked a few changes after forking from the 6.29 release. I have attached a screenshot of the cherry-picked changes for your reference. Based on the current code, it seems that there might be a bug in tracking WAL size in the manifest, or there could be missing fsync calls. I am cherry-picking the other change https://github.com/facebook/rocksdb/pull/10560 to see if it resolves the issue.

(Screenshot of the cherry-picked changes attached, dated 2024-03-25.)

nkg- commented 7 months ago

+1 to what @mittalrishabh said. We are cherry-picking the fsync fix https://github.com/facebook/rocksdb/pull/10560 and will update here if it fixes the WAL corruption.

As for the WAL size mismatch with the manifest, as @mittalrishabh said, that fix is already included, and we still see LOTS of mismatches (every RocksDB instance shows about 1 mismatch per hour). So we're wondering if it's possible there is another bug in WAL size tracking. @ajkr: Is there anything we can provide to help you debug that?

ajkr commented 7 months ago

some WAL files' sizes in the manifest do not match the actual ones; the actual size is bigger than the size recorded in the manifest.

Do you have an example error message? I am not sure if I'm reading something backwards or looking in the wrong place. The code I am looking at is this - https://github.com/facebook/rocksdb/blob/18567348214e80edba54a12b21b8a2524e8e4c72/db/wal_edit.cc#L198-L205

For that check at least, actual WAL size bigger than the manifest size should be tolerated.
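
Paraphrasing that check (an illustrative restatement, not the exact source; the names are placeholders):

```cpp
#include <cstdint>

#include <rocksdb/status.h>

// Illustrative restatement of the linked check in db/wal_edit.cc.
rocksdb::Status CheckWalSizeSketch(uint64_t actual_size_on_disk,
                                   uint64_t synced_size_in_manifest) {
  // The MANIFEST records a lower bound on the synced size, so an on-disk file
  // that is larger than the recorded size is tolerated; only a smaller file
  // is treated as corruption.
  if (actual_size_on_disk < synced_size_in_manifest) {
    return rocksdb::Status::Corruption(
        "WAL file is smaller than the synced size recorded in the MANIFEST");
  }
  return rocksdb::Status::OK();
}
```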

nkg- commented 6 months ago

@ajkr: Our size mismatch detection is not based on the check you referenced above. Rather, we have a script that runs every 2 minutes on every storage node (running RocksDB); it checks the current WALs (except the latest one) and compares each actual file size with the file size in the manifest.

So a few questions:

  1. Does the above comparison make sense, i.e., does a mismatch point to something unexpected?
  2. More specifically, does this size mismatch point to any corruption or a missing fsync?
  3. Or maybe a mismatch between the actual file size and the size in the manifest is somewhat expected, since the code doesn't update the size in the manifest for every fsync of the file?

In general, we just want to know if size mismatch points to something nefarious (like another bug) at all.

ajkr commented 6 months ago

The short answer is, no, what you're checking is not guaranteed.

Longer answer: the synced size recorded in the MANIFEST is just a lower bound on the latest synced offset. The actual latest synced offset (and thus file size) may be larger. It was originally built like this because we also recorded synced size for the active WAL, which is continually growing. Currently track_and_verify_wals_in_manifest only takes effect for the inactive WALs so I agree it is confusing that we don't simply record a final size after a final sync. I hope that we bring this feature back for active WAL again some day, though, so don't want to specialize it for the case of inactive WALs.

nkg- commented 6 months ago

Currently track_and_verify_wals_in_manifest only takes effect for the inactive WALs so I agree it is confusing that we don't simply record a final size after a final sync.

Just confirming: even for inactive WALs (what we called non-latest in the OP), the size matching is not guaranteed. Would it 'eventually' match after your fix in https://github.com/facebook/rocksdb/pull/10560/files#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R1541? It seems so, since we don't stop tracking WALs until the pre-sync size and the flushed size are equal. I understand they might NOT match for a short duration, between finishsync (line 1547) and addWAL (line 1539) in the next call to markWALSynced, which might be a few seconds.

I hope that we bring this feature back for active WAL again some day, though, so don't want to specialize it for the case of inactive WALs.

What do you mean by special casing for inactive WALs? Writing both preSyncSize and postSyncSize seems reasonable for all WALs.

ajkr commented 6 months ago

Yes, I agree the MANIFEST synced size should eventually match the actual synced size (with that fix).

What do you mean by special casing for inactive WALs

Currently this feature records pre-sync sizes upon sync completing for inactive WALs only. Say a WAL is synced N times in its lifetime. It is active the first (N-M) times a sync completes, and inactive for the final M sync completions (I think M must be either 1 or 2...). This feature records the final M syncs to the MANIFEST, which feels odd. My intuition is that the feature should either (a) record all N syncs to the MANIFEST, or (b) record only the final 1 sync to the MANIFEST. (b) is what I meant would special case the feature for inactive WALs, but I hope we do (a) instead.

mittalrishabh commented 6 months ago

@ajkr We are trying to reproduce this issue in our setup. For faster reproduction, instead of doing the full recovery, we are only doing a WAL dump and depending on the "Corruption: checksum mismatch" message to see if it is reproducible.

Based on the missing fsync bug https://github.com/facebook/rocksdb/pull/10560, it seems that it could lead to missing records too. What would be the error message if RocksDB detects missing records?

ajkr commented 6 months ago

There isn't a specific error message for missing records because we don't have a way to detect that in general. (Extra detail about how we're thinking about doing it:) We could start requiring consecutive WALs to have consecutive sequence numbers. To prevent disableWAL=true writes from getting in the way, we could begin every WAL with a dummy entry whose sequence number is consecutive to that of the final record in the previous WAL.

Lacking that, do you have a way to check for holes in the recovered data? Verifying that in our test tooling is actually how I found that bug in the first place - https://rocksdb.org/blog/2022/10/05/lost-buffered-write-recovery.html. I think applications storing consensus logs in RocksDB may be able to detect a hole using their log sequence numbers.
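
For example, an application storing a consensus log could check for gaps roughly like this (a hypothetical sketch; the "raftlog/" key prefix and the DecodeLogIndex helper are assumptions about an application's schema, not anything RocksDB or TiKV provides):

```cpp
#include <cstdint>
#include <memory>

#include <rocksdb/db.h>
#include <rocksdb/slice.h>

// Hypothetical application-level hole detection over a recovered log.
bool HasHoleInRecoveredLog(rocksdb::DB* db,
                           uint64_t (*DecodeLogIndex)(const rocksdb::Slice&)) {
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  bool have_prev = false;
  uint64_t prev_index = 0;
  for (it->Seek("raftlog/"); it->Valid() && it->key().starts_with("raftlog/");
       it->Next()) {
    const uint64_t index = DecodeLogIndex(it->key());
    // Recovered entries should have consecutive indexes; a gap suggests a
    // lost WAL write rather than an ordinary truncated tail.
    if (have_prev && index != prev_index + 1) {
      return true;
    }
    prev_index = index;
    have_prev = true;
  }
  return false;
}
```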

mittalrishabh commented 6 months ago

Got it. In TiDB, writes from multiple Raft logs are asynchronously written to RocksDB, so I don't think we can use them to detect holes. Is there a GitHub issue open for detecting the missing records? I think there could be other solutions to detect it.

ajkr commented 6 months ago

There isn't a specific error message for missing records because we don't have a way to detect that in general

There are some ways that can catch recovery not recovering far enough. By "far enough", I mean recovery should at least reach the last write that completed before the last successful WAL sync started. At the RocksDB level, the main detection would be the "SST file is ahead of WALs" error message. At the application level, this scenario could probably be detected more easily than a hole. I recall in Cockroach it caused some sort of Raft log panic.

nkg- commented 6 months ago

Going back to the original issue: I'm still wondering if the corruption we see is indeed fixed by https://github.com/facebook/rocksdb/pull/10560. It's a bit hard for us to confirm, since we can only repro in production, and that happens rarely (2% of disk snapshots across all storage nodes had a single corrupt WAL).

I have a few questions:

  1. For https://github.com/facebook/rocksdb/pull/10560, what is the trigger scenario for the bug? Does it need an explicit FlushWAL (public API) to be called? For our use case, we just do Put; by default sync=false, and for some critical writes sync=true. Could the race condition between WAL append and WAL switch (I'm assuming the switch is triggered by memtable flush) happen between FlushWAL(true /* sync */)'s sync operations and its call to MarkLogsSynced()?
  2. If there is a missing fsync, then wouldn't it cause a completely missing record? But we always see a partial record. It's not intuitive to me what causes a partial record in a WAL, compared to a missing record in a WAL.
  3. Under what conditions does RocksDB do a WAL swap? Possible reasons: i) the memtable is full, ii) a memtable flush call. Anything else?
  4. In our test environment, we just reproed another corruption (let me know if we should file a separate issue): Corruption detected in log file Corruption: missing start of fragmented record(2). Can you share how it differs from "Corruption: checksum mismatch", and whether it has different trigger conditions? The repro conditions for this corruption were: a) FlushWAL every 100ms, b) reducing the memtable size (500 KB), triggering a memtable flush/WAL swap every 100ms under our write traffic (see the rough sketch after this list). FWIW, the #10560 fix resolved this corruption under the same repro conditions.
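
For reference, the repro conditions in item 4 amount to roughly this kind of workload (an illustrative sketch, not our actual test harness, and not a guaranteed reproducer; in the real repro the process is snapshotted or killed while this is running rather than shut down cleanly):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>
#include <thread>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 512 * 1024;  // tiny memtable -> frequent flush/WAL swap

  rocksdb::DB* db = nullptr;
  if (!rocksdb::DB::Open(options, "/tmp/wal_race_sketch", &db).ok()) {
    return 1;
  }

  std::atomic<bool> stop{false};
  std::thread writer([&] {
    rocksdb::WriteOptions wo;  // sync=false: durability relies on the FlushWAL calls
    for (uint64_t i = 0; !stop.load(); ++i) {
      db->Put(wo, "key" + std::to_string(i), std::string(1024, 'x'));
    }
  });
  std::thread syncer([&] {
    while (!stop.load()) {
      db->FlushWAL(/*sync=*/true);  // syncs and marks WALs outside the write group leader
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  });

  std::this_thread::sleep_for(std::chrono::seconds(10));
  stop.store(true);
  writer.join();
  syncer.join();
  delete db;
  return 0;
}
```
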
nkg- commented 6 months ago

A couple of other questions:

  1. Let's say we have the fix #10560. Now, in the case of manual fsync, is it possible to have a write in a non-active WAL that is not fsynced until FlushWAL(sync=true) or Put(sync=true) is called? Or does the WAL swap (during memtable flush) guarantee that the swapped WAL is fsynced with all its writes?
  2. A more basic question: when the memtable is full and flushed, why do we swap the WAL, as opposed to size-based WAL rotation?

ajkr commented 6 months ago

Going back to the original issue: I'm still wondering if the corruption we see is indeed fixed by #10560. It's a bit hard for us to confirm, since we can only repro in production, and that happens rarely (2% of disk snapshots across all storage nodes had a single corrupt WAL).

I have a few questions:

  1. For Ensure writes to WAL tail during FlushWAL(true /* sync */) will be synced #10560, what is the trigger scenario for the bug? Does it need an explicit FlushWAL (public API) to be called? For our use case, we just do Put; by default sync=false, and for some critical writes sync=true. Could the race condition between WAL append and WAL switch (I'm assuming the switch is triggered by memtable flush) happen between FlushWAL(true /* sync */)'s sync operations and its call to MarkLogsSynced()?

OK, I revisited #10560. I think it cannot be triggered by a workload that simply writes with varying values for WriteOptions::sync. That is because we atomically perform the following steps: (1) appending to the active WAL, (2) syncing the WALs (including inactive ones with unsynced data at the tail), and (3) recording the synced WALs. ((2) and (3) are skipped when WriteOptions::sync == false.) The way those steps are atomic is by doing them all in our write group leader.

#10560's race condition required writes to happen between (2) and (3), causing buggy logic in (3) to incorrectly treat an inactive WAL as fully synced when it actually wasn't. FlushWAL(true /* sync */) does (2) and (3) in a non-write-group-leader thread, which made it vulnerable.

That doesn't mean the workload needs an explicit FlushWAL(true /* sync */) to trigger the bug, however. There are also internal callers of FlushWAL(true /* sync */), and potentially other things internally that do (2) and (3) in a non-write-group-leader thread.

Based on that criteria, other vulnerable APIs I found were SyncWAL(), GetLiveFilesStorageInfo(), CreateNewBackup*(), CreateCheckpoint().

By the way, are you able to share more about what options you use? From the OPTIONS file I'd be interested in manual_wal_flush, enable_pipelined_write, wal_bytes_per_sync, two_write_queues, track_and_verify_wals_in_manifest. If there's other WriteOptions set besides sync that could be interesting too.
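
For what it's worth, those all live on DBOptions and can be read back from the DB's OPTIONS file; a minimal sketch, assuming the long-standing LoadLatestOptions API and an illustrative path:

```cpp
#include <iostream>
#include <vector>

#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <rocksdb/utilities/options_util.h>

int main() {
  rocksdb::DBOptions db_opts;
  std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs;
  // Point this at the DB directory containing the OPTIONS-* file.
  rocksdb::Status s = rocksdb::LoadLatestOptions("/var/lib/rocksdb",
                                                 rocksdb::Env::Default(),
                                                 &db_opts, &cf_descs);
  if (!s.ok()) {
    std::cerr << s.ToString() << std::endl;
    return 1;
  }
  std::cout << "manual_wal_flush=" << db_opts.manual_wal_flush << "\n"
            << "enable_pipelined_write=" << db_opts.enable_pipelined_write << "\n"
            << "wal_bytes_per_sync=" << db_opts.wal_bytes_per_sync << "\n"
            << "two_write_queues=" << db_opts.two_write_queues << "\n"
            << "track_and_verify_wals_in_manifest="
            << db_opts.track_and_verify_wals_in_manifest << std::endl;
  return 0;
}
```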