Fix Checkpoint hard link of inactive but unsynced WAL

pdillinger commented 4 months ago

Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the logs_ list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in #10083, it would be synced if using track_and_verify_wals_in_manifest=true.)
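To illustrate why a hard link inherits the durability state of the original file, here is a small standalone sketch using only std::filesystem (no RocksDB code). The file names and the helper HardLinkSharesData are made up for illustration. It shows that the link and the original are the same underlying file: bytes appended through one name are visible through the other, so nothing about the link is "more synced" than the WAL it points at.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical helper, not part of RocksDB.
struct LinkDemoResult {
  std::uintmax_t link_count;     // number of names referring to the file
  std::uintmax_t size_via_link;  // size observed through the hard link
};

LinkDemoResult HardLinkSharesData(const fs::path& dir) {
  fs::create_directories(dir);
  const fs::path wal = dir / "000012.log";              // stand-in for a WAL
  const fs::path link = dir / "checkpoint_000012.log";  // stand-in for the checkpoint's link
  fs::remove(wal);
  fs::remove(link);

  {
    std::ofstream out(wal, std::ios::binary);
    out << "record-1";  // 8 bytes written before the link exists
  }

  fs::create_hard_link(wal, link);

  {
    // Append through the original name after the link was created.
    std::ofstream out(wal, std::ios::binary | std::ios::app);
    out << "record-2";  // 8 more bytes
  }

  LinkDemoResult result{fs::hard_link_count(link), fs::file_size(link)};
  fs::remove(wal);
  fs::remove(link);
  return result;
}
```

Calling HardLinkSharesData(fs::temp_directory_path() / "wal_link_demo") on a file system that supports hard links should report two names for the file and a size that includes the bytes appended after the link was made.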

The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically:

Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with trim_to_size=true from GetLiveFilesStorageInfo()). And while we're at it, use our known flushed sizes for those WALs.
To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See #10185
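The decision logic described in the points above can be sketched roughly as follows. This is a simplified model, not the code in this PR: WalState, WalPlan, PlanWalsForCheckpoint, and all field names are hypothetical stand-ins for the metadata DBImpl tracks. Only the idea is taken from the description: WALs above the recorded maximum needed WAL number are ignored, and any WAL that might not be fully synced is reported with trim_to_size=true and its known flushed size.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of per-WAL metadata held in memory.
struct WalState {
  uint64_t number;        // WAL file number
  uint64_t flushed_size;  // bytes handed to the file system so far
  uint64_t synced_size;   // bytes known to be durable
  bool is_active;         // true for the WAL currently receiving appends
};

// Hypothetical description of how a checkpoint should take one WAL.
struct WalPlan {
  uint64_t number;
  uint64_t size;      // size to report for the file
  bool trim_to_size;  // true: copy exactly `size` bytes; false: hard link is OK
};

std::vector<WalPlan> PlanWalsForCheckpoint(const std::vector<WalState>& wals,
                                           uint64_t max_needed_wal_number) {
  std::vector<WalPlan> plan;
  for (const WalState& w : wals) {
    // WALs created after the consistent point was chosen are not needed;
    // skipping them avoids racing with a later directory listing.
    if (w.number > max_needed_wal_number) {
      continue;
    }
    // Only an inactive WAL whose flushed bytes are all synced may be linked.
    const bool fully_synced = !w.is_active && w.synced_size >= w.flushed_size;
    plan.push_back({w.number, w.flushed_size, !fully_synced});
  }
  return plan;
}
```

In this model, a WAL copied with trim_to_size=true ends up as a separate file that the checkpoint can sync on its own, which is what lets the checkpoint be fully synced on return without syncing the source WAL.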

Stacked on #12729

Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency.

pdillinger commented 4 months ago

Grr, unresolved issues showing up in crash test.

facebook-github-bot commented 4 months ago

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger commented 4 months ago

Passed 3 hours of blackbox_crash_test with amplified checkpoint and backup, so should be good for review

pdillinger commented 4 months ago

> Why don't we sync these WAL files and then hard link them?

Hmm, you're probably right that I could have done SyncClosedWals before the final GetSortedWalFiles, but I believe the approach in this PR is a step closer to the goal of not relying on filesystem queries for WAL info in GetLiveFilesStorageInfo.

Also, GetLiveFilesStorageInfo() can be used aside from Checkpoint and Backup, e.g. for statistical purposes. Under that broader set of purposes, it should be minimally blocking when asked not to flush memtable.

facebook-github-bot commented 4 months ago

@pdillinger merged this pull request in facebook/rocksdb@98393f0139fc9529d5c56a4b43bc7a245b22f734.

Fix Checkpoint hard link of inactive but unsynced WAL #12731