facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
28.39k stars 6.29k forks source link

Fix Checkpoint hard link of inactive but unsynced WAL #12731

Closed pdillinger closed 4 months ago

pdillinger commented 4 months ago

Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the logs_ list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in #10083, it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically:

Stacked on #12729

Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency.

pdillinger commented 4 months ago

Grr, unresolved issues showing up in crash test.

facebook-github-bot commented 4 months ago

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger commented 4 months ago

Passed 3 hours of blackbox_crash_test with amplified checkpoint and backup, so should be good for review

pdillinger commented 4 months ago

Why don't we sync these WAL files and then hard link them?

Hmm, you're probably right that I could have done SyncClosedWals before the final GetSortedWalFiles, but I believe the approach in this PR is a step closer to the goal of not relying on filesystem queries for WAL info in GetLiveFilesStorageInfo.

Also, GetLiveFilesStorageInfo() can be used aside from Checkpoint and Backup, e.g. for statistical purposes. Under that broader set of purposes, it should be minimally blocking when asked not to flush memtable.

facebook-github-bot commented 4 months ago

@pdillinger merged this pull request in facebook/rocksdb@98393f0139fc9529d5c56a4b43bc7a245b22f734.