facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

Some checkpoints cannot be opened with `kAbsoluteConsistency` WAL recovery mode #12670

Open andlr opened 3 months ago

andlr commented 3 months ago

Expected behavior

The database can be opened from a checkpoint with `wal_recovery_mode=kAbsoluteConsistency`.

Actual behavior

Due to a few data races, the active WAL file sometimes gets copied in an inconsistent state. Opening the database then fails with one of these errors when `wal_recovery_mode=kAbsoluteConsistency`:

Steps to reproduce the behavior

Initially I wrote this heavy and flaky test, which sometimes reproduces this issue:


```cpp
TEST_F(CheckpointTest, WalCorruption) {
  Options options = CurrentOptions();
  options.wal_recovery_mode = WALRecoveryMode::kAbsoluteConsistency;

  Reopen(options);

  const auto threads_num = 32;
  const auto checkpoints_to_create = 200;
  std::atomic<int> thread_num(0);
  std::vector<port::Thread> threads;
  port::RWMutex mutex;
  bool finished = false;

  std::function<void()> write_func = [&]() {
    int a = thread_num.fetch_add(1);
    bool stop_worker = false;

    while (!stop_worker) {
      for (auto i = 0; i < 10000; ++i) {
        std::string key = "foo" + std::to_string(a) + "_" + std::to_string(i);
        ASSERT_OK(Put(key, "bar"));
      }

      mutex.ReadLock();
      stop_worker = finished;
      mutex.ReadUnlock();
    }
  };

  for (auto i = 0; i < threads_num; ++i) {
    threads.emplace_back(write_func);
  }

  std::vector<std::string> snapshot_names;
  for (auto i = 0; i < checkpoints_to_create; ++i) {
    const auto snapshot_name =
        test::PerThreadDBPath(env_, "snap_" + std::to_string(i));
    std::unique_ptr<Checkpoint> checkpoint;
    Checkpoint* checkpoint_ptr;
    ASSERT_OK(Checkpoint::Create(db_, &checkpoint_ptr));
    checkpoint.reset(checkpoint_ptr);

    ASSERT_OK(checkpoint->CreateCheckpoint(snapshot_name));
    snapshot_names.push_back(snapshot_name);
  }

  mutex.WriteLock();
  finished = true;
  mutex.WriteUnlock();

  for (auto& t : threads) {
    t.join();
  }

  Close();

  options.skip_stats_update_on_db_open = true;
  options.skip_checking_sst_file_sizes_on_db_open = true;
  options.max_open_files = 10;

  for (const auto& snapshot_name : snapshot_names) {
    DB* snapshot_db = nullptr;
    ASSERT_OK(DB::Open(options, snapshot_name, &snapshot_db));
    ASSERT_OK(snapshot_db->Close());
    delete snapshot_db;
  }
}
```

But I've also written more precise unit tests using sync points, so I'll include them in my PR with a suggested fix.

Conditions to reproduce are:

This happens because the size of the active WAL file is captured at an arbitrary moment: