facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
28.44k stars 6.29k forks

checkpoint flush changes WAL file metadata with recycle_log_file_num enabled #8283

Open shoda-tibco opened 3 years ago

shoda-tibco commented 3 years ago

Expected behavior

With recycle_log_file_num enabled, creating a checkpoint that flushes the WAL should use a recycled WAL file and should not change the file size of the old WAL file. Changing the size undoes the performance benefit of recycling the WAL, because the file metadata must again be synced after every write.

Actual behavior

After taking a checkpoint that flushes the current WAL, the old WAL appears to have its file metadata modified, which changes the file size. Thereafter, we observe that sync writes to that WAL file must sync metadata with each write, reducing performance until that WAL is recycled again.
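The size-stability property described here can be illustrated outside RocksDB. The following is only a minimal sketch using the C++17 standard library, not RocksDB code: it shows that overwriting bytes inside a file that already has its full length leaves the size unchanged, while an append grows it. The function names are hypothetical and chosen for illustration.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Create (or replace) a file of exactly `size` zero bytes. This stands in for
// a recycled WAL file whose full length already exists on disk.
inline void CreateSizedFile(const fs::path& path, std::size_t size) {
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  const std::string zeros(size, '\0');
  out.write(zeros.data(), static_cast<std::streamsize>(zeros.size()));
}

// Overwrite bytes starting at `offset` without truncating the file.
// Returns the file size after the stream has been closed.
inline std::uintmax_t OverwriteInPlace(const fs::path& path,
                                       std::streamoff offset,
                                       const std::string& data) {
  {
    std::fstream f(path, std::ios::binary | std::ios::in | std::ios::out);
    f.seekp(offset);
    f.write(data.data(), static_cast<std::streamsize>(data.size()));
  }  // stream is flushed and closed here
  return fs::file_size(path);
}

// Append bytes to the end of the file.
// Returns the file size after the stream has been closed.
inline std::uintmax_t AppendToFile(const fs::path& path,
                                   const std::string& data) {
  {
    std::ofstream f(path, std::ios::binary | std::ios::app);
    f.write(data.data(), static_cast<std::streamsize>(data.size()));
  }  // stream is flushed and closed here
  return fs::file_size(path);
}
```

In this sketch, an in-place overwrite within the existing length keeps the size constant, which is the case where a data-only sync would suffice; any operation that changes the length is the case the report associates with extra metadata syncing.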

Steps to reproduce the behavior

Initially reproduced with v6.2.4 though the behavior appears to be the same on HEAD.

Write to the DB until two WAL files of nearly identical size are present on disk. Observe that new writes don't change the size of the WAL files.

Take a checkpoint with log_size_for_flush=0. Observe that the older WAL file has its file size modified on disk, and that new write throughput decreases until the size is restored.

LOG file: https://gist.github.com/shoda-tibco/44e80f4603aa745b30f825ff517dc5e9

The actual re-writing of the WAL file size appears to occur as part of:

EnableFileDeletions calling job_context.Clean() which in turn does:

      for (auto l : logs_to_free) {
        delete l;
      }

which ultimately winds up calling WritableFileWriter::Flush(), after which the metadata is re-written.

zhichao-cao commented 3 years ago

cc @ajkr You may know more about it?