facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org

Proposal: Mmap WAL log and save key value offsets in MemTable #11835

Open rockeet opened 11 months ago

rockeet commented 11 months ago

The WAL normally uses standard (buffered) IO, so data is first written to the page cache; the same data is also written to the MemTable, and when the MemTable is full it is flushed to an L0 SST.

This scheme wastes resources by keeping the same data in three places:

  1. Data in WAL page cache
  2. Data in MemTable
  3. Data in L0 sst (extra IO, file space, page cache memory and BlockCache memory)

An optimized solution would work as follows (a code sketch follows the list):

  1. mmap the WAL file read-only, after truncating it to an initial size (e.g. 4 GB)
    • reserving this space consumes neither memory nor disk/SSD space, thanks to virtual memory and sparse files (a sparse file tail is widely supported and stable)
    • after a process crash, the logical EOF can be determined from checksums
    • the file size never changes, so fdatasync does not have to update file metadata
  2. Write data to the WAL with standard IO (buffered writes); writing through the mmap should be forbidden
  3. KV entries in the MemTable are just pointers (offsets of the KV data in the WAL file)
  4. When a WAL is full (e.g. exceeds 4 GB), convert the WAL file into an SST file:
    • Append a KV index (offsets of the KV data) and an SST footer with a special type (handled by a dedicated TableFactory)
    • This not only reduces memory usage but also cuts IO & CPU, especially when the MemTable content is relocatable (using relative offsets instead of pointers, so it can be dumped directly)
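A minimal sketch (not RocksDB code) of how steps 1-3 could look on Linux: the WAL is pre-truncated to a sparse file of the reserved size, mapped read-only, and written through a separate file descriptor with ordinary buffered IO; a MemTable entry then only carries an offset/length into the mapping. The names here (`MmapWal`, `OpenMmapWal`, `ValueRef`) are hypothetical.

```cpp
// Hypothetical sketch, not RocksDB API: reserve a sparse WAL, map it
// read-only, and keep a separate fd for ordinary buffered writes.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

struct MmapWal {
  int write_fd = -1;          // used for normal buffered write()/fdatasync()
  const char* map = nullptr;  // read-only view used by MemTable lookups
  size_t reserved = 0;
};

bool OpenMmapWal(const char* path, size_t reserved, MmapWal* wal) {
  int fd = ::open(path, O_CREAT | O_RDWR, 0644);
  if (fd < 0) return false;
  // Truncating up creates a sparse tail: no disk blocks are allocated yet,
  // and the file size stays fixed, so fdatasync need not touch metadata.
  if (::ftruncate(fd, static_cast<off_t>(reserved)) != 0) return false;
  // PROT_READ only: all writes go through write_fd, never through the map.
  void* addr = ::mmap(nullptr, reserved, PROT_READ, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED) return false;
  wal->write_fd = fd;
  wal->map = static_cast<const char*>(addr);
  wal->reserved = reserved;
  return true;
}

// A MemTable entry would then store only a reference into wal->map
// instead of a copy of the value.
struct ValueRef {
  uint64_t offset;
  uint32_t length;
};
```

Reads simply dereference `wal.map + ref.offset`; since the WAL pages were just written, they are almost certainly still in the page cache.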

On our branch, we have implemented a simplified version of this solution: convert the MemTable to an SST instead of flushing it. This yields many improvements but is still not ideal.

On top of our relocatable patricia trie, MemTable ConvertToSST just appends a customized file footer, wraps the MemTable file as an SST file (with a custom TableFactory), and renames the MemTable file to an SST file.
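A minimal sketch of that convert-instead-of-copy step, with an illustrative footer layout (the actual branch uses its own TableFactory and footer format, which are not shown here):

```cpp
// Hypothetical sketch of convert-instead-of-flush: append a KV index and a
// custom footer to the existing file, then rename it to an SST file.
// The index layout and magic number are illustrative only.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct IndexEntry {   // one entry per KV record already in the file
  uint64_t offset;    // where the record starts
  uint32_t key_len;
  uint32_t value_len;
};

constexpr uint64_t kConvertedMagic = 0x434f4e565253544cULL;  // illustrative

bool ConvertToSst(const std::string& memtable_file,
                  const std::vector<IndexEntry>& index) {
  std::FILE* f = std::fopen(memtable_file.c_str(), "ab");
  if (!f) return false;
  std::fseek(f, 0, SEEK_END);
  const uint64_t index_offset = static_cast<uint64_t>(std::ftell(f));
  // Append the index so a reader can locate records without rewriting data.
  std::fwrite(index.data(), sizeof(IndexEntry), index.size(), f);
  // Append a fixed-size footer so a custom table reader can find the index.
  const uint64_t num_entries = index.size();
  std::fwrite(&index_offset, sizeof(index_offset), 1, f);
  std::fwrite(&num_entries, sizeof(num_entries), 1, f);
  std::fwrite(&kConvertedMagic, sizeof(kConvertedMagic), 1, f);
  std::fclose(f);
  // Rename in place: no data copy, so the "flush" writes only index + footer.
  const std::string sst_name = memtable_file + ".sst";
  return std::rename(memtable_file.c_str(), sst_name.c_str()) == 0;
}
```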

Reusing the WAL page cache is much more complex than our MemTable ConvertToSST solution, because it involves many RocksDB changes (one challenge is that multiple CFs share the same WAL; can the blob file manager be reused?).

wolfkdy commented 10 months ago
  1. Seems like a good optimization for write-only or write-heavy workloads.
  2. Perhaps not friendly for read-latency-sensitive workloads, because reads from the memtable and immutable memtables are handed off to mmap, which is a black box. mmap seems to have been abandoned in modern database designs.
  3. If the WAL is treated as a blob file rather than an mmaped file, there are still some differences to resolve:
    3.1. A blob file is static: it is created and sealed in a Flush/CompactionJob and is never reopened for writes, so the current (alive) WAL-blob would behave differently.
    3.2. Still not friendly for general-purpose workloads. To eliminate read IO, a blob written into the WAL-blob would have to be inserted into the block cache immediately after being written, which in turn does not reduce memory usage.
    3.3. With a careful implementation, it should be a great option for write-heavy workloads.
rockeet commented 10 months ago

Mmap is not evil; RocksDB also has PosixMmapReadableFile. The WAL and L0 files contain hot data that is very likely still in the page cache, so mmap read latencies are predictable.

Our simplified implementation (MemTable ConvertToSST) has already shown the advantages: it optimizes both reads and writes.

ConvertToSST does not reduce IO write bandwidth if the MemTable SST is synced to disk. But even without syncing (these files are compacted and deleted very soon, and ext4 does not write back dirty pages of deleted files), a process crash still causes no harm.

Blob files being static is really a minor issue; with a careful design it should not be a blocker.