
Support read-triggered compactions in RocksDB #8004

Open dhruba opened 3 years ago

dhruba commented 3 years ago

Expected behavior

If an application incurs a high number of lookups across multiple levels of the SST tree, it is better to compact that part of the tree so that future lookups do not pay this cost. Compactions triggered by reads would reduce CPU cost as well as storage IOPS and storage space for these applications.

Actual behavior

With today's behavior, suppose the same key range resides in every level of a 7-level SST tree, and suppose no new writes to this key range are occurring. No compaction occurs, and every read request has to look through all 7 levels of data to return a result.

Bloom filters reduce the lookup cost somewhat, but even then the CPU cost of checking for the key in every one of the 7 individual levels is high.
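Until such a trigger exists, the closest workaround is a manual compaction of the hot range through the existing public API. A minimal sketch (the key-range bounds are hypothetical placeholders):

      #include <cassert>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      // Workaround, not the proposed feature: if the application knows which
      // key range is read-hot, it can manually compact just that range so
      // subsequent reads stop probing all 7 levels.
      void CompactHotRange(rocksdb::DB* db) {
        rocksdb::Slice begin("hot_range_start");  // hypothetical bound
        rocksdb::Slice end("hot_range_end");      // hypothetical bound
        rocksdb::CompactRangeOptions opts;
        // Push the result as far down the tree as possible so future Gets
        // find the keys in a single level.
        opts.bottommost_level_compaction =
            rocksdb::BottommostLevelCompaction::kForce;
        rocksdb::Status s = db->CompactRange(opts, &begin, &end);
        assert(s.ok());
      }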

Steps to reproduce the behavior

dhruba commented 3 years ago

History: RocksDB had read-triggered compactions in its very early days to account for disk I/Os: https://github.com/facebook/rocksdb/commit/c1bb32e1ba9da94d9e40af692f60c2c0420685cd. The thinking at that time was mostly to optimize disk seeks, as you can see from this blurb:

      // We arrange to automatically compact this file after
      // a certain number of seeks.  Let's assume:
      //   (1) One seek costs 10ms
      //   (2) Writing or reading 1MB costs 10ms (100MB/s)
      //   (3) A compaction of 1MB does 25MB of IO:
      //         1MB read from this level
      //         10-12MB read from next level (boundaries may be misaligned)
      //         10-12MB written to next level
      // This implies that 25 seeks cost the same as the compaction
      // of 1MB of data.  I.e., one seek costs approximately the
      // same as the compaction of 40KB of data.  We are a little
      // conservative and allow approximately one seek for every 16KB
      // of data before triggering a compaction.
      f->allowed_seeks = (f->file_size / 16384);
      if (f->allowed_seeks < 100) f->allowed_seeks = 100;

This antiquated logic triggered a compaction once the number of storage seeks into a particular file exceeded the allowed_seeks budget computed above.
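For reference, a condensed sketch of how that seek-budget mechanism operated (modeled on the LevelDB-era code quoted above; the struct and function names here are simplified, not actual RocksDB internals):

      #include <cstdint>

      struct FileMetaData {
        uint64_t file_size = 0;
        int allowed_seeks = 0;  // seek budget computed at file creation
      };

      void InitAllowedSeeks(FileMetaData* f) {
        // One "free" seek per 16KB of file data, with a floor of 100.
        f->allowed_seeks = static_cast<int>(f->file_size / 16384);
        if (f->allowed_seeks < 100) f->allowed_seeks = 100;
      }

      // Called when a Get() had to seek into this file without finding the
      // key in an earlier level. Returns true once the file's seek budget
      // is spent, i.e. when the file should be scheduled for compaction.
      bool ChargeSeek(FileMetaData* f) {
        return --f->allowed_seeks <= 0;
      }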

The new logic to trigger compactions on read should take into account not just storage I/Os but also the CPU cost of searching the index of each individual SST file.
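One possible shape for such a trigger, purely as a sketch (none of these names or constants exist in RocksDB): charge each read a CPU cost for every filter/index probe, plus a storage cost when the read actually goes to disk, and mark the file as a compaction candidate once its accumulated read cost exceeds the estimated cost of compacting it.

      #include <atomic>
      #include <cstdint>

      // Illustrative read-cost accounting; all names and constants here are
      // hypothetical, not RocksDB internals.
      struct ReadCostStats {
        std::atomic<uint64_t> accumulated_cost_us{0};
      };

      // Assumed per-operation costs, in microseconds.
      constexpr uint64_t kCpuCostPerProbeUs = 5;   // one index/filter probe
      constexpr uint64_t kIoCostPerSeekUs = 100;   // one storage read

      // Rough compaction cost: ~25 bytes of IO per byte compacted (per the
      // old comment above) at an assumed 100MB/s, i.e. 100 bytes per us.
      uint64_t EstimatedCompactionCostUs(uint64_t file_size) {
        return (file_size * 25) / 100;
      }

      // Charge one read against the file. Returns true once accumulated
      // read cost exceeds the estimated compaction cost, meaning the file
      // should become a compaction candidate.
      bool ChargeRead(ReadCostStats* stats, bool went_to_storage,
                      uint64_t file_size) {
        uint64_t cost =
            kCpuCostPerProbeUs + (went_to_storage ? kIoCostPerSeekUs : 0);
        uint64_t total = stats->accumulated_cost_us.fetch_add(cost) + cost;
        return total >= EstimatedCompactionCostUs(file_size);
      }

Unlike the old allowed_seeks budget, this kind of accounting charges every read that probes a file's filter or index, so a file can become a compaction candidate even when bloom filters keep most reads from ever reaching storage.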