cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.74k stars 435 forks source link

compaction: use bloom filters to elide point tombstones earlier #2245

Open jbowens opened 1 year ago

jbowens commented 1 year ago

It seems feasible to develop a heuristic that sometimes uses lower-level files' bloom filters to elide point tombstones earlier if the bloom filters indicate a key is not present. When successful in proving a tombstone is no longer needed, it would prevent the write amplification of writing them to disk + the CPU overhead of nexting through the tombstones during reads. A heuristic could enable this for a compaction only if the inputs have many point tombstones according to their sstable properties and the compaction overlaps sufficiently few files in lower levels.

Bloom filters are defined over prefixes only, so this does not seem beneficial in regions of the keyspace with many versions of the same key. This might be able to be addressed by restricting this optimization to compactions overlapping a keyspan (eg, the lock table, the raft log) or with a heuristic consulting a sstable property on lower-levelled sstables containing the number of unique prefixes within sstable (to compare versus the total entries, to get a sense of the number of duplicates).

Jira issue: PEBBLE-155

jbowens commented 1 year ago

Consulting the bloom filter could also be restricted to when a tombstone drops a key during the compaction. If we have the expectation that each tombstone typically drops 1 key, the same compaction that drops a shadowed key is likely the one that can elide the point tombstone.

sumeerbhola commented 1 year ago

This might be able to be addressed by restricting this optimization to compactions overlapping a keyspan (eg, the lock table, the raft log)

btw, we already try to use SingleDelete for the lock table. And I suspect with a bit of tracking in KV we could do the same for the raft log. Also, even if we don't use SingleDelete on the raft log, the separation of the raft log into a separate LSM (which I assume will happen within the next couple of releases) will make that LSM very shallow -- possibly with Lbase==L6 (we may even tune the Lbase target size to be higher to cause this shallowness, and increase the number of sublevels in L0 permitted before compactions).