cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.8k stars 444 forks source link

perf: generalized "move" compactions #518

Open petermattis opened 4 years ago

petermattis commented 4 years ago

See https://github.com/cockroachdb/pebble/issues/517 for background on sstable boundary signals. Consider the same scenario mentioned in that issue:

L0 a------------------------z
L1 ab de gh jk mn pq st vw yz

In this scenario, L0 only contains two keys: a and z. Currently, a compaction from L0 to L1 will end up rewriting all of the sstables in L1. This is quite wasteful as we really only need to rewrite 2 sstables. The rest could be reused. Pebble already supports the notion of a "move" compaction which is performed if a compaction from Ln to Lm involves only a single input table. In that case, the input table is simply moved to the output table level: a pure metadata operation that is very fast. The generalization of "move" compactions is to recognize when the output records for a compaction are identical to the records in a single input table.

I have a vague sense of how this could be done. We could create separate levelIters on the input tables and seek on them to see where there are gaps in the key space of the inputs. In the scenario here, we'd create an inputIter[0] and inputIter[1]. When the compactionIter steps into one of the L1 input tables, we'd check the inputIter[0] to see if there are any keys which overlap in L0. If there are, we proceed with compaction until the next L1 input table. If there is no overlap with keys in L0, we "move" the L1 input table directly to output. Reminder that we need to check for both point and range tombstone overlap.

One downside to moving sstables like this, is that we lose the snapshot and tombstone processing that takes place in compactionIter. This is only a problem if we're actually moving an sstable from L0 to L1.

I don't see any fundamental obstacles to this mechanism, though the compaction logic is already dauntingly complex. As a prerequisite, we need to figure out better abstractions, and to break up the already too complex DB.runCompaction method.

Cc @sumeerbhola, @itsbilal

Jira issue: PEBBLE-177

jbowens commented 2 years ago

One implementation idea is to implement an altered iterator interface (an interface other than internalIterator) in levelIter and mergingIter. In addition to surfacing point keys, this interface could surface keys of interest, like sstable boundaries. We already jump through a few hoops to surface boundaries of interest from levelIter to mergingIter today (eg, synthetic iter boundaries, sstable boundaries, etc), and this new interface would make that explicit.

For implementing generalized 'move' compactions, the merging iterator could be configured to surface start boundaries to the compaction iterator. When the compaction encounters a start boundary, the start boundary's levelIter is at the top of the mergingIter heap. The compaction loop could ask the mergingIter whether the next key in all of the other levelIters in the heap is greater than than the end boundary for the levelIter at the top of the heap. If yes, the compaction just stepped into a file that overlaps with no other files and may be 'moved' (or left in place, if it's an output-level file #389).

github-actions[bot] commented 8 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!