Open petermattis opened 4 years ago
One implementation idea is to implement an altered iterator interface (an interface other than internalIterator
) in levelIter
and mergingIter
. In addition to surfacing point keys, this interface could surface keys of interest, like sstable boundaries. We already jump through a few hoops to surface boundaries of interest from levelIter
to mergingIter
today (eg, synthetic iter boundaries, sstable boundaries, etc), and this new interface would make that explicit.
For implementing generalized 'move' compactions, the merging iterator could be configured to surface start boundaries to the compaction iterator. When the compaction encounters a start boundary, the start boundary's levelIter
is at the top of the mergingIter
heap. The compaction loop could ask the mergingIter
whether the next key in all of the other levelIter
s in the heap is greater than than the end boundary for the levelIter
at the top of the heap. If yes, the compaction just stepped into a file that overlaps with no other files and may be 'moved' (or left in place, if it's an output-level file #389).
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!
See https://github.com/cockroachdb/pebble/issues/517 for background on sstable boundary signals. Consider the same scenario mentioned in that issue:
In this scenario, L0 only contains two keys:
a
andz
. Currently, a compaction from L0 to L1 will end up rewriting all of the sstables in L1. This is quite wasteful as we really only need to rewrite 2 sstables. The rest could be reused. Pebble already supports the notion of a "move" compaction which is performed if a compaction from Ln to Lm involves only a single input table. In that case, the input table is simply moved to the output table level: a pure metadata operation that is very fast. The generalization of "move" compactions is to recognize when the output records for a compaction are identical to the records in a single input table.I have a vague sense of how this could be done. We could create separate
levelIters
on the input tables and seek on them to see where there are gaps in the key space of the inputs. In the scenario here, we'd create aninputIter[0]
andinputIter[1]
. When thecompactionIter
steps into one of the L1 input tables, we'd check theinputIter[0]
to see if there are any keys which overlap in L0. If there are, we proceed with compaction until the next L1 input table. If there is no overlap with keys in L0, we "move" the L1 input table directly to output. Reminder that we need to check for both point and range tombstone overlap.One downside to moving sstables like this, is that we lose the snapshot and tombstone processing that takes place in
compactionIter
. This is only a problem if we're actually moving an sstable from L0 to L1.I don't see any fundamental obstacles to this mechanism, though the compaction logic is already dauntingly complex. As a prerequisite, we need to figure out better abstractions, and to break up the already too complex
DB.runCompaction
method.Cc @sumeerbhola, @itsbilal
Jira issue: PEBBLE-177