cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License

perf: compaction improvements meta issue #552

Open · petermattis opened this issue 4 years ago

petermattis commented 4 years ago

There are various proposals for improvements to compaction heuristics, policies, and mechanics. This issue is an entry point to all of those sub-issues.

Generalized "Move" Compactions

A compaction normally takes all of its input tables and constructs a set of new output tables. There is an exception to this behavior: move compactions take the input tables and "move" them (via metadata operations) directly to the output level. Move compactions can only be performed if there are no overlapping tables in the output level. The two issues below both play with the idea of generalizing move compactions so that we can reuse input sstables and thus significantly lower the IO for certain compactions.

#518: perf: generalized "move" compactions #389: perf: skip output level sstables with no data overlap with input level
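The legality check for a move compaction can be sketched as follows. This is a simplified illustration, not Pebble's actual code: `fileMetadata`, `overlaps`, and `canMove` are hypothetical stand-ins for the manifest types, and keys are plain strings.

```go
package main

import "fmt"

// fileMetadata is a simplified stand-in for a manifest file entry
// (hypothetical fields for illustration).
type fileMetadata struct {
	Smallest, Largest string // user-key bounds of the sstable
}

// overlaps reports whether two key ranges intersect.
func overlaps(a, b fileMetadata) bool {
	return a.Smallest <= b.Largest && b.Smallest <= a.Largest
}

// canMove reports whether an input table can be "moved" to the output
// level via a metadata-only edit: legal only when no table in the
// output level overlaps its key range.
func canMove(input fileMetadata, outputLevel []fileMetadata) bool {
	for _, f := range outputLevel {
		if overlaps(input, f) {
			return false
		}
	}
	return true
}

func main() {
	out := []fileMetadata{{"a", "c"}, {"m", "p"}}
	fmt.Println(canMove(fileMetadata{"d", "f"}, out)) // no overlap: move is legal
	fmt.Println(canMove(fileMetadata{"b", "e"}, out)) // overlaps [a,c]: must rewrite
}
```

Generalizing this means applying the same overlap test per input table rather than to the compaction as a whole, so the non-overlapping subset can be reused even when the rest must be rewritten.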

L0 Partitioning / User-defined Partitions

When a set of memtables is flushed to L0, only a single sstable is created. This leads to a problem: an L0 sstable often overlaps a significant portion (sometimes all!) of the sstables in Lbase, which in turn causes L0->Lbase compactions to starve Lbase->Lbase+1 compactions. One solution to this problem is to partition L0 sstables. As described in #517, the intention is to provide a callback so that users can request a split point between two user-keys. In addition to the in-progress PRs below, we'd need to change the L0 invariant checks and enhance the L0 compaction picking heuristics.

#517: perf: support user-defined "guards" for sstable boundaries #546: internal/manifest: add Version.L0Sublevels #536: db: add support for user-defined sstable partitioning

Multi-level Compactions

Compactions currently involve two levels: a start level and an output level. (The slight nuance to this statement is that each sstable from L0 is considered a distinct level.) The idea behind multi-level compactions is to avoid adding more compaction work in the future: if we're compacting to level N, we can estimate the resulting increase in the size of N, and if it would exceed its size threshold, we should instead compact to level N+1. The balanced-rent-or-buy compaction strategy described in #48 takes this to an extreme and demonstrates much better write amplification than other compaction strategies in simulation.

#136: perf: investigate N-level compactions (N > 2) #48: perf: better compaction heuristics/policy
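The output-level estimate described above can be sketched as a simple loop. The function name, the byte-count inputs, and the per-level thresholds are all hypothetical; real heuristics would also account for the bytes that leave level N during the compaction.

```go
package main

import "fmt"

// pickOutputLevel sketches the multi-level idea: if compacting
// inputBytes into level n would push it past its size threshold, widen
// the compaction to target n+1 instead, avoiding a follow-up n->n+1
// compaction. levelBytes and maxBytes are hypothetical per-level sizes
// and thresholds, indexed by level.
func pickOutputLevel(n int, inputBytes int64, levelBytes, maxBytes []int64) int {
	for n < len(levelBytes)-1 && levelBytes[n]+inputBytes > maxBytes[n] {
		// Level n would overflow; include it as an input and push the
		// output down one more level.
		n++
	}
	return n
}

func main() {
	levelBytes := []int64{0, 90, 500, 4000}  // current size per level
	maxBytes := []int64{0, 100, 1000, 10000} // size threshold per level
	fmt.Println(pickOutputLevel(1, 50, levelBytes, maxBytes)) // L1 would overflow: output to 2
	fmt.Println(pickOutputLevel(2, 50, levelBytes, maxBytes)) // fits in L2: stays at 2
}
```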

"Notched" Multi-output-level Compactions

In addition to allowing inputs from multiple levels, a compaction can write outputs to multiple levels. This idea was suggested in the context of addressing the L0->Lbase compaction problem.

#136: perf: investigate N-level compactions (N > 2)

Read-triggered Compactions

In Pebble, compactions are currently only triggered due to writes. The deficiency with this trigger is that the LSM will settle into a shape which is optimized for handling future writes, but not for reads. In a read-mostly workload, we would like to perform additional compactions to reduce the number of levels in order to speed up reads. LevelDB contained a read-based compaction trigger that did exactly this.

#29: perf: read-based compaction heuristic
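For reference, LevelDB's trigger works roughly as sketched below: each file gets a seek budget proportional to its size, a read that probes a file without finding its key charges the budget, and an exhausted budget makes the file a compaction candidate. The constants mirror LevelDB's `version_set.cc` (one seek per 16KB, floor of 100), though the Go types here are illustrative.

```go
package main

import "fmt"

// file models the per-sstable state LevelDB keeps for its read-based
// compaction trigger (simplified for illustration).
type file struct {
	sizeBytes    int64
	allowedSeeks int64
}

// newFile assigns a seek budget: one allowed seek per 16KB of file,
// with a floor of 100 (LevelDB's constants).
func newFile(size int64) *file {
	f := &file{sizeBytes: size, allowedSeeks: size / 16384}
	if f.allowedSeeks < 100 {
		f.allowedSeeks = 100
	}
	return f
}

// recordSeek charges one wasted seek and reports whether the file
// should now be scheduled for compaction.
func (f *file) recordSeek() bool {
	f.allowedSeeks--
	return f.allowedSeeks <= 0
}

func main() {
	f := newFile(2 << 20) // 2MB file: 128 allowed seeks
	var triggered bool
	for i := 0; i < 128; i++ {
		triggered = f.recordSeek()
	}
	fmt.Println(triggered) // budget exhausted: compaction triggered
}
```

The rationale behind the constants: one seek costs roughly the same as compacting ~16KB of data, so once a file has wasted that much work in seeks, compacting it away pays for itself.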

Range-tombstone Compaction Heuristics

The heuristics for deciding which file within a level to compact give no special consideration to range tombstones. A range tombstone can delete a large swath of data, making it desirable to quickly compact sstables containing range tombstones, both to reclaim space and to avoid rewriting data lying beneath the range tombstone that will only be deleted by a later compaction.

#319: db: incorporate range tombstones into compaction heuristics
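One way to fold tombstones into the picking heuristic is to inflate a file's score by the bytes its range tombstones are estimated to cover in lower levels. The weighting below is hypothetical, not Pebble's actual heuristic:

```go
package main

import "fmt"

// scoreFile sketches biasing compaction picking toward files carrying
// range tombstones: bytes in lower levels covered by a file's range
// tombstones are treated as part of the file's effective size, so a
// small sstable that deletes a large swath scores highly and is
// compacted early. Hypothetical weighting for illustration only.
func scoreFile(fileBytes, estimatedCoveredBytes int64) float64 {
	// Reclaiming the covered bytes is the benefit of compacting now;
	// rewriting fileBytes is the cost.
	return float64(fileBytes+estimatedCoveredBytes) / float64(fileBytes)
}

func main() {
	fmt.Println(scoreFile(1<<20, 0))      // plain 1MB file: score 1
	fmt.Println(scoreFile(1<<20, 64<<20)) // tombstones covering 64MB: score 65
}
```

The hard part in practice is the `estimatedCoveredBytes` term, which requires estimating how much data in lower levels falls within each tombstone's key range.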

Deletion-only Compactions

If an sstable is entirely covered by a range tombstone in a higher level and the range of seqnums in the sstable and the range tombstone are in the same snapshot stripe, the sstable can be deleted. This is better than a normal compaction as the sstable does not have to be read.

#720: perf: investigate deletion-only compactions
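The two conditions (full key-range containment and a shared snapshot stripe) can be sketched as below. All of the types and names here are simplified stand-ins, not Pebble's internals, and the containment check assumes simple string user-keys with an exclusive tombstone end key.

```go
package main

import "fmt"

type keyRange struct{ start, end string }

type seqnumRange struct{ min, max uint64 }

// sameStripe reports whether no open snapshot seqnum separates the two
// seqnum ranges, i.e. both fall between the same pair of snapshots.
func sameStripe(a, b seqnumRange, snapshots []uint64) bool {
	lo, hi := a.min, b.max
	if b.min < lo {
		lo = b.min
	}
	if a.max > hi {
		hi = a.max
	}
	for _, s := range snapshots {
		if lo < s && s <= hi {
			return false
		}
	}
	return true
}

// canDeleteWithoutReading sketches the deletion-only check: the sstable
// can be dropped by a metadata-only edit if its key range is entirely
// contained in a newer range tombstone from a higher level and no
// snapshot separates their seqnums, so no reader can observe the keys.
func canDeleteWithoutReading(sst keyRange, sstSeq seqnumRange,
	tomb keyRange, tombSeq seqnumRange, snapshots []uint64) bool {
	contained := tomb.start <= sst.start && sst.end < tomb.end
	return contained && tombSeq.min > sstSeq.max &&
		sameStripe(sstSeq, tombSeq, snapshots)
}

func main() {
	sst, tomb := keyRange{"b", "d"}, keyRange{"a", "m"}
	fmt.Println(canDeleteWithoutReading(sst, seqnumRange{10, 20}, tomb, seqnumRange{30, 30}, nil))          // deletable
	fmt.Println(canDeleteWithoutReading(sst, seqnumRange{10, 20}, tomb, seqnumRange{30, 30}, []uint64{25})) // snapshot splits the stripes
}
```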

Pipeline Decompression and Compression During Compaction

Block decompression and compression show up prominently in CPU profiles taken during compaction. We could speed up the wall time of compactions by parallelizing this work.

#599: perf: pipeline compression during compaction #1722: perf: pipeline decompression during compaction

Jira issue: PEBBLE-176

github-actions[bot] commented 2 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!

github-actions[bot] commented 7 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!

petermattis commented 7 months ago

Worth keeping this meta issue around as well as the sub-issues it links to.