cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.94k stars 458 forks source link

db: lower-level gap heuristic #2366

Open jbowens opened 1 year ago

jbowens commented 1 year ago

Consider a workload where the keyspaces a-f and u-z are write heavy, while the keyspace g-t is read only and heavy. Incoming writes to a-f and u-z keyspaces can easily be included in the same sstable, spanning the read-only g-t keyspace. These sstables unnecessarily increase the read amplification of reads in the g-t keyspace.

L2 [─•─────────────•──────────────────────────────────────────•─•─] [•──••─•─••─]
L3 [──•────•────•──•]                                            [─••─────────••]
L4 [•──•─•]  [•──•────────────────────────────────────────────•──•─•─]   [─••───]
L5 [••] [••] [•─•─•────────────────────────────────────────────•]  [••─•─•─]  [•]
L6 [•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•]
    a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z

Ideally, we'd avoid constructing sstables spanning the read-only region so that only the regions receiving writes suffer read-amp:

L2 [─•─────────────•]                                        [•─•─] [•──••─•─••─]
L3 [──•────•────•──•]                                            [─••─────────••]
L4 [•──•─•]  [•──•]                                          [•──•─•─]   [─••───]
L5 [••] [••] [•─•─•]                                          [•]  [••─•─•─]  [•]
L6 [•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•][•]
    a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z

User-defined "guards" #517 are one approach to this problem, but shift the compaction-output splitting problem to the user. Setting too aggressive of guards can result in many little files in the LSM.

There might be an opportunity for a heuristic that incorporates lower level file boundaries into the compaction iterator's frontiers type, and notices when the next key skips over a large gap in lower levels. When a large enough gap is encountered, the output can be split early.

See cockroachdb/cockroach#93427 for another approach at reducing the effective read-amp in this kind of scenario.

Jira issue: PEBBLE-202

Epic CRDB-40361

sumeerbhola commented 1 year ago

This can help with write amp too.

petermattis commented 1 month ago

There might be an opportunity for a heuristic that incorporates lower level file boundaries into the compaction iterator's frontiers type, and notices when the next key skips over a large gap in lower levels. When a large enough gap is encountered, the output can be split early.

Could use sstable.Reader.EstimateDiskUsage to decide when a gap is "large" vs "small", and use virtual sstable slicing to avoid re-writing an L6 sstable when only a portion of it is overlapping with a higher-level. There is some overlap with this idea and https://github.com/cockroachdb/pebble/issues/518 (generalized "move" compactions).