cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.88k stars 449 forks source link

db: potentially trade write amp for space amp during compactions through vsstables #2752

Open bananabrick opened 1 year ago

bananabrick commented 1 year ago

During some compaction from a level i to a level i+1, the file f from level i will overlap with some files from level i+1.

Usually, there are also some files in level i+1 which will only partially overlap with the file f.

Consider the following case:

File in level i, f1: [c,f].
File in level i + 1, f2: [a, b, c, d], f3: [e, g, h, i].

The file f1 overlaps with both the files f2 and f3. Moving the file f1 to to level i+1, will have a write cost of 9 bytes. For the 8 keys in f1 and f2, and we're adding a new key f.

But there's an opportunity here to virtualize f3. f1 only overlaps with a single key e in f3. We could convert f3 into a virtual sstable f5: [g, h, i] as part of the compaction version edit. If we do this, we'll only have to write 6 bytes as part of the compaction.

We should only perform this virtualization if the data in f3, which f1 overlaps with is significantly less that the total size of f1.

Future compactions will further whittle down the file f5, without incurring a write cost for keys which they do not overlap with.

To prevent space amp from blowing up, we could make sure that as these virtual sstables get small compared to the original physical sstable size, we compact them. I believe this heuristic has already been implemented as part of the existing virtual sstable work. We could also make it less likely that these virtual sstables are created as the disk utilization increases.

Next Steps The first step would be to create a metric to determine what percentage of write amps are due to this partial overlap during compactions.

Jira issue: PEBBLE-62

Epic CRDB-40361

sumeerbhola commented 1 year ago

I think this idea is similar to the original text in https://github.com/cockroachdb/pebble/issues/2156#issue-1471536094. In that the virtualization was happening at the higher level (there are tradeoffs). That issue was closed since @jbowens found an improved grandparent splitting heuristic that results in better alignment https://github.com/cockroachdb/pebble/issues/2156#issuecomment-1398846061

It is possible that there is further room for improvement via such virtualization -- we should quantify that first.