cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

db: SeekPrefixGE lazy positioning, or GetPrefix #2002

Open jbowens opened 1 year ago

jbowens commented 1 year ago

I have not thought through the design space here in detail, but it seems possible to use MVCC metadata about sstables (e.g., the computed block properties) to avoid reading files that contain only older versions of a key during a SeekPrefixGE. The goal would be to reduce block reads during MVCCGets, making MVCCGet's performance profile more similar to that of a plain Pebble Get.

A design, just to serve as an illustrative example: the sstable iterator SeekPrefixGE returns a synthetic @<sstable's max mvcc timestamp> key if the bloom filter indicates the file may contain the key, deferring the file's block loads until that iterator actually reaches the top of the merging heap.

If we elevated MVCC timestamps into the *fileMetadata, it seems like we could even avoid some of the bloom filter reads and table loads.
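For concreteness, a minimal sketch of the lazy-positioning idea in Go. None of these names exist in Pebble: fileMeta, lazyIter, and the key@timestamp encoding are hypothetical stand-ins for *fileMetadata, the sstable iterator, and CockroachDB's MVCC keys. Each level answers SeekPrefixGE with a synthetic key built from in-memory metadata only, and real I/O happens only for the iterator whose synthetic key wins the merge:

```go
package main

import "fmt"

// fileMeta stands in for *fileMetadata with the file's max MVCC
// timestamp elevated into it (hypothetical, per the proposal above).
type fileMeta struct {
	name  string
	maxTS uint64
}

// lazyIter defers all I/O (bloom filter probe, index/data block loads)
// until materialize is called.
type lazyIter struct {
	meta   fileMeta
	loaded bool
}

// seekPrefixGE returns a synthetic "<prefix>@<maxTS>" key without
// touching the file; the merging iterator can order it in its heap.
func (it *lazyIter) seekPrefixGE(prefix string) (string, uint64) {
	it.loaded = false
	return fmt.Sprintf("%s@%d", prefix, it.meta.maxTS), it.meta.maxTS
}

// materialize performs the deferred work: in a real implementation,
// the bloom filter check followed by index and data block reads.
func (it *lazyIter) materialize() {
	if !it.loaded {
		fmt.Printf("loading %s: bloom filter probe, index and data blocks\n", it.meta.name)
		it.loaded = true
	}
}

func main() {
	files := []fileMeta{
		{name: "L0-000123.sst", maxTS: 90},
		{name: "L5-000045.sst", maxTS: 40},
		{name: "L6-000007.sst", maxTS: 12},
	}
	iters := make([]*lazyIter, len(files))
	best, bestTS := -1, uint64(0)
	for i, f := range files {
		iters[i] = &lazyIter{meta: f}
		key, ts := iters[i].seekPrefixGE("foo")
		fmt.Printf("%s positioned at synthetic key %s\n", f.name, key)
		if ts > bestTS {
			best, bestTS = i, ts
		}
	}
	// Only the file whose synthetic key would surface first pays for I/O.
	iters[best].materialize()
}
```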

Somewhat related to #2182.

Jira issue: PEBBLE-142

jbowens commented 1 year ago

Added the consider-23.2 label because this, together with avoiding bloom filter block reads (by elevating the MVCC properties into the fileMetadata), seems especially beneficial with disaggregated storage, where each new table read requires reading from remote storage.

sumeerbhola commented 1 year ago

I'm a bit confused by

the sstable iterator SeekPrefixGE returns a synthetic @<sstable's max mvcc timestamp> key if the bloom filter indicates the file may contain the key

since I thought this is trying to avoid reading the bloom filter of that file. Maybe I am misreading the sentence. Is the idea here that many levels will have files that overlap with the prefix, but the max MVCC timestamp in most of these files will be smaller than the actual largest MVCC timestamp for that prefix, so these files will never rise to the top of the heap and so we will never need to read their bloom filters? Nice idea! It should help when the version being read was written recently, which I believe is more common for workloads with huge tables full of old data (exactly the ones where the bloom filter is most likely to miss the cache).
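A toy model of that heap argument, with hypothetical names (entry, byTS): synthetic keys sort newest-timestamp-first, mirroring how MVCC versions of a prefix are ordered, so a file whose max MVCC timestamp is below the version actually returned never surfaces, and its bloom filter is never read:

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry is an sstable positioned at a synthetic key derived from its
// max MVCC timestamp; no I/O has been performed for it yet.
type entry struct {
	file  string
	maxTS uint64
}

// byTS orders entries newest-first, mirroring MVCC key ordering
// within a single prefix (descending timestamp).
type byTS []entry

func (h byTS) Len() int           { return len(h) }
func (h byTS) Less(i, j int) bool { return h[i].maxTS > h[j].maxTS }
func (h byTS) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *byTS) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *byTS) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

func main() {
	h := &byTS{{"L0.sst", 90}, {"L5.sst", 40}, {"L6.sst", 12}}
	heap.Init(h)
	// Only the heap top pays for a bloom filter probe and block reads.
	// If it yields a live version (say foo@87), every file still in
	// the heap has maxTS < 87 and is excluded without any I/O.
	top := heap.Pop(h).(entry)
	fmt.Printf("materialize %s; never touched: %v\n", top.file, *h)
}
```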

jbowens commented 1 year ago

since I thought this is trying to avoid reading the bloom filter of that file.

Ah yeah, when I started writing the issue, I was thinking we'd perform the bloom filter check before returning the synthetic key, so we'd only be saving the block loads on the tables that the filter fails to exclude. But I think it makes more sense to reverse the order and try to avoid the bloom filter check as well.
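A sketch of that reversed ordering, again with hypothetical names (table, bloomMayContain, shouldRead): the in-memory max-timestamp check is free, so it runs first, and the bloom filter probe, which may require a filter-block read, runs only for files that could still contain the newest version:

```go
package main

import "fmt"

// table carries the max MVCC timestamp bound the discussion proposes
// elevating into the file metadata (hypothetical fields).
type table struct {
	name  string
	maxTS uint64
}

// bloomMayContain stands in for the bloom filter probe, which can
// cost a filter-block read (from remote storage, under disaggregation).
func bloomMayContain(t table, prefix string) bool {
	fmt.Printf("bloom probe on %s (filter block read)\n", t.name)
	return true
}

// shouldRead checks the free in-memory timestamp bound first and falls
// through to the bloom filter only when the file could still hold the
// newest version of prefix.
func shouldRead(t table, prefix string, newestSeenTS uint64) bool {
	if t.maxTS < newestSeenTS {
		return false // only older versions here; skip the filter probe
	}
	return bloomMayContain(t, prefix)
}

func main() {
	fmt.Println(shouldRead(table{"L6.sst", 12}, "foo", 87)) // false, no probe
	fmt.Println(shouldRead(table{"L0.sst", 90}, "foo", 87)) // probes the filter
}
```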

Is the idea here that many levels will have files that overlap with the prefix, but the max MVCC timestamp in most of these files will be smaller than the actual largest MVCC timestamp for that prefix, so these files will never rise to the top of the heap and so we will never need to read their bloom filters?

Yeah, that's right.

jbowens commented 8 months ago

This may be helpful for #3230 for workloads that frequently read and update recently written keys. It would shift the working set towards the higher (and smaller) levels of the LSM, making it more likely that a read finds everything it needs already in either the block cache or the OS page cache.