cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.74k stars 435 forks source link

db: NextPrefix (and similar) iterator optimizations #2090

Open jbowens opened 1 year ago

jbowens commented 1 year ago

Creating an issue to track the NextPrefix optimization and related optimizations. There are a few avenues of exploration:

  1. When the user wants to proceed to the next prefix (eg, a MVCC scan) during forward iteration, the internal iterator on the top of the merging heap is known to be positioned at the old prefix. We can pass down the iterator stack the knowledge that we'd like this level's iterator to move to the next prefix. If this internal iterator is an sstable iterator, we may be able to use the block layout's prefix sharing to infer that subsequent keys share the same prefix. This requires care to properly handle byte prefixes that encode longer, unequal logical prefixes. See #1387 for an old exploration of this idea.
  2. When advancing the iterator to a proximate key in the forward direction (eg, either the next prefix, or more generally a key that we have high confidence would not be too distant), it may not be necessary to re-seek every level of the iterator. Today, a seek always reseeks every level and then "fixes up" the merging heap with a O(n)-comparison heapify operation. If we know the seek key should be nearby, it's likely a small proportion of the levels actually need to be repositioned. Instead, we could reposition the top of the heap, restore heap order and repeat if necessary. This approach requires O(k log n) comparisons, where k is the number of times we need to reposition a level. This is the approach #1860 attempts in implementing NextPrefix. This approach might be generalizable to a wider set of cases (eg, SeekGE w/ TrySeekUsingNext) with a heuristic about when to apply which approach.
  3. Key comparisons that attempt to determine whether a key is a different prefix can try comparing suffixes first. If the next key's suffix is > the previous suffix, then the next key must have a different prefix. This may be a helpful optimization when keys are larger. It's unclear how fruitful this is in practice.

Jira issue: PEBBLE-129

github-actions[bot] commented 4 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!