cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.91k stars 455 forks source link

perf: snapshot keyspan bounds #1810

Open jbowens opened 2 years ago

jbowens commented 2 years ago

Currently, a snapshot always applies universally across all keys in the database. In CockroachDB, snapshots are used to preserve state within the context of a single range. An LSM snapshot constructed to read range r1 still prevents removal of obsolete keys in range r2.

We could extend NewSnapshot to allow supplying a predicate p(k)→bool that configures the snapshot to only snapshot keys for which p returns true. During compaction and flushes, when a snapshot appears to produce a new snapshot stripe within the same key, p(k) is consulted and a new stripe is produced only if p(k)→true. This would allow overwritten keys in a non-snapshotted CockroachDB range to be dropped while preserving overwritten keys in a snapshotted CockroachDB range.

There's still the question of what to do when an iterator constructed through Snapshot.NewIter reads a key for which the predicate p(k)→false:

  1. Do nothing—the result of iterating over keys that do not match p(k) is undefined. The caller must be careful to never use iteration results outside the predicate. If the predicate is defined over swaths of the key space, this may be achieved through setting iterator bounds.
  2. Filter—during iteration, before returning, test p(k), skipping the key if p(k)→false.
  3. Read at higher sequence number—when the iterator is constructed from the snapshot, record two sequence numbers: the current visible sequence number on the database, and the snapshot's sequence number. User keys for which p(k)→true are filtered at the snapshot's sequence number. User keys for which p(k)→false are filtered at the database's visible sequence number.

I expect limiting the scope of active snapshots would reduce write amplification, in particular during periods of heavy rebalancing where there are open LSM snapshots and replicas are being simultaneously removed. Replica removal lays down range deletions, but those range deletions are unable to drop the replica's data. Compaction of these range deletions is still prioritized, because wide range deletions force ingested sstables into higher levels. The result is we suffer unnecessary write amplification moving the removed replica's data and the range tombstone into L6.

If we are to tackle this, I think we might want to expose a very limited interface, at least from the CockroachDB pkg/storage package that meets our specific snapshot usages. This can help avoid the possibility of reading unshapshotted keys while under the impression of reading through a consistent snapshot.

The amount of write amplification saved is still unknown. Adding metrics for the size of obsolete keys preserved during compactions (#1204) would help us prioritize.

Jira issue: PEBBLE-127

sumeerbhola commented 2 years ago

Can we make this less general, and limit predicates to a single key-span, and check that all iterator bounds are limited to that key-span?

jbowens commented 2 years ago

I think the fragmentation of a range's various key spaces (range-id, range-local, etc) force us to snapshot multiple key spans together.

sumeerbhola commented 2 years ago

I think the fragmentation of a range's various key spaces (range-id, range-local, etc) force us to snapshot multiple key spans together.

Ah yes. We could still have it be explicitly represented as a set of spans, yes?

jbowens commented 2 years ago

Yeah, for sure