Open jbowens opened 2 years ago
Can we make this less general, and limit predicates to a single key-span, and check that all iterator bounds are limited to that key-span?
I think the fragmentation of a range's various key spaces (range-id, range-local, etc.) forces us to snapshot multiple key spans together.
Ah yes. We could still have it be explicitly represented as a set of spans, yes?
Yeah, for sure
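The set-of-spans representation discussed above might look something like the following sketch. The `Span` type and `keyInSpans` helper are hypothetical, for illustration only, and are not Pebble's actual API:

```go
package main

import (
	"bytes"
	"fmt"
)

// Span is a half-open key span [Start, End). Hypothetical type.
type Span struct {
	Start, End []byte
}

// Contains reports whether key falls within the half-open span.
func (s Span) Contains(key []byte) bool {
	return bytes.Compare(s.Start, key) <= 0 && bytes.Compare(key, s.End) < 0
}

// keyInSpans implements a predicate represented explicitly as a set of
// spans: p(k) → true iff k falls within any of the snapshotted spans.
func keyInSpans(key []byte, spans []Span) bool {
	for _, s := range spans {
		if s.Contains(key) {
			return true
		}
	}
	return false
}

func main() {
	// A range's fragmented key spaces could be covered by multiple spans
	// snapshotted together (illustrative keys only).
	spans := []Span{
		{Start: []byte("/Local/RangeID/1/"), End: []byte("/Local/RangeID/2/")},
		{Start: []byte("/Table/50/1/a"), End: []byte("/Table/50/1/m")},
	}
	fmt.Println(keyInSpans([]byte("/Table/50/1/c"), spans)) // inside the second span
	fmt.Println(keyInSpans([]byte("/Table/99/1/c"), spans)) // outside both spans
}
```

Representing the predicate as an explicit span set also makes it checkable against iterator bounds, which matters for the bounds question raised below.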
Currently, a snapshot always applies universally across all keys in the database. In CockroachDB, snapshots are used to preserve state within the context of a single range. An LSM snapshot constructed to read range r1 still prevents removal of obsolete keys in range r2.
We could extend `NewSnapshot` to allow supplying a predicate `p(k) → bool` that configures the snapshot to only snapshot keys for which `p` returns true. During compactions and flushes, when a snapshot would otherwise force a new snapshot stripe within the same user key, `p(k)` is consulted and a new stripe is produced only if `p(k) → true`. This would allow overwritten keys in a non-snapshotted CockroachDB range to be dropped while preserving overwritten keys in a snapshotted CockroachDB range.

There's still the question of what to do when an iterator constructed through
`Snapshot.NewIter` reads a key for which the predicate `p(k) → false`. Some options:

1. The result is undefined. The caller must be careful to never use iteration results outside the predicate. If the predicate is defined over swaths of the key space, this may be achieved by setting iterator bounds.
2. The iterator consults `p(k)`, skipping the key if `p(k) → false`.
3. User keys for which `p(k) → true` are filtered at the snapshot's sequence number; user keys for which `p(k) → false` are filtered at the database's visible sequence number.

I expect limiting the scope of active snapshots would reduce write amplification, in particular during periods of heavy rebalancing, when there are open LSM snapshots and replicas are simultaneously being removed. Replica removal lays down range deletions, but while a snapshot is open those range deletions are unable to drop the removed replica's data. Compaction of these range deletions is still prioritized, because wide range deletions force ingested sstables into higher levels. The result is unnecessary write amplification from moving the removed replica's data and the range tombstones down into L6.
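The compaction-time behavior described above can be sketched as follows. All types and the `separatedByStripe` helper are hypothetical simplifications, not Pebble's actual internals:

```go
package main

import "fmt"

// Predicate is the hypothetical p(k) → bool supplied to NewSnapshot.
type Predicate func(key []byte) bool

type snapshot struct {
	seqNum uint64
	pred   Predicate // nil means the snapshot applies to all keys
}

// separatedByStripe reports whether an open snapshot forces two versions of
// the same user key (newer at newerSeq, older at olderSeq) into separate
// snapshot stripes, preventing the older version from being dropped. A
// predicate-limited snapshot only separates keys for which p(k) → true.
func separatedByStripe(key []byte, newerSeq, olderSeq uint64, snaps []snapshot) bool {
	for _, s := range snaps {
		if olderSeq < s.seqNum && s.seqNum <= newerSeq {
			if s.pred == nil || s.pred(key) {
				return true // the older version must be preserved
			}
		}
	}
	return false // the older version is obsolete and may be elided
}

func main() {
	// A snapshot at seqnum 100, hypothetically scoped to keys below "r2".
	snaps := []snapshot{{
		seqNum: 100,
		pred:   func(k []byte) bool { return string(k) < "r2" },
	}}
	// Overwritten key in a snapshotted range: preserved.
	fmt.Println(separatedByStripe([]byte("r1/a"), 150, 50, snaps))
	// Overwritten key in a non-snapshotted range: droppable despite the snapshot.
	fmt.Println(separatedByStripe([]byte("r2/a"), 150, 50, snaps))
}
```

The second call illustrates the write-amplification win: a removed replica's overwritten keys outside every snapshot's predicate no longer need to be rewritten down the LSM.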
If we are to tackle this, I think we might want to expose a very limited interface, at least from the CockroachDB `pkg/storage` package, that meets our specific snapshot usages. This would help avoid the possibility of reading unsnapshotted keys while under the impression of reading through a consistent snapshot.

The amount of write amplification saved is still unknown. Adding metrics for the size of obsolete keys preserved during compactions (#1204) would help us prioritize.
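A minimal sketch of such a narrow interface, assuming a single-span scope and a toy in-memory implementation (all names here are hypothetical, not the actual `pkg/storage` API):

```go
package main

import (
	"errors"
	"fmt"
)

// ScopedSnapshot is a hypothetical narrow interface that a pkg/storage-style
// wrapper might expose: reads are only permitted inside the snapshotted
// spans, so callers cannot accidentally observe unsnapshotted keys.
type ScopedSnapshot interface {
	// Get returns the snapshotted value for key, or an error if key lies
	// outside the snapshot's spans.
	Get(key []byte) ([]byte, error)
	Close()
}

var errOutOfScope = errors.New("key outside snapshotted spans")

// memScopedSnapshot is a toy in-memory implementation for illustration.
type memScopedSnapshot struct {
	lo, hi string            // the single snapshotted span [lo, hi)
	data   map[string][]byte // frozen contents of the span
}

func (m *memScopedSnapshot) Get(key []byte) ([]byte, error) {
	k := string(key)
	if k < m.lo || k >= m.hi {
		return nil, errOutOfScope // refuse reads outside the scope
	}
	return m.data[k], nil
}

func (m *memScopedSnapshot) Close() {}

func main() {
	var s ScopedSnapshot = &memScopedSnapshot{
		lo: "a", hi: "m",
		data: map[string][]byte{"b": []byte("v1")},
	}
	defer s.Close()
	v, err := s.Get([]byte("b"))
	fmt.Println(string(v), err) // in-scope read succeeds
	_, err = s.Get([]byte("z"))
	fmt.Println(err) // out-of-scope read is rejected
}
```

Surfacing the error at the interface boundary, rather than trusting callers to stay within bounds, is what makes option 1 above (undefined results outside the predicate) safe to adopt underneath.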
Jira issue: PEBBLE-127