cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: improve single delete safety #114492

Open jbowens opened 10 months ago

jbowens commented 10 months ago

@sumeerbhola and I were talking in the office about what we can do to protect/assert against SINGLEDEL misuse or invariant violations, in light of a test cluster's replica divergence that could plausibly be explained by SINGLEDEL misuse. During intent resolution we have an iterator walking through the lock table. When we decide to use a SINGLEDEL, we could conceivably (with some Pebble interface contortions) next the internal iterator once to confirm that there are no additional internal keys within the LSM. We'd add a small regression to intent resolution in the name of safety and quality. The interface contortions to do this would be ugly, and we might still miss keys above the iterator's sequence number or keys that commit later but before our batch commits.

This has me wondering whether we might be better off relying wholly on Pebble to make the determination of whether to persist a DEL or a SINGLEDEL. Today we only use SINGLEDELs for intents because it’s hard to ensure that we’ll never write the same Pebble user key twice, even if the vast majority of the time we’ll only write once. Because of this we don’t get any of the SINGLEDEL benefits for raft log truncation or MVCC garbage collection in low GC TTL ranges.

We could make the decision to transform a DEL into a SINGLEDEL at flush time, if heuristics indicate there are enough DELs to be worth the overhead of reading the lower LSM state. This is a nontrivial additional cost to flushes, but we save write amplification and compaction CPU time whenever a resulting SINGLEDEL is successfully removed before it reaches L6.
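As a concrete sketch of what such a heuristic might weigh (the function name and both thresholds are hypothetical, not Pebble's API), the flush could consider the absolute tombstone count and the tombstone fraction of the flushed data:

```go
package main

// shouldRelaxDeletes is a hypothetical flush-time heuristic: it reports
// whether a flush with numTombstones point tombstones out of numKeys
// total keys contains enough DELs to justify the extra read I/O of
// inspecting the LSM beneath each one. minTombstones and minFraction
// are illustrative tuning knobs, not real Pebble options.
func shouldRelaxDeletes(numTombstones, numKeys, minTombstones int, minFraction float64) bool {
	if numTombstones < minTombstones {
		// Too few tombstones for the savings to matter at all.
		return false
	}
	// Only pay the read cost when tombstones are a meaningful share
	// of the flush.
	return float64(numTombstones) >= minFraction*float64(numKeys)
}
```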

Informs cockroachdb/cockroach#114421.

Jira issue: CRDB-33530

blathers-crl[bot] commented 10 months ago

Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jbowens commented 10 months ago

The biggest obstacle to an automatic single delete optimization performed entirely within the storage engine is the need to perform reads. Writes are always blind within the storage engine. In order to determine how many keys a tombstone deletes, the engine would need to read the full LSM at each tombstone. The memtable tends to be as wide as the entire engine keyspace, which in the worst case means every tombstone would require loading a disjoint set of n sstables, where n is the read amplification.

But in CockroachDB, we actually have significant locality among delete tombstones. There are three primary sources of deletes, each receiving a roughly equal share of all the deletes in the system:

  1. The lock table is a high-churn region of the keyspace. All of the KV ranges' lock table keyspaces are adjacent, which at the engine level means the lock table forms one contiguous keyspace.
  2. The raft log is another high-churn region, and tombstones are written during log truncation. The KV ranges' raft logs are not contiguous, because there are other unreplicated RangeID-prefixed keys interspersed, but the interspersed keys should be relatively low volume.
  3. MVCC GC builds a batch deleting keys within a single KV range at a time. Since the resulting batch is constrained to a single KV range, the batch has high locality. With 512 MB ranges and 128 MB L6 sstables, we'd expect one of these batches to overlap ~5 sstables in L6.
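A quick back-of-envelope check of the arithmetic in point 3, using the sizes from the text (the helper is purely illustrative):

```go
package main

// expectedOverlap estimates how many fixed-size L6 sstables a
// contiguous key span of spanBytes overlaps: the number of whole files
// the span covers, plus one for straddling a file boundary. This
// ignores compression and non-uniform file sizes.
func expectedOverlap(spanBytes, fileBytes int64) int64 {
	return spanBytes/fileBytes + 1
}
```

With 512 MB ranges and 128 MB L6 sstables this gives 5, matching the ~5 files above.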

If Pebble were to perform reads to transparently identify tombstones that may be relaxed into SINGLEDELs, where would it do so? During a flush is viable, but adding additional read I/O to the flush risks further constraining an existing bottleneck: in some high-write-throughput experiments with high-bandwidth disks, we've observed consistently high flush utilization.

Alternatively, we could perform these reads in the background, while building the current mutable memtable.

A sketch:

Batches record the count of point tombstones they receive. At memtable application time, if point tombstones make up a sufficiently large portion of the batch's contents, the batch records the arena offsets spanning its memtable nodes and signals on a condition variable.

cockroachdb/pebble#3097 would avoid the above arena offsets capturing many more nodes than belong to the specific batch.
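A minimal sketch of the batch-side bookkeeping described above; all of the types, fields, and the 50% threshold are hypothetical stand-ins, not Pebble's actual structures:

```go
package main

import "sync"

// deleteBatch is a hypothetical record of a tombstone-heavy batch: its
// key counts plus the arena offsets spanning its memtable nodes.
type deleteBatch struct {
	numKeys, numTombstones int
	startOffset, endOffset uint32
}

// deleteQueue is the queue of delete batches consumed by a background
// goroutine associated with the memtable.
type deleteQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue []deleteBatch
}

func newDeleteQueue() *deleteQueue {
	q := &deleteQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// maybeEnqueue records b and signals the background goroutine if point
// tombstones make up at least half of the batch's keys (an arbitrary
// threshold for illustration). It reports whether the batch was queued.
func (q *deleteQueue) maybeEnqueue(b deleteBatch) bool {
	if b.numKeys == 0 || 2*b.numTombstones < b.numKeys {
		return false
	}
	q.mu.Lock()
	q.queue = append(q.queue, b)
	q.mu.Unlock()
	q.cond.Signal()
	return true
}
```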

The above batch metadata forms a queue of "delete batches" for a goroutine associated with the memtable to process. The goroutine inspects the smallest and largest keys of each queued delete batch (by accessing the memtable nodes at the recorded offsets), and uses them to find overlapping files within the LSM beneath the batch. Heuristics (?) determine whether the batch's contents are sufficiently dense to motivate reading the lower LSM. If they are, the goroutine uses an internal iterator constructed over the entirety of the LSM beneath the memtable, including the memtable itself, to determine the number of internal keys beneath each tombstone within the "delete batch": 0, 1 or 2+. It atomically updates a field on the KV's arenaskl.Node.
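The per-tombstone read might classify the number of keys beneath as 0, 1, or 2+ and publish the result atomically on the node, roughly as follows. This is a sketch: a sorted slice stands in for the internal iterator, and the hint field and its encoding are hypothetical additions to arenaskl.Node:

```go
package main

import "sync/atomic"

// node stands in for an arenaskl.Node with a hypothetical hint field:
// 0 = unknown, 1 = zero keys beneath, 2 = exactly one key beneath,
// 3 = two or more keys beneath.
type node struct {
	userKey string
	hint    atomic.Uint32
}

// classifyBeneath counts internal keys for userKey in the LSM beneath
// the batch, stopping at 2 since 2+ are all treated the same.
// lowerKeys is a stand-in for an internal iterator over those levels.
func classifyBeneath(lowerKeys []string, userKey string) uint32 {
	n := uint32(0)
	for _, k := range lowerKeys {
		if k == userKey {
			if n++; n == 2 {
				break
			}
		}
	}
	return n + 1 // shift by one so zero remains "unknown"
}

// annotate publishes the classification on the node. A concurrent
// compaction that misses the update simply keeps the ordinary DEL,
// which is always safe.
func annotate(nd *node, lowerKeys []string) {
	nd.hint.Store(classifyBeneath(lowerKeys, nd.userKey))
}
```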

At compaction time, the additional knowledge retained by the memtable's goroutine is surfaced on the base.InternalKV, allowing the compaction iterator to mutate key kinds or elide tombstones. The memtable's goroutine would be racing with the compaction, but there's no correctness problem if a compaction fails to observe the goroutine's metadata. We would need to ensure we don't free the memtable until the memtable goroutine has exited.
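On the consuming side, the compaction iterator's use of the hint might look like the following sketch. The key kinds and hint values are illustrative stand-ins (a real implementation would use Pebble's base.InternalKeyKind), and an "unknown" hint conservatively keeps the ordinary DEL:

```go
package main

// Hypothetical key kinds and hint values mirroring the 0/1/2+
// classification published by the memtable goroutine.
const (
	kindDelete = iota
	kindSingleDelete
)

const (
	hintUnknown     = iota // goroutine never reached this node
	hintZeroBeneath        // tombstone deletes nothing beneath it
	hintOneBeneath         // exactly one key beneath
	hintManyBeneath        // two or more keys beneath
)

// applyHint returns the kind the compaction should emit for a key and
// whether the tombstone can be elided outright. Unknown or many-keys
// hints conservatively keep the ordinary DEL; non-DEL kinds pass
// through unchanged. Snapshot interactions are ignored in this sketch.
func applyHint(kind, hint int) (outKind int, elide bool) {
	if kind != kindDelete {
		return kind, false
	}
	switch hint {
	case hintZeroBeneath:
		// Nothing beneath: the tombstone is a no-op and can be dropped.
		return kindDelete, true
	case hintOneBeneath:
		// Exactly one key beneath: safe to relax into a SINGLEDEL.
		return kindSingleDelete, false
	default:
		return kindDelete, false
	}
}
```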

There is still a lot of ambiguity in the above:

  1. What would a reasonable heuristic on when to perform the LSM read look like?
  2. There are likely opportunities to reduce the cost of the LSM read. For example, if we used a getIter-style iterator and haven't found a version of a key beneath a tombstone by the time we reach L6, it may not be worth reading L6, since a single delete that must be compacted into L6 to drop anything is no better than an ordinary delete tombstone.
  3. Would it be worthwhile to transform DELs into appropriately-sized DELSIZEDs during reads that find multiple internal keys, or reads that find a single key in L6?
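The early-stop idea in point 2 could be sketched as a top-down walk that skips L6 entirely when nothing has been found above it. Levels represented as sorted key slices are a stand-in for a getIter-style lookup; the function is hypothetical:

```go
package main

// keysBeneath walks levels top-down (the last slice standing in for
// L6), counting versions of userKey and capping the count at 2. If
// nothing has been found by the time L6 is reached, it skips the L6
// read: a SINGLEDEL that must be compacted into L6 to drop anything is
// no better than an ordinary DEL, so the read isn't worth it.
func keysBeneath(levels [][]string, userKey string) (count int, readL6 bool) {
	for i, lvl := range levels {
		if i == len(levels)-1 && count == 0 {
			// Nothing above L6: keep the ordinary DEL, skip the read.
			return 0, false
		}
		for _, k := range lvl {
			if k == userKey {
				if count++; count == 2 {
					return count, true
				}
			}
		}
	}
	return count, true
}
```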

The above is complicated, but arguably no more so than the KV transaction invariants we currently rely on.

jbowens commented 10 months ago

Informs https://github.com/cockroachdb/cockroach/issues/8979