cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: process MVCC range tombstones during range splits/merges #70413

Closed erikgrinaker closed 2 years ago

erikgrinaker commented 3 years ago

In #70412, we will implement MVCC range deletion tombstones, recorded in a range-local part of the key space. When a range is split or merged, these tombstones must also be split or merged as appropriate. We should add MVCC primitives to carry out the necessary tombstone changes, and call these as appropriate during splits/merges.
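A minimal sketch of what such a split primitive might look like, using hypothetical, simplified types rather than the actual MVCC code:

```go
package main

import "fmt"

// Hypothetical types for illustration only; they do not match CockroachDB's
// actual MVCC code.
type RangeTombstone struct {
	StartKey, EndKey string // tombstone covers [StartKey, EndKey)
	Timestamp        int64  // MVCC timestamp of the deletion
}

// splitRangeTombstones assigns each tombstone to the left- or right-hand
// range and rewrites any tombstone straddling splitKey as two halves. A
// merge would do the inverse: concatenate both sides and coalesce adjacent
// tombstones that share a timestamp.
func splitRangeTombstones(ts []RangeTombstone, splitKey string) (left, right []RangeTombstone) {
	for _, t := range ts {
		switch {
		case t.EndKey <= splitKey:
			left = append(left, t)
		case t.StartKey >= splitKey:
			right = append(right, t)
		default: // straddles the split point
			left = append(left, RangeTombstone{t.StartKey, splitKey, t.Timestamp})
			right = append(right, RangeTombstone{splitKey, t.EndKey, t.Timestamp})
		}
	}
	return left, right
}

func main() {
	ts := []RangeTombstone{{"a", "m", 10}, {"k", "z", 20}}
	left, right := splitRangeTombstones(ts, "n")
	fmt.Println(left, right) // [{a m 10} {k n 20}] [{n z 20}]
}
```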

See also the (currently internal) design document.

Epic CRDB-2624

Jira issue: CRDB-10063

tbg commented 3 years ago

This has a hairy edge case if the number of MVCC rangedels becomes large, as we are basically forced to rewrite them in a single raft command (if we do it as part of the split), potentially leading to a giant (and perhaps too-large-for-raft) split trigger. Especially if we "break down" the requested MVCC range deletion during evaluation (as I think we are planning to do), the existence of many MVCC range tombstones has to be expected in practice.
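To make the size concern concrete, here is a back-of-the-envelope check (illustrative numbers and names only, not actual limits or code) for whether such a rewrite fits in one raft command:

```go
package main

import "fmt"

// If the split trigger has to rewrite every MVCC range tombstone in a single
// raft command, the total rewrite must fit under the raft command size limit,
// which is what forces either a cap on the tombstone count or a pre-rewrite
// step.
const maxRaftCommandBytes = 64 << 20 // assumed ~64 MiB command size limit

func rewriteFitsInOneCommand(numTombstones, avgEncodedTombstoneBytes int) bool {
	// Each tombstone is rewritten as two halves by the split.
	return 2*numTombstones*avgEncodedTombstoneBytes <= maxRaftCommandBytes
}

func main() {
	fmt.Println(rewriteFitsInOneCommand(1000, 100))    // true: ~200 KB
	fmt.Println(rewriteFitsInOneCommand(1000000, 100)) // false: ~200 MB
}
```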

We could work around the problem by placing a limit on the number of MVCC rangedels. When that limit is exceeded, we compile all future rangedels down to point deletions. This might be the pragmatic solution, but it does introduce an opaque behavior change if hit.

An alternative is pre-rewriting the tombstones to reflect the split. That is, before we even start the split txn, we spend some time rewriting the range deletions in-place (through raft, because consistency checker). Then the split "presumably" has only a few of them left to split. But... perhaps not, who knows when a burst came in. I'm not advocating for this option - a lot of complexity that will rarely be exercised - but it's an option.

cc @erikgrinaker @sumeerbhola

erikgrinaker commented 3 years ago

Good point. I'll have to look into the split/merge machinery, but we need to handle this.

When that limit is exceeded, we compile all future rangedels down to point deletions.

I'm super-wary of these kinds of performance cliffs after the issues with ranged intent resolution. Let's try to find a solution that's isolated to splits and doesn't affect other operations.

sumeerbhola commented 3 years ago

An alternative is pre-rewriting the tombstones to reflect the split. That is, before we even start the split txn, we spend some time rewriting the range deletions in-place (through raft, because consistency checker). Then the split "presumably" has only a few of them left to split. But... perhaps not, who knows when a burst came in.

Can we add some state at the start of split processing that ensures that DeleteRanges written after the split starts are themselves split before writing? That would prevent this burst.
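A minimal sketch of that idea, with hypothetical names and none of the real replica machinery:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical sketch, not CockroachDB code: the replica records a pending
// split key before the split transaction starts. Any DeleteRange evaluated
// afterwards consults it and writes its range tombstone pre-split at that
// boundary, so the split trigger never finds a tombstone straddling the
// split point.
type pendingSplit struct {
	mu  sync.Mutex
	key string // empty means no split in progress
}

func (p *pendingSplit) set(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.key = key
}

// spansForDeleteRange returns the spans to cover with range tombstones,
// splitting [start, end) at the pending split key if one is set.
func (p *pendingSplit) spansForDeleteRange(start, end string) [][2]string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.key != "" && start < p.key && p.key < end {
		return [][2]string{{start, p.key}, {p.key, end}}
	}
	return [][2]string{{start, end}}
}

func main() {
	var p pendingSplit
	p.set("n")
	fmt.Println(p.spansForDeleteRange("a", "z")) // [[a n] [n z]]
}
```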

sumeerbhola commented 3 years ago

And this discussion is similar to the fragmentation discussion in https://docs.google.com/document/d/1ItxpitNwuaEnwv95RJORLCGuOczuS2y_GoM2ckJCnFs/edit#heading=h.qdoeqq30zym7 (internal doc), and I believe has similar implications regarding latching, and physical (in)stability of MVCC key-value pairs when working with an Engine snapshot.

jbowens commented 3 years ago

When that limit is exceeded, we compile all future rangedels down to point deletions.

I'm super-wary of these kinds of performance cliffs after the issues with ranged intent resolution

Rather than using a discrete limit, could we make the heuristic responsible for deciding between laying down point and range tombstones adaptive? A heuristic that's a function of both the number of existing range tombstones and the number of consecutive points being deleted could turn the cliff into a continuous slope.
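For instance, a sketch of such an adaptive threshold (made-up numbers, purely illustrative):

```go
package main

import "fmt"

// Instead of a fixed cutoff, the number of consecutive keys a DeleteRange
// must cover before it writes a range tombstone grows with the number of
// range tombstones the range already holds, turning the behavior cliff into
// a continuous slope. Names and constants are hypothetical.
func useRangeTombstone(existingRangeDels, consecutiveKeysDeleted int) bool {
	// The more range tombstones already present, the larger the deletion
	// has to be to justify adding another one.
	threshold := 16 * (1 + existingRangeDels)
	return consecutiveKeysDeleted >= threshold
}

func main() {
	fmt.Println(useRangeTombstone(0, 100))   // true: cheap to add the first tombstone
	fmt.Println(useRangeTombstone(50, 100))  // false: many tombstones already, small delete
	fmt.Println(useRangeTombstone(50, 5000)) // true: delete large enough to still pay off
}
```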

jbowens commented 3 years ago

And this discussion is similar to the fragmentation discussion in https://docs.google.com/document/d/1ItxpitNwuaEnwv95RJORLCGuOczuS2y_GoM2ckJCnFs/edit#heading=h.qdoeqq30zym7 (internal doc), and I believe has similar implications regarding latching, and physical (in)stability of MVCC key-value pairs when working with an Engine snapshot.

IIUC, if we figure out fragmentation, the range split case becomes easier, contributing only O(1) writes, because we only need to split at most the one range deletion fragment that spans the split boundary. However, with fragmentation, the batch size consideration is still relevant for every batch that needs to perform a DeleteRange, since in the worst case we need to rewrite every existing tombstone.

erikgrinaker commented 3 years ago

Rather than using a discrete limit, could we make the heuristic responsible for deciding between laying down point and range tombstones adaptive? A heuristic that's a function of both the number of existing range tombstones and the number of consecutive points being deleted could turn the cliff into a continuous slope.

We could, but in the limit this would still end up with table drops and index backfill cancellation being O(n). Also, consider someone who is about to run out of disk space and would like to drop a table and trigger an explicit GC to free up space: if we have to lay down point tombstones over the entire table, that could end up exceeding the remaining disk space. Of course, in the limit, the same could be said for range tombstone fragmentation.

Then again, I suppose the heuristics could make sure these cases always end up dropping range tombstones. Especially in the case of table/index drops, since they'd always hit the full-range fast path. Maybe full-range tombstones are special enough that we'd even want to use separate keys for them that would never fragment.

IIUC, if we figure out fragmentation, the range split case becomes easier and O(1) write contributions, because we only need to split at most the one range deletion fragment that spans the split boundary.

It would still be O(versions) for the tombstone that straddles the range boundary. This is likely to be far fewer, but could still conceivably be enough to exceed the command size.

erikgrinaker commented 3 years ago

If we end up doing fragmentation (which seems increasingly compelling at this point), then -- as @tbg suggested above -- I think we should simply pre-fragment the tombstone before the split. Any further tombstone writes that come in after that would then automatically get fragmented across that boundary too, so the split shouldn't have to deal with it at all. This avoids the command size problem during the range split.

Of course, it just shifts the problem to dealing with the Pebble batch size, so we'll have to solve that. But given how critical range splits are, I think we want to do as little work as possible during them.

jbowens commented 3 years ago

If we end up doing fragmentation (which seems increasingly compelling at this point), then -- as @tbg suggested above -- I think we should simply pre-fragment the tombstone before the split.

This problem exists with or without fragmentation, right? With fragmentation, we'll need to split one tombstone key of size O(# versions). Without fragmentation, we'll need to split O(# tombstones) at the split point.

What about merges? Would merging tombstones happen after the merge is complete?

erikgrinaker commented 3 years ago

This problem exists with or without fragmentation, right? With fragmentation, we'll need to split one tombstone key of size O(# versions). Without fragmentation, we'll need to split O(# tombstones) at the split point.

With fragmentation, we can pre-determine the split point, and fragment the tombstones around the split point before processing the split trigger. Any range tombstones that are written in the meanwhile would then automatically get fragmented. Then the split trigger wouldn't need to process the range tombstones at all, as the fragments would fall into the correct sides of the split. The split is a pretty heavyweight operation, and takes out some hefty latches I believe, so it'd be nice to move that work outside of it.
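Roughly, with hypothetical types (not the actual range key representation):

```go
package main

import "fmt"

// Hypothetical sketch, not CockroachDB code. With fragmented range keys,
// overlapping tombstones share spans, so pre-fragmenting only needs to
// rewrite the one fragment straddling the split key; the cost is rewriting
// every version stacked on that fragment, i.e. the O(versions) noted earlier.
type fragment struct {
	startKey, endKey string
	timestamps       []int64 // all range tombstone versions over this span
}

// preFragment splits the fragment straddling splitKey (if any) so that a
// later split trigger only sees fragments falling entirely on one side.
func preFragment(frags []fragment, splitKey string) []fragment {
	var out []fragment
	for _, f := range frags {
		if f.startKey < splitKey && splitKey < f.endKey {
			out = append(out,
				fragment{f.startKey, splitKey, f.timestamps},
				fragment{splitKey, f.endKey, f.timestamps})
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{"a", "k", []int64{5}},
		{"k", "t", []int64{5, 9}}, // straddles the split key "p"
		{"t", "z", []int64{9}},
	}
	fmt.Println(preFragment(frags, "p"))
	// [{a k [5]} {k p [5 9]} {p t [5 9]} {t z [9]}]
}
```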

What about merges? Would merging tombstones happen after the merge is complete?

I'd be inclined to drop fragment merging during range merges, and put it off until GC (we'd presumably merge fragments during GC to avoid them accumulating endlessly). If necessary, we could kick off a tombstone merge after the merge trigger has been processed, but I don't see a need to do it during the range merge.
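A rough sketch of what that GC-time coalescing could look like (hypothetical types, not actual GC code):

```go
package main

import "fmt"

// Two neighboring fragments can be merged back into one only when they abut
// and carry identical version stacks, which is exactly the state left behind
// by an earlier split.
type fragment struct {
	startKey, endKey string
	timestamps       []int64
}

func sameVersions(a, b []int64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// coalesce merges runs of abutting fragments with identical versions.
// Fragments are assumed sorted by start key and non-overlapping.
func coalesce(frags []fragment) []fragment {
	var out []fragment
	for _, f := range frags {
		if n := len(out); n > 0 &&
			out[n-1].endKey == f.startKey &&
			sameVersions(out[n-1].timestamps, f.timestamps) {
			out[n-1].endKey = f.endKey
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{{"a", "k", []int64{5}}, {"k", "p", []int64{5}}, {"p", "z", []int64{9}}}
	fmt.Println(coalesce(frags)) // [{a p [5]} {p z [9]}]
}
```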

tbg commented 3 years ago

Using fragmentation, and fragmenting the single MVCC rangedel crossing the split point, SGTM (there is a tiny race where, between the fragmentation and the split, the rangedel gets removed and replaced with one that crosses the split point again, but we just need to detect this I think, not necessarily deal with it). Not merging during merges SGTM as well.

I do wonder where the "always fragment" approach (which, if I understand correctly, we favor?) leaves us wrt the perf of the foreground operations. Are we banking on the TRUNCATE use case etc. to always hit the fast path (i.e. there just isn't a rangedel on that range before)? Erik mentioned this above, but I don't think it was discussed further:

Then again, I suppose the heuristics could make sure these cases always end up dropping range tombstones. Especially in the case of table/index drops, since they'd always hit the full-range fast path. Maybe full-range tombstones are special enough that we'd even want to use separate keys for them that would never fragment.

erikgrinaker commented 3 years ago

there is a tiny race where between the fragmentation and the split the rangedel gets removed and replaced with one that crosses the split point again

Yeah, if we only merge fragments during GC then we would have to synchronize range splits with GC. Or, as you say, detect it and back off to re-fragment.

I do wonder where the "always fragment" approach (which if I understand correctly we favor?) leaves us wrt the perf of the foreground operations. Are we banking on the TRUNCATE use case etc to always hit the fast path (i.e. there just isn't a rangedel on that range before)?

The fragmentation tradeoff makes range deletes slower in exchange for better read performance, so it should mostly affect range deletions. Although I suppose we'd be taking out a write latch over the range-local tombstone span, which would block reads, so it'd affect them too.

I think we should have a fast path for TRUNCATE, so that we don't have to write fragmented tombstones. But that would probably require separate bookkeeping, and might affect read performance, so we should benchmark it.
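As a sketch of the kind of check such a fast path might start from (hypothetical, not a proposal for the actual key layout):

```go
package main

import "fmt"

// If a DeleteRange covers the whole range, a single whole-range tombstone
// could be written under a dedicated key instead of a fragmented range
// tombstone, so the common drop/truncate case never pays the fragmentation
// cost. Types and names are illustrative only.
type rangeDesc struct {
	startKey, endKey string
}

func isFullRangeDelete(desc rangeDesc, delStart, delEnd string) bool {
	return delStart <= desc.startKey && delEnd >= desc.endKey
}

func main() {
	desc := rangeDesc{"a", "z"}
	fmt.Println(isFullRangeDelete(desc, "a", "z")) // true: take the fast path
	fmt.Println(isFullRangeDelete(desc, "f", "z")) // false: fragment as usual
}
```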

erikgrinaker commented 2 years ago

Given the fragmented range key semantics, this work will fall out of the core MVCC range tombstone implementation in #70429. The only split/merge-specific handling is MVCC stats, and that will be covered as part of the MVCC stats implementation for #70429 anyway (see #78085), so I'm closing this.