Open nvanbenschoten opened 3 years ago
https://github.com/cockroachdb/cockroach/issues/22349 and more recently https://github.com/cockroachdb/cockroach/issues/41720#issuecomment-549911840 detail a proposal for eliminating all latching during intent resolution. This issue explores a way in which we could make ranged intent resolution less disruptive to foreground traffic even before we eliminate all latching.
Currently, intent resolution acquires a write latch across its entire span at the target transaction's min_timestamp: https://github.com/cockroachdb/cockroach/blob/bbadd88ba6e0a6fcbec06e7c0c8c1c637671dda4/pkg/kv/kvserver/batcheval/cmd_resolve_intent.go#L45
Because this is a write latch and because it has an early timestamp, this effectively blocks all traffic across the entire resolution span.
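To make the blocking behavior concrete, here is a minimal sketch of span latch conflict rules. It is a simplified, hypothetical model and not the actual `spanlatch` implementation: the `latch` type and the `overlaps`, `conflicts`, and `point` helpers are invented for illustration. It assumes that overlapping write latches always conflict and that a read latch conflicts with an overlapping write latch only when the write's timestamp is at or below the read's.

```go
package main

import "fmt"

// latch is a hypothetical, simplified span latch over the half-open
// key span [start, end). It is not CockroachDB's real latch type.
type latch struct {
	start, end string
	write      bool
	ts         int64
}

// overlaps reports whether the two half-open spans intersect.
func overlaps(a, b latch) bool {
	return a.start < b.end && b.start < a.end
}

// conflicts applies the assumed rules: reads never conflict with reads,
// overlapping writes always conflict, and a read conflicts with a write
// only if the write's timestamp is at or below the read's timestamp.
func conflicts(a, b latch) bool {
	if !overlaps(a, b) {
		return false
	}
	switch {
	case !a.write && !b.write:
		return false
	case a.write && b.write:
		return true
	case a.write:
		return a.ts <= b.ts
	default:
		return b.ts <= a.ts
	}
}

// point builds a latch covering a single key.
func point(key string, write bool, ts int64) latch {
	return latch{start: key, end: key + "\x00", write: write, ts: ts}
}

func main() {
	// A ranged write latch at an early timestamp, as taken by intent resolution.
	resolve := latch{start: "a", end: "d", write: true, ts: 1}

	fmt.Println(conflicts(resolve, point("b", false, 100))) // later read inside the span
	fmt.Println(conflicts(resolve, point("b", true, 100)))  // later write inside the span
	fmt.Println(conflicts(resolve, point("e", true, 100)))  // write outside the span
}
```

Under these assumed rules, the early timestamp is what makes the latch so broad: any read at or above it conflicts, which in practice covers nearly all reads, and writes conflict regardless of timestamp.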
For point intent resolution, this is unfortunate but not terribly so. Anyone who conflicts with the point latch would have conflicted with the intent being resolved anyway, so the latch does not introduce any new coordination between background and foreground processes. Additionally, if the intent has already been resolved and the intent resolution request is redundant, the request performs a point read, notices the missing intent, skips Raft, and releases its latches quickly.
For ranged intent resolution, things are worse for a few reasons. First, it is disruptive to other requests in the same span, even if those requests would not otherwise conflict with the intents being resolved. This means that ranged intent resolution introduces new coordination between background and foreground processes that didn't otherwise exist. So a ranged intent resolution request scanning `[a, d)` to resolve intents at keys `a` and `c` will block a read or write at key `b`. Second, because ranged intent resolution requires an expensive scan, ranged intent resolution requests are disruptive even when they are redundant. This second point is why https://github.com/cockroachdb/cockroach/issues/66741 got so bad. https://github.com/cockroachdb/cockroach/pull/66268 went a long way to make ranged intent resolution cheaper, so we're in a better spot than we were before, but these two problems still remain.
Even before removing all latching from intent resolution, we could make ranged intent resolution less disruptive by avoiding ranged write latches. Ranged intent resolution can be thought of as a two-step process: a scan to find the transaction's intents within the span, followed by resolution of each intent found.
If we handled each of these steps discretely, we could be more selective about when we hold latches and how disruptive we are to other traffic. For instance, an alternative approach to ranged intent resolution would be to evaluate it as:
This works because point intent resolution already handles missing intents properly, so there are no races here.
It also follows that latches may not even be necessary for the initial scan at all, as long as it is still properly synchronized with range lifecycle events (splits, merges, rebalances, etc). We'll have to solve this problem for latch-less intent resolution, so we should be aware of it here as well.
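One way to picture the split evaluation is the sketch below. It is only an illustration under stated assumptions, not CockroachDB code: the `store` type and its `scanIntents`, `resolvePoint`, and `resolveRange` methods are hypothetical, latches are modeled as a log of acquisitions rather than real synchronization, and the ranged read latch on the scan is an assumption that, per the note above, may not be needed at all. The `between` hook simulates another resolver running between the scan and the point resolutions.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a hypothetical stand-in for a range's intents (key -> txn ID).
// latches only logs which latches would be acquired, for illustration.
type store struct {
	intents map[string]string
	latches []string
}

// scanIntents returns the keys in [start, end) holding an intent for txn.
// Assumption: the scan needs at most a ranged read latch.
func (s *store) scanIntents(start, end, txn string) []string {
	s.latches = append(s.latches, fmt.Sprintf("read [%s, %s)", start, end))
	var keys []string
	for k, owner := range s.intents {
		if k >= start && k < end && owner == txn {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// resolvePoint resolves one intent under a point write latch. It reports
// false, without failing, when the intent is already gone.
func (s *store) resolvePoint(key, txn string) bool {
	s.latches = append(s.latches, "write "+key)
	if s.intents[key] != txn {
		return false
	}
	delete(s.intents, key)
	return true
}

// resolveRange scans first, then resolves each found intent individually.
// It returns how many intents this call actually resolved.
func (s *store) resolveRange(start, end, txn string, between func()) int {
	keys := s.scanIntents(start, end, txn)
	if between != nil {
		between() // simulated concurrent activity after the scan
	}
	resolved := 0
	for _, k := range keys {
		if s.resolvePoint(k, txn) {
			resolved++
		}
	}
	return resolved
}

func main() {
	s := &store{intents: map[string]string{"a": "txn1", "b": "txn2", "c": "txn1"}}
	n := s.resolveRange("a", "d", "txn1", func() {
		s.resolvePoint("a", "txn1") // another resolver reaches "a" first
	})
	fmt.Println(n, s.latches, s.intents)
}
```

In this model, no write latch is ever taken on key `b`, so foreground traffic there is never blocked by a write latch. The intent at `a` that disappears after the scan is simply skipped, which is the missing-intent behavior the approach relies on.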
Jira issue: CRDB-9956