erikgrinaker opened 3 years ago
As part of the intent buildup investigation in #60585, we found that stale intents are often left behind when the intent resolver's async task pool is exceeded and we fall back to synchronous intent resolution tied to the client's context as a way to apply backpressure:
https://github.com/cockroachdb/cockroach/blob/a4b7a431093870393cac9f11287eaf282abdbd22/pkg/kv/kvserver/intentresolver/intent_resolver.go#L430-L435
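To illustrate the pattern (a minimal sketch, not the actual CockroachDB code; `cleanupIntents`, `sem`, and `resolve` are hypothetical names): when the bounded async pool is full, we resolve synchronously on the client's context, which backpressures the client but also ties cleanup to the client's lifetime.

```go
package main

import "context"

// cleanupIntents tries to resolve intents via a bounded async pool
// (modeled as a semaphore channel), falling back to synchronous
// resolution on the client's context when the pool is exhausted.
func cleanupIntents(clientCtx context.Context, sem chan struct{}, resolve func(context.Context) error) error {
	select {
	case sem <- struct{}{}:
		// Async slot available: cleanup proceeds even if the client goes away.
		go func() {
			defer func() { <-sem }()
			_ = resolve(context.Background())
		}()
		return nil
	default:
		// Pool exhausted: resolve synchronously on the client's context as
		// backpressure. If the client disconnects, clientCtx is canceled
		// and the remaining intents are left behind.
		return resolve(clientCtx)
	}
}
```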
However, if the client goes away, it stops cleaning up intents. Similarly, as seen in #64770, if a client disconnects in the middle of a transaction, the rollback is only given a limited timeout to clean up the intents, which can leave stray intents behind. There was a similar bug with transaction record cleanup in #64868.
We have two conflicting goals here: clean up intents reliably, but also avoid spawning too many cleanup goroutines, backpressuring clients instead when we're overloaded.
Since we want intent cleanup to be reliable without overloading the system, and it isn't particularly latency-sensitive, it seems like a better solution would be to use a queue with a bounded worker pool. This would greatly simplify intent cleanup logic, which today is spread across many separate components with interactions that can be difficult to reason about. It could also possibly replace the current intent GC during MVCC GC, which has some problems of its own (#64266). It may be possible to use txn records as a (fallback) persisted queue, since these are range-local (and thus can possibly be scanned efficiently) and also contain the intent spans.
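A minimal sketch of this shape, assuming hypothetical names (`cleanupQueue`, `cleanupTask`, `resolveIntents`) and ignoring persistence, stoppers, and error handling: a bounded in-memory queue drained by a fixed-size worker pool, so cleanup is decoupled from client lifetimes.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// cleanupTask is a hypothetical stand-in for the intent spans of a
// finalized transaction that still need resolution.
type cleanupTask struct {
	txnID string
}

// cleanupQueue decouples producers (e.g. the intent resolver) from a
// bounded pool of workers, so cleanup continues even if the originating
// client disconnects.
type cleanupQueue struct {
	tasks chan cleanupTask
	wg    sync.WaitGroup
}

func newCleanupQueue(queueSize, workers int) *cleanupQueue {
	q := &cleanupQueue{tasks: make(chan cleanupTask, queueSize)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for task := range q.tasks {
				resolveIntents(context.Background(), task)
			}
		}()
	}
	return q
}

// enqueue blocks until there is room in the queue, applying
// backpressure to connected callers.
func (q *cleanupQueue) enqueue(task cleanupTask) {
	q.tasks <- task
}

// close drains the queue and waits for all workers to finish.
func (q *cleanupQueue) close() {
	close(q.tasks)
	q.wg.Wait()
}

// resolveIntents is a placeholder for actual intent resolution.
func resolveIntents(ctx context.Context, task cleanupTask) {
	time.Sleep(10 * time.Millisecond) // simulate work
	fmt.Println("resolved intents for txn", task.txnID)
}

func main() {
	q := newCleanupQueue(128, 4)
	for i := 0; i < 10; i++ {
		q.enqueue(cleanupTask{txnID: fmt.Sprintf("txn-%d", i)})
	}
	q.close()
}
```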
This would need some way to apply backpressure if the cleanup queue grows too large. However, it cannot rely solely on backpressuring clients during their own txn cleanup, since a client can simply disconnect -- in particular, a client that disconnects mid-txn leaves intents that still need to be cleaned up, with no client left to backpressure. A sketch of one alternative is below.
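One option, extending the hypothetical `cleanupQueue` sketch above: make enqueueing non-blocking, and when the queue is full, rely on the persisted txn record as a durable fallback that a periodic range-local scan can pick up later, rather than blocking a caller that may already be gone.

```go
// tryEnqueue attempts a non-blocking enqueue. When the in-memory queue
// is full it returns false rather than blocking; the txn record still
// persists the intent spans, so a background scan can retry this
// transaction's cleanup later.
func (q *cleanupQueue) tryEnqueue(task cleanupTask) bool {
	select {
	case q.tasks <- task:
		return true
	default:
		return false
	}
}
```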
Related to #65191.
Jira issue: CRDB-7485