Is your feature request related to a problem? Please describe.
In a customer case, a single `INSERT INTO ... SELECT FROM` statement that was ultimately killed created almost 10 billion intents across multiple ranges over a 5-hour window. The MVCC GC queue then caused severe LSM inversion, which led to a 30+ minute outage on the cluster. We need the ability to recover gracefully when we detect this type of problem.
Describe the solution you'd like
There are two high level approaches to gracefully recovering:
1) Long term - Admission control automatically pacing the rate of intent resolution to prevent IO overload
2) Short term - A knob to manually rate limit intent resolution (a rough sketch of such a knob follows this list)
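As an illustration of option 2, here is one shape such a knob could take: a token-bucket limiter (`golang.org/x/time/rate`) whose rate is driven by an operator-settable value. The setting name `intent_resolver.gc.max_intents_per_second` and the update hook below are hypothetical, not existing CockroachDB APIs; this is a minimal sketch of the idea, not a proposed implementation.

```go
// Hypothetical sketch of a manual rate-limit knob for GC-driven intent
// resolution. None of these names exist in CockroachDB today.
package intentpacing

import (
	"golang.org/x/time/rate"
)

// gcIntentResolutionLimiter paces how many intents per second the MVCC GC
// path is allowed to hand to intent resolution. A limit of rate.Inf keeps
// today's behavior (no pacing).
var gcIntentResolutionLimiter = rate.NewLimiter(rate.Inf, 10000 /* burst */)

// SetGCIntentResolutionRate would be invoked when the operator changes the
// (hypothetical) cluster setting, e.g.
//   SET CLUSTER SETTING intent_resolver.gc.max_intents_per_second = 100000;
// A non-positive value disables pacing.
func SetGCIntentResolutionRate(intentsPerSecond int64) {
	if intentsPerSecond <= 0 {
		gcIntentResolutionLimiter.SetLimit(rate.Inf)
		return
	}
	gcIntentResolutionLimiter.SetLimit(rate.Limit(intentsPerSecond))
}
```

The exact plumbing (cluster setting registration and its on-change callback) is left out here; the point is only that a single operator-visible number could feed the limiter.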
Currently the `mvccGCQueue` calls `CleanupTxnIntentsOnGCAsync`, which ends up calling `cleanupFinishedTxnIntents`.
See #97108 for more details on the current status of AC and intent resolution.
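To make the integration point concrete, the sketch below shows (under the same assumptions as above) where pacing could sit: a small helper that blocks on the limiter before a batch of intents is handed to the resolver. The helper name and the `resolveBatch` callback are hypothetical; in practice the wait would likely live inside `cleanupFinishedTxnIntents` or the intent resolver's batching loop rather than in a free function.

```go
// Hypothetical pacing hook for the GC-driven intent resolution path.
package intentpacing

import (
	"context"

	"golang.org/x/time/rate"
)

// paceAndResolve blocks until the limiter grants enough tokens for the batch,
// then invokes the provided resolve callback. Waiting is chunked so that a
// batch larger than the limiter's burst still succeeds.
func paceAndResolve(
	ctx context.Context,
	limiter *rate.Limiter,
	numIntents int,
	resolveBatch func(context.Context) error,
) error {
	burst := limiter.Burst()
	if burst <= 0 {
		// Degenerate configuration; don't pace.
		return resolveBatch(ctx)
	}
	for remaining := numIntents; remaining > 0; {
		n := remaining
		if n > burst {
			n = burst
		}
		// WaitN blocks until n tokens are available or ctx is canceled.
		if err := limiter.WaitN(ctx, n); err != nil {
			return err
		}
		remaining -= n
	}
	return resolveBatch(ctx)
}
```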
Jira issue: CRDB-44794