kv: ability to gracefully recover from large intent buildup

Is your feature request related to a problem? Please describe.

In a customer case, we saw almost 10 billion intents created across multiple ranges over a 5-hour window by a single INSERT INTO ... SELECT FROM statement that was ultimately killed. The MVCC GC queue ended up causing severe LSM inversion, which ultimately led to a 30+ minute outage on the cluster. We need the ability to recover gracefully when we detect this type of problem.

Describe the solution you'd like

There are two high level approaches to gracefully recovering:

1) Long term - Admission control automatically pacing the rate of intent resolution to prevent IO overload
2) Short term - Some knob to manually rate limit intent resolution

Currently, the mvccGCQueue calls CleanupTxnIntentsOnGCAsync, which ends up calling cleanupFinishedTxnIntents.

See #97108 for more details on the current status of AC and intent resolution.

Jira issue: CRDB-44794

cockroachdb/cockroach #135934