In an internal test cluster with unbounded snapshot ingests, we discovered that the Store.HandleSnapshot function showed high CPU usage, which increased goroutine scheduler latency and ultimately caused spikes in SQL latency.
In an internal thread, we discussed that the ideal solution would be to use the elastic CPU limiter for this work, since it was impacting scheduler latency. However, because snapshot ingestion is not technically elastic work, we would need to tweak the CPU limiter to also handle regular traffic and to support pacing thresholds higher than 1ms.
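As a rough illustration of the pacing idea (this is not the actual CockroachDB elastic CPU limiter API; `ingestPacer`, `budget`, and the yield points below are hypothetical, and wall-clock time stands in for on-CPU time), a snapshot ingest loop could work in bounded slices and yield to the Go scheduler whenever a slice's budget is exhausted:

```go
package main

import (
	"runtime"
	"time"
)

// ingestPacer is a hypothetical sketch of elastic-CPU-style pacing: work is
// done in slices of at most `budget` (1ms today in the limiter; this issue
// proposes supporting larger thresholds), and the goroutine yields between
// slices so it does not monopolize the scheduler.
type ingestPacer struct {
	budget     time.Duration // max duration per slice before yielding
	sliceStart time.Time
}

func newIngestPacer(budget time.Duration) *ingestPacer {
	return &ingestPacer{budget: budget, sliceStart: time.Now()}
}

// maybeYield yields the processor if the current slice has used its budget.
// Wall-clock time is used here as a stand-in for measured on-CPU time.
func (p *ingestPacer) maybeYield() {
	if time.Since(p.sliceStart) >= p.budget {
		runtime.Gosched() // let other goroutines (e.g. foreground SQL work) run
		p.sliceStart = time.Now()
	}
}

func main() {
	pacer := newIngestPacer(5 * time.Millisecond)
	for i := 0; i < 1_000_000; i++ {
		// ... ingest one chunk of the snapshot here ...
		pacer.maybeYield()
	}
}
```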
CPU profile attached: cpuprof.2024-04-30T18_09_32.630.102.pprof.zip. Some metrics from when the overload happened can be found here.
Jira issue: CRDB-38467
Epic CRDB-42958