Is your feature request related to a problem? Please describe.
We use a per-node rate limiter to control tenant CPU activity for tenant-observed perf predictability. It tries to keep a tenant's CPU usage at around 20% of a KV node. To do so, it uses a model of CPU activity built from a few variables: the # of operations (read or write) and the size of each operation. The linear model's constants were derived from experimental data (internal link) that aimed to predict the # of cores needed for stock workloads by recording the aforementioned variables for each run at steady state.
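To make the shape of that model concrete, here's a minimal sketch; the field names and the read/write request- and byte-cost split are illustrative placeholders, not the actual constants from the experiments behind the internal link.

```go
package tenantcpu

// cpuModel is an illustrative linear model of the kind described above:
// estimated CPU cost = per-request constant + per-byte constant * size.
// Names and values here are placeholders, not the experimentally derived ones.
type cpuModel struct {
	readRequestCost  float64 // CPU-nanos per read request
	readByteCost     float64 // CPU-nanos per byte read
	writeRequestCost float64 // CPU-nanos per write request
	writeByteCost    float64 // CPU-nanos per byte written
}

// estimateRead returns the predicted CPU-nanos for a read of the given size.
func (m cpuModel) estimateRead(bytes int64) float64 {
	return m.readRequestCost + m.readByteCost*float64(bytes)
}

// estimateWrite returns the predicted CPU-nanos for a write of the given size.
func (m cpuModel) estimateWrite(bytes int64) float64 {
	return m.writeRequestCost + m.writeByteCost*float64(bytes)
}
```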
Some of this approach was predicated on the lack of measurable on-CPU time from the Go scheduler (see: https://github.com/golang/go/issues/41554). We opted then to fall back on a coarse model of CPU usage. We've found, however, that this model can have a high error margin. In a recent escalation (internal link; discussion here) we found that it was possible for a single tenant to consume more than its fair share of KV CPU. Reproducing it with a secondary tenant with tenant-side cost controls disabled + running large table scans, we were able to get that single tenant to sustain a high CPU load in KV (the burstiness was an artifact of manually running the table scans).
Describe the solution you'd like
When admitting a request we probably do want some model of predicted usage. After having processed it, however, if we were able to measure precise CPU time, we should be able to send our tenant-scoped rate limiter into debt for future requests. We already have control flows for this when recording, post hoc, the total bytes read (a sketch of this admit-then-reconcile pattern follows the list below). There are various ideas for precise measurement of on-CPU time:
https://github.com/cockroachdb/cockroach/pull/60589 explores the idea of introducing a "task group" abstraction to the Go runtime and accumulating the total nanoseconds each goroutine spends in the running state (as observed by the Go scheduler) into the goroutine's surrounding task group.
https://github.com/golang/go/pull/51347 is a tiny adaptation of the same idea, but still tracking nanos at the level of individual goroutines. irfansharif/runner is a prototype of how we could accumulate total on-CPU time across a set of coordinating goroutines in our own libraries, without it needing to sit within the runtime.
https://github.com/cockroachdb/cockroach/pull/60508 makes use of profiler labels and sampling to attribute CPU usage to SQL queries and sessions. Notably, it does not require runtime changes, but as with all things profiler labels, it comes with some overhead.
We could use eBPF probes into the runtime to trace goroutine scheduling events, and maintain a similar kind of nanos counter on our end. This would be Linux-only, but maybe palatable for on-prem environments.
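Whichever measurement mechanism we end up with, the limiter side looks roughly the same. Below is a minimal sketch of the admit-then-reconcile pattern referenced above, assuming a measuredNanos value obtained from one of the mechanisms listed; the limiter here is an illustrative token bucket denominated in CPU-nanos, not the actual tenantrate API.

```go
package tenantcpu

import "sync"

// debtLimiter is an illustrative token bucket denominated in CPU-nanos.
// Tokens may go negative: a request that turns out costlier than its
// estimate puts the tenant into debt, which future admissions pay off.
type debtLimiter struct {
	mu     sync.Mutex
	tokens float64 // available CPU-nanos; refilled elsewhere at the tenant's rate
}

// admit charges the a-priori estimate from the coarse model. A real
// implementation would block or queue here while tokens <= 0.
func (l *debtLimiter) admit(estimatedNanos float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.tokens -= estimatedNanos
}

// reconcile is called after the request has been processed, once the
// precisely measured on-CPU time is known. Charging the difference is
// what can push the tenant into debt, mirroring what we already do when
// recording total bytes read post hoc.
func (l *debtLimiter) reconcile(estimatedNanos, measuredNanos float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.tokens -= measuredNanos - estimatedNanos
}
```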
I think we should move towards precisely attributed on-CPU time and flesh out libraries to make this pattern easier. https://github.com/cockroachdb/cockroach/issues/58164 is relevant. After looking at the runtime changes needed (<30 lines), I think the upside of doing it far outweighs giving up on precise measurements. We'll still try to upstream the change, but even if it doesn't land/takes a long time, given we're using Bazel to build CRDB, it's trivial to point to a mirrored runtime with our changes. In fact, we already do. This would mean that engineers wouldn't have to maintain their own Go distribution manually on their machines; with the right set of build tags we can make this "just work" (see the sketch below).
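To illustrate the "just work" part: something like the following, using an entirely hypothetical build tag and runtime hook (the patched_go_runtime tag, the cputime package, and runtime.GRunningNanos are placeholders, not real names), would let callers use precise measurements when built with the mirrored toolchain and fall back to the coarse model otherwise.

```go
// cputime_patched.go
//go:build patched_go_runtime

package cputime

import "runtime"

// Hypothetical: assumes the mirrored Go runtime exports a per-goroutine
// on-CPU nanos counter. This file is only compiled when the Bazel build
// selects the patched toolchain via the patched_go_runtime tag.
func Nanos() int64    { return runtime.GRunningNanos() }
func Supported() bool { return true }
```

```go
// cputime_fallback.go
//go:build !patched_go_runtime

package cputime

// Stock toolchain: precise measurement isn't available, so callers fall
// back to the coarse model.
func Nanos() int64    { return 0 }
func Supported() bool { return false }
```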
Describe alternatives you've considered
See above.
Other benefits of measured on-CPU time
I want this issue to focus on just using measured CPU time for rate limiting, given its implications for KV stability and tail performance in a multi-tenant system. That said, there are other benefits to moving towards measured on-CPU time instead of modelling or using proxies:
Better observability into exactly what's hot and what isn't. We're building out a key visualizer, for example, on top of what is a very flawed metric (see this and this). Instead of visualizing proxies (QPS) over base signals (CPU), it'd be more accurate to visualize the latter.
Would help get per-request CPU numbers for the SQL process.
More accurate costing for tenants. We use a tenant cost model for tenant-side flow control and billing reasons that also tries to predict CPU costs. There's a desire to decouple this from actual CPU measurements because of the variance the latter approach could have. Your goroutines could be co-opted by the Go GC and be costlier, or, in a shared system, things could just be more expensive depending on the active neighbors on the machines you're running on. The unpredictability makes for bad UX. All that said, even if we don't bill based on measured usage, we should account for it ourselves to know what patterns we're subsidizing and which we aren't. Scanning over MVCC garbage, for example, is unaccounted for today in our RU model, but if we had precise measurements to inform us how much it costs, we could tune our markups accordingly.
Better allocation decisions. We're currently using a collapsed "QPS" proxy signal for CPU utilization when deciding to move replicas from one node to another. When running into limitations of this proxy (like AddSSTableRequests), we sensitize specific bits through magic constants. If we had precise on-CPU time on a per-replica basis, and allocated based on that, I imagine it would obviate the need to "cost" each request in this manner.
Jira issue: CRDB-13382