Back when we added this metric as a way of determining overload, in practice we would see values of 10 or more in non-overloaded clusters. We only saw degradation once this metric exceeded roughly 30-50.
Recently a customer reported seeing values of only 2 on overloaded clusters. I checked a DRT cluster running TPCC at 100% CPU usage (with fairly high query latencies), and the value there was between 2 and 6, which I found very surprising.
It's possible a change inside the Go scheduler altered when goroutines become runnable. We should investigate whether recent releases differ here. If we find a difference, we may need to update the thresholds we use for admission control.
We could run kv95 on a single node and observe runnable counts, comparing against older builds (close to the Go version bump boundaries). Changes in the Go runtime itself are also a possibility worth checking.
CC @sumeerbhola @aadityasondhi
Jira issue: CRDB-43227