this was completely my decision to hide the max, thinking it would make it simpler to reason about for end users, but you're right that it also limits flexibility and more throttling comes at a significant cost.
so let's just expose the max :)
Just a heads up, the links to the Linux docs, LinkedIn 2016, Indeed and Dan Luu articles are missing.
in my previous comment I meant expose the period.
Background: Linux CFS Throttling
In short, CFS CPU throttling works by:

- giving each cgroup a quota of CPU time that it may consume during each scheduling period
- once a cgroup exhausts its quota, descheduling (throttling) its tasks for the remainder of the period
- refilling the quota at the start of the next period, at which point the throttled tasks resume
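For concreteness, here's a minimal sketch of what that configuration looks like at the cgroup v2 level; the cgroup path is hypothetical, and this is not Aurae code:

```go
// Minimal sketch (hypothetical cgroup path, not Aurae code) of setting a CPU
// quota under cgroup v2: cpu.max holds "<quota_us> <period_us>".
package main

import "os"

func main() {
	// Allow 400ms of CPU time per 1_000ms period (the 400ms/1_000ms quota
	// used in the excerpts below); exhaust it and the group's tasks are
	// throttled until the next period begins.
	const cgroup = "/sys/fs/cgroup/my-cell" // hypothetical path
	err := os.WriteFile(cgroup+"/cpu.max", []byte("400000 1000000"), 0o644)
	if err != nil {
		panic(err)
	}
}
```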
In long:
Background: Aurae
Aurae uses CFS throttling to enforce cell CPU time quotas, similarly to Docker as described in the various background articles above.
However it's made the interesting choice to hide the CFS throttling period, exposing only a max time quota field in its API. Furthermore, Aurae has hardcoded the CFS period to be 1s, which is 10x its typical default value of 100ms.
Problem: Large Latency Artifacts
The primary problem with how Aurae's CPU quotas currently work is large latency artifacts: a workload that burns through its quota early in a period is paused for the remainder of that 1s period, so pauses approaching a full second are possible.
See the example section below for code and excerpt data exploring this effect.
In the case of a request processing service, these are SLO-breaking levels of latency. In fact, the typical 100ms CFS period is already material to such concerns.
Even larger latency artifacts, now measured in the 600ms-900ms range, might be bad enough to affect things like health checks and cluster traffic routing systems.
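To make the failure mode concrete, here's a minimal sketch of the kind of measurement involved; it is not the actual #406 program, and it assumes it is run inside a quota-limited cell/cgroup:

```go
// Minimal sketch (not the actual #406 program): burn CPU inside a
// quota-limited cgroup/cell while a ticker loop watches for gaps. Under a
// 400ms/1_000ms quota, gaps spike toward the remaining ~600ms of each period
// once the quota is exhausted.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Burn every available core so the quota is exhausted early in each period.
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		go func() {
			for {
			}
		}()
	}

	prev := time.Now()
	for range time.Tick(10 * time.Millisecond) {
		now := time.Now()
		if gap := now.Sub(prev); gap > 100*time.Millisecond {
			fmt.Printf("latency excursion: %v\n", gap)
		}
		prev = now
	}
}
```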
Proposal: at least expose CFS period; maybe lower its default
At the very least, in my view, Aurae needs to expose the CFS period alongside the max CPU time.
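Roughly the shape I have in mind, as an illustrative Go struct; these are not Aurae's actual protobuf/API field names:

```go
// Rough sketch of the proposal's shape; names are illustrative, not Aurae's
// actual cells API.
type CellCPU struct {
	// MaxMicros: CPU time the cell may consume per period (the quota half of cpu.max).
	MaxMicros int64
	// PeriodMicros: the CFS period, currently hardcoded to 1_000_000 (1s).
	// Exposing it lets users pick e.g. 40_000/100_000 instead of the
	// equivalent-throughput 400_000/1_000_000, trading shorter pauses for
	// more frequent throttling.
	PeriodMicros int64
}
```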
I'm less convinced about lowering the default:
Example Code / Data
To confirm my own recollection of how this all works, and to allow easy reproduction by others, I've written some example programs in #406:
Note: you may need to set `GOMAXPROCS` when running the Go example, since the Go runtime still is not container aware without a 3rd party library; see the `// NOTE` comment after its `cells.start` call for instructions.

Example Excerpt: Node.JS burning about 1 CPU core within a 400ms/1_000ms quota
After running for around 30 seconds, the Node.js example program experiences 600-700ms latency excursions:
Corresponding kernel stats:
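For anyone reproducing this, these kernel stats are the throttling counters from the cell cgroup's `cpu.stat` file; a minimal sketch of dumping them, with a hypothetical cgroup path:

```go
// Minimal sketch (hypothetical cgroup path) of where these kernel stats come
// from: cgroup v2 exposes nr_periods, nr_throttled and throttled_usec in the
// cell cgroup's cpu.stat file.
package main

import (
	"fmt"
	"os"
)

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/my-cell/cpu.stat") // hypothetical path
	if err != nil {
		panic(err)
	}
	fmt.Print(string(data))
}
```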
Example Excerpt: Go burning 4 CPU cores within a 2_000ms/1_000ms quota
Here's a similar result from running an analogous Go program for around 30 seconds:
Here the latency actually encountered is a little lower, since the CPU quota is a little less oversubscribed; also, the low end of the box stat may seem surprising, but it is an artifact of how a constant-interval Go ticker behaves after encountering runtime lag; in other words, after coming out of a paused section, it delivered a couple of ticks in rapid succession.
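For anyone puzzling over that, here's a minimal standalone sketch of the ticker behavior (again, not the #406 program):

```go
// Minimal sketch of the ticker artifact described above: a constant-interval
// time.Ticker buffers one tick during a stall, so right after the stall the
// observed gap between ticks can be well under the nominal interval.
package main

import (
	"fmt"
	"time"
)

func main() {
	const interval = 100 * time.Millisecond
	t := time.NewTicker(interval)
	defer t.Stop()

	prev := time.Now()
	for i := 0; i < 6; i++ {
		if i == 2 {
			// Stand-in for a CFS throttling pause: stop receiving for a while.
			time.Sleep(350 * time.Millisecond)
		}
		<-t.C
		now := time.Now()
		// Expect one long gap (~350ms) at the stall, then a short one (~50ms).
		fmt.Printf("gap=%v\n", now.Sub(prev))
		prev = now
	}
}
```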
Corresponding kernel stats:
Example Excerpt: Go burning 8 CPU cores within a 400ms/1_000ms quota
For a final extreme example, here's an even more oversubscribed Go example:
Here there aren't any "outliers" under a classic boxplot analysis, because the 75%-ile is skewed so heavily upward, to around 980ms.