aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.
https://aurae.io
Apache License 2.0

Concerns About CPU Quota Implementation #407

Closed jcorbin closed 3 months ago

jcorbin commented 1 year ago

Background: Linux CFS Throttling

In short, CFS CPU throttling works by:

- giving each cgroup a budget of CPU time (the quota) that refills at the start of every scheduling period
- letting the group's tasks run until that budget is spent
- descheduling (throttling) every task in the group for the remainder of the period once the budget is gone

In long, see the Linux kernel scheduler documentation and the LinkedIn (2016), Indeed, and Dan Luu write-ups on CFS throttling.

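To make the mechanics concrete, here is a minimal Go sketch (mine, not Aurae code) that drives the cgroup v2 interface directly. The cgroup name cpu-burn-demo is arbitrary, and it assumes root plus a cgroup v2 mount at /sys/fs/cgroup with the cpu controller enabled:

// Minimal cgroup v2 CFS quota demo: 400ms of CPU time per 1s period,
// i.e. the same shape as the 400ms/1_000ms examples below.
// Assumes root, cgroup v2 at /sys/fs/cgroup, and the cpu controller enabled.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	cg := "/sys/fs/cgroup/cpu-burn-demo" // hypothetical cgroup name
	must(os.MkdirAll(cg, 0o755))

	// cpu.max takes "<quota_us> <period_us>": 400_000us of CPU time may be
	// consumed per 1_000_000us period; once it is spent, every task in the
	// group is throttled until the next period begins.
	must(os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("400000 1000000"), 0o644))

	// Move this process into the group.
	must(os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(fmt.Sprint(os.Getpid())), 0o644))

	// Burn CPU for a few seconds...
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
	}

	// ...then inspect the throttling counters the kernel kept for us.
	stat, err := os.ReadFile(filepath.Join(cg, "cpu.stat"))
	must(err)
	fmt.Print(string(stat))
}

Running something like this for a few seconds should show nr_periods and nr_throttled climbing once per period, much like the cpu.stat excerpts further down.
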
Background: Aurae

Aurae uses CFS throttling to enforce cell CPU time quotas, similarly to Docker, as described in the various background articles above.

However, it has made the interesting choice to hide the CFS throttling period, exposing only a max CPU time quota field in its API. Furthermore, Aurae has hardcoded the CFS period to 1s, which is 10x its typical default value of 100ms.
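
To put that 10x in rough terms, assuming a single busy thread and a 40%-of-one-core budget: with the kernel-default 100ms period the budget is 40ms per window, so a throttled task waits at most ~60ms before it runs again; with a 1s period the same 40% budget is 400ms per window, and the wait after exhaustion stretches to ~600ms. At a fixed fractional quota, the worst-case stall scales linearly with the period.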

Problem: Large Latency Artifacts

The primary problem with how Aurae's CPU quotas currently work is large latency artifacts: a cell that exhausts its quota early in the 1s period has every task in it descheduled until the next period starts, so a saturating workload sees pauses of several hundred milliseconds.

See the example section below for code and excerpted data exploring this effect.

In the case of a request-processing service, these are SLO-breaking levels of latency. In fact, the typical 100ms CFS period is already material to such things.

Having even larger latency artifacts, now in the 600ms-900ms range, might be bad enough to affect things like health checks and cluster traffic-routing systems.
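
As a rough back-of-envelope for the excerpts below: with N busy threads, a quota Q, and a period P, the group burns through Q after about Q/N of wall-clock time and is then throttled for roughly P - Q/N, which bounds the extra delay any single timer tick can see:

- Node.js, 1 thread, 400ms/1_000ms: throttled ~600ms per period; the observed 697-699ms intervals are the 100ms tick plus ~600ms of stall
- Go, 4 threads, 2_000ms/1_000ms: quota gone after ~500ms, up to ~500ms of stall; observed intervals top out around 471ms
- Go, 8 threads, 400ms/1_000ms: quota gone after ~50ms, up to ~950ms of stall; observed intervals reach ~983ms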

Proposal: at least expose the CFS period; maybe lower its default

At the very least, in my view, Aurae needs to expose the CFS period alongside the max CPU time.

I'm less convinced about lowering the default period.

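For illustration only, the shape could be as simple as a pair of fields; these names are hypothetical and not Aurae's actual API:

// Hypothetical sketch only (not Aurae's actual proto/API) of the two
// knobs the proposal would expose together.
package cells

type CellCPU struct {
	// MaxMicros: CPU time the cell may consume per period
	// (what the current API exposes as the lone quota field).
	MaxMicros int64
	// PeriodMicros: the CFS period that quota is charged against.
	// Whether its default stays at 1_000_000 (1s) or moves toward the
	// kernel default of 100_000 (100ms) is the separate question above.
	PeriodMicros int64
}
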
Example Code / Data

To confirm my own recollection of how this all works, and to allow easy reproduction by others, I've written some example programs in #406 (a Node.js variant and a Go variant, excerpted below).
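
The core of the measurement is roughly the following, a simplified Go sketch of the idea rather than the actual #406 code: saturate the CPUs inside a quota-limited cgroup while a 100ms ticker records how late each tick actually arrives.

// Simplified sketch of the lag-measurement idea (not the #406 code):
// burn CPU while a 100ms ticker records how late each tick arrives.
// Run it inside a quota-limited cgroup and leave it running until killed.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Saturate every core so the cgroup quota is exhausted early
	// in each CFS period.
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			for {
			}
		}()
	}

	const expected = 100 * time.Millisecond
	ticker := time.NewTicker(expected)
	defer ticker.Stop()

	prev := time.Now()
	for range ticker.C {
		now := time.Now()
		actual := now.Sub(prev)
		prev = now
		// Report ticks that arrive well past their expected interval.
		if actual > 2*expected {
			log.Printf("lag: actual %v expected %v", actual, expected)
		}
	}
}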

Example Excerpt: Node.js burning about 1 CPU core within a 400ms/1_000ms quota

After running for around 30 seconds, the node example program experiences 600-700ms latency excursions:

[lag report] min:100 max:699 box:[ 101 101 276 ] hi:626 hiOutliers:24 2.42%
  ┌─────────┬───────┬────────┬──────────┐
  │ (index) │ start │ actual │ expected │
  ├─────────┼───────┼────────┼──────────┤
  │    0    │ 1815  │  697   │   100    │
  │    1    │ 5815  │  697   │   100    │
  │    2    │ 8815  │  697   │   100    │
  │    3    │ 9815  │  697   │   100    │
  │    4    │ 10815 │  697   │   100    │
  │    5    │ 12815 │  697   │   100    │
  │    6    │ 13815 │  697   │   100    │
  │    7    │ 17815 │  697   │   100    │
  │    8    │ 18815 │  697   │   100    │
  │    9    │ 21815 │  697   │   100    │
  │   10    │  815  │  698   │   100    │
  │   11    │ 2814  │  698   │   100    │
  │   12    │ 4814  │  698   │   100    │
  │   13    │ 7814  │  698   │   100    │
  │   14    │ 14814 │  698   │   100    │
  │   15    │ 16814 │  698   │   100    │
  │   16    │ 19814 │  698   │   100    │
  │   17    │ 22814 │  698   │   100    │
  │   18    │ 23814 │  698   │   100    │
  │   19    │ 3814  │  699   │   100    │
  │   20    │ 6813  │  699   │   100    │
  │   21    │ 11813 │  699   │   100    │
  │   22    │ 15813 │  699   │   100    │
  │   23    │ 20813 │  699   │   100    │
  └─────────┴───────┴────────┴──────────┘

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 12794354
user_usec 12704213
system_usec 90141
core_sched.force_idle_usec 0
nr_periods 31
nr_throttled 31
throttled_usec 18312824
nr_bursts 0
burst_usec 0
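
Reading those counters: every one of the 31 elapsed periods was throttled (nr_throttled == nr_periods), and throttled_usec / nr_throttled ≈ 18312824 / 31 ≈ 591ms of throttled time per period, which lines up with the ~600ms of extra delay in the intervals above.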

Example Excerpt: Go burning 4 CPU cores within a 2_000ms/1_000ms quota

Here's a similar result from running an analogous Go program for around 30 seconds:

2023/02/14 15:58:51 [lag report] min:27.89854ms max:472.131789ms box:[ 99.880519ms 99.992654ms 100.151136ms ] hi:101.992654ms hiOutliers:12 12.0%
2023/02/14 15:58:51 {Start:2023-02-14 15:58:39.214184729 -0500 EST m=+20.401723712 End:2023-02-14 15:58:39.485175472 -0500 EST m=+20.672714508 Actual:270.990796ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:49.014233509 -0500 EST m=+30.201772457 End:2023-02-14 15:58:49.485037021 -0500 EST m=+30.672576095 Actual:470.803638ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:45.014041338 -0500 EST m=+26.201580281 End:2023-02-14 15:58:45.485087655 -0500 EST m=+26.672626688 Actual:471.046407ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:51.01440381 -0500 EST m=+32.201942761 End:2023-02-14 15:58:51.485465242 -0500 EST m=+32.673004332 Actual:471.061571ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:43.014046723 -0500 EST m=+24.201585667 End:2023-02-14 15:58:43.48512526 -0500 EST m=+24.672664308 Actual:471.078641ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:44.01403297 -0500 EST m=+25.201571914 End:2023-02-14 15:58:44.485149938 -0500 EST m=+25.672688966 Actual:471.117052ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:48.01406235 -0500 EST m=+29.201601293 End:2023-02-14 15:58:48.48522296 -0500 EST m=+29.672762000 Actual:471.160707ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:50.014036425 -0500 EST m=+31.201575373 End:2023-02-14 15:58:50.485200964 -0500 EST m=+31.672739934 Actual:471.164561ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:40.014035733 -0500 EST m=+21.201574682 End:2023-02-14 15:58:40.485247383 -0500 EST m=+21.672786397 Actual:471.211715ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:42.014066621 -0500 EST m=+23.201605565 End:2023-02-14 15:58:42.485391197 -0500 EST m=+23.672930199 Actual:471.324634ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:46.014037924 -0500 EST m=+27.201576868 End:2023-02-14 15:58:46.485420931 -0500 EST m=+27.672959951 Actual:471.383083ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:47.014097749 -0500 EST m=+28.201636694 End:2023-02-14 15:58:47.486229431 -0500 EST m=+28.673768483 Actual:472.131789ms Expected:100ms}

Here the actual intervals run a little lower, since the CPU quota is a little less oversubscribed. The low end of the box stat may also seem surprising, but it is an artifact of how a constant-interval Go ticker behaves after encountering runtime lag: coming out of a pause, it delivers a couple of ticks in rapid succession.

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 65863014
user_usec 60727707
system_usec 5135306
core_sched.force_idle_usec 0
nr_periods 32
nr_throttled 32
throttled_usec 32261496
nr_bursts 0
burst_usec 0

Example Excerpt: Go burning 8 CPU cores within a 400ms/1_000ms quota

For a final extreme example, here's an even more over-subscribed Go example:

2023/02/14 16:03:14 [lag report] min:17.276279ms max:982.717708ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:15 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:16 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:17 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:18 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:19 [lag report] min:17.276279ms max:982.906476ms box:[ 17.482207ms 100.660104ms 982.523742ms ] no outliers withing threshold:2ms

Here there aren't any "outliers" under a classic boxplot analysis, because the 75th percentile is so heavily skewed, sitting up around 980ms.

dmah42 commented 1 year ago

this was completely my decision to hide the max, thinking it would make it simpler to reason about for end users, but you're right that it also limits flexibility and more throttling comes at a significant cost.

so let's just expose the max :)

bnm3k commented 1 year ago

Just a heads up, the links to the Linux docs, LinkedIn 2016, Indeed, and Dan Luu articles are missing.

dmah42 commented 3 months ago

in my previous comment i meant expose the period.