aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.
https://aurae.io
Apache License 2.0

Concerns About CPU Quota Implementation #407

Closed jcorbin closed 3 months ago

jcorbin commented 1 year ago

Background: Linux CFS Throttling

In short, CFS CPU throttling works by:

- giving each cgroup a budget of CPU time (the quota) that refills at the start of every scheduling period
- letting the group's tasks run until that budget is spent
- descheduling (throttling) every task in the group for the remainder of the period once the budget is gone

In long, see the Linux kernel scheduler documentation and the LinkedIn (2016), Indeed, and Dan Luu write-ups on CFS throttling.

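To make the mechanics concrete, here is a minimal Go sketch (mine, not Aurae code) that drives the cgroup v2 interface directly. The cgroup name cpu-burn-demo is arbitrary, and it assumes root plus a cgroup v2 mount at /sys/fs/cgroup with the cpu controller enabled:

// Minimal cgroup v2 CFS quota demo: 400ms of CPU time per 1s period,
// i.e. the same shape as the 400ms/1_000ms examples below.
// Assumes root, cgroup v2 at /sys/fs/cgroup, and the cpu controller enabled.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	cg := "/sys/fs/cgroup/cpu-burn-demo" // hypothetical cgroup name
	must(os.MkdirAll(cg, 0o755))

	// cpu.max takes "<quota_us> <period_us>": 400_000us of CPU time may be
	// consumed per 1_000_000us period; once it is spent, every task in the
	// group is throttled until the next period begins.
	must(os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("400000 1000000"), 0o644))

	// Move this process into the group.
	must(os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(fmt.Sprint(os.Getpid())), 0o644))

	// Burn CPU for a few seconds...
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
	}

	// ...then inspect the throttling counters the kernel kept for us.
	stat, err := os.ReadFile(filepath.Join(cg, "cpu.stat"))
	must(err)
	fmt.Print(string(stat))
}

Running something like this for a few seconds should show nr_periods and nr_throttled climbing once per period, much like the cpu.stat excerpts further down.
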
Background: Aurae

Aurae uses CFS throttling to enforce cell CPU time quotas, similarly to Docker, as described in the various background articles above.

However, it has made the interesting choice to hide the CFS throttling period, exposing only a max CPU time quota field in its API. Furthermore, Aurae has hardcoded the CFS period to 1s, which is 10x its typical default value of 100ms.
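
To put that 10x in rough terms, assuming a single busy thread and a 40%-of-one-core budget: with the kernel-default 100ms period the budget is 40ms per window, so a throttled task waits at most ~60ms before it runs again; with a 1s period the same 40% budget is 400ms per window, and the wait after exhaustion stretches to ~600ms. At a fixed fractional quota, the worst-case stall scales linearly with the period.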

Problem: Large Latency Artifacts

The primary problem with how Aurae's CPU quotas currently work is large latency artifacts: a cell that exhausts its quota early in the 1s period has every task in it descheduled until the next period starts, so a saturating workload sees pauses of several hundred milliseconds.

See the example section below for code and excerpted data exploring this effect.

In the case of a request-processing service, these are SLO-breaking levels of latency. In fact, the typical 100ms CFS period is already material to such things.

Having even larger latency artifacts, now in the 600ms-900ms range, might be bad enough to affect things like health checks and cluster traffic-routing systems.
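
As a rough back-of-envelope for the excerpts below: with N busy threads, a quota Q, and a period P, the group burns through Q after about Q/N of wall-clock time and is then throttled for roughly P - Q/N, which bounds the extra delay any single timer tick can see:

- Node.js, 1 thread, 400ms/1_000ms: throttled ~600ms per period; the observed 697-699ms intervals are the 100ms tick plus ~600ms of stall
- Go, 4 threads, 2_000ms/1_000ms: quota gone after ~500ms, up to ~500ms of stall; observed intervals top out around 471ms
- Go, 8 threads, 400ms/1_000ms: quota gone after ~50ms, up to ~950ms of stall; observed intervals reach ~983ms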

Proposal: at least expose the CFS period; maybe lower its default

At the very least, in my view, Aurae needs to expose the CFS period alongside the max CPU time.

I'm less convinced about lowering the default period.

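For illustration only, the shape could be as simple as a pair of fields; these names are hypothetical and not Aurae's actual API:

// Hypothetical sketch only (not Aurae's actual proto/API) of the two
// knobs the proposal would expose together.
package cells

type CellCPU struct {
	// MaxMicros: CPU time the cell may consume per period
	// (what the current API exposes as the lone quota field).
	MaxMicros int64
	// PeriodMicros: the CFS period that quota is charged against.
	// Whether its default stays at 1_000_000 (1s) or moves toward the
	// kernel default of 100_000 (100ms) is the separate question above.
	PeriodMicros int64
}
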
Example Code / Data

To confirm my own recollection of how this all works, and to allow easy reproduction by others, I've written some example programs in #406 (a Node.js variant and a Go variant, excerpted below).
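
The core of the measurement is roughly the following, a simplified Go sketch of the idea rather than the actual #406 code: saturate the CPUs inside a quota-limited cgroup while a 100ms ticker records how late each tick actually arrives.

// Simplified sketch of the lag-measurement idea (not the #406 code):
// burn CPU while a 100ms ticker records how late each tick arrives.
// Run it inside a quota-limited cgroup and leave it running until killed.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Saturate every core so the cgroup quota is exhausted early
	// in each CFS period.
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			for {
			}
		}()
	}

	const expected = 100 * time.Millisecond
	ticker := time.NewTicker(expected)
	defer ticker.Stop()

	prev := time.Now()
	for range ticker.C {
		now := time.Now()
		actual := now.Sub(prev)
		prev = now
		// Report ticks that arrive well past their expected interval.
		if actual > 2*expected {
			log.Printf("lag: actual %v expected %v", actual, expected)
		}
	}
}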

Example Excerpt: Node.js burning about 1 CPU core within a 400ms/1_000ms quota

After running for around 30 seconds, the node example program experiences 600-700ms latency excursions:

[lag report] min:100 max:699 box:[ 101 101 276 ] hi:626 hiOutliers:24 2.42%
  ┌─────────┬───────┬────────┬──────────┐
  │ (index) │ start │ actual │ expected │
  ├─────────┼───────┼────────┼──────────┤
  │    0    │ 1815  │  697   │   100    │
  │    1    │ 5815  │  697   │   100    │
  │    2    │ 8815  │  697   │   100    │
  │    3    │ 9815  │  697   │   100    │
  │    4    │ 10815 │  697   │   100    │
  │    5    │ 12815 │  697   │   100    │
  │    6    │ 13815 │  697   │   100    │
  │    7    │ 17815 │  697   │   100    │
  │    8    │ 18815 │  697   │   100    │
  │    9    │ 21815 │  697   │   100    │
  │   10    │  815  │  698   │   100    │
  │   11    │ 2814  │  698   │   100    │
  │   12    │ 4814  │  698   │   100    │
  │   13    │ 7814  │  698   │   100    │
  │   14    │ 14814 │  698   │   100    │
  │   15    │ 16814 │  698   │   100    │
  │   16    │ 19814 │  698   │   100    │
  │   17    │ 22814 │  698   │   100    │
  │   18    │ 23814 │  698   │   100    │
  │   19    │ 3814  │  699   │   100    │
  │   20    │ 6813  │  699   │   100    │
  │   21    │ 11813 │  699   │   100    │
  │   22    │ 15813 │  699   │   100    │
  │   23    │ 20813 │  699   │   100    │
  └─────────┴───────┴────────┴──────────┘

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 12794354
user_usec 12704213
system_usec 90141
core_sched.force_idle_usec 0
nr_periods 31
nr_throttled 31
throttled_usec 18312824
nr_bursts 0
burst_usec 0
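
Reading those counters: every one of the 31 elapsed periods was throttled (nr_throttled == nr_periods), and throttled_usec / nr_throttled ≈ 18312824 / 31 ≈ 591ms of throttled time per period, which lines up with the ~600ms of extra delay in the intervals above.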

Example Excerpt: Go burning 4 CPU cores within a 2_000ms/1_000ms quota

Here's a similar result from running an analogous Go program for around 30 seconds:

2023/02/14 15:58:51 [lag report] min:27.89854ms max:472.131789ms box:[ 99.880519ms 99.992654ms 100.151136ms ] hi:101.992654ms hiOutliers:12 12.0%
2023/02/14 15:58:51 {Start:2023-02-14 15:58:39.214184729 -0500 EST m=+20.401723712 End:2023-02-14 15:58:39.485175472 -0500 EST m=+20.672714508 Actual:270.990796ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:49.014233509 -0500 EST m=+30.201772457 End:2023-02-14 15:58:49.485037021 -0500 EST m=+30.672576095 Actual:470.803638ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:45.014041338 -0500 EST m=+26.201580281 End:2023-02-14 15:58:45.485087655 -0500 EST m=+26.672626688 Actual:471.046407ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:51.01440381 -0500 EST m=+32.201942761 End:2023-02-14 15:58:51.485465242 -0500 EST m=+32.673004332 Actual:471.061571ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:43.014046723 -0500 EST m=+24.201585667 End:2023-02-14 15:58:43.48512526 -0500 EST m=+24.672664308 Actual:471.078641ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:44.01403297 -0500 EST m=+25.201571914 End:2023-02-14 15:58:44.485149938 -0500 EST m=+25.672688966 Actual:471.117052ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:48.01406235 -0500 EST m=+29.201601293 End:2023-02-14 15:58:48.48522296 -0500 EST m=+29.672762000 Actual:471.160707ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:50.014036425 -0500 EST m=+31.201575373 End:2023-02-14 15:58:50.485200964 -0500 EST m=+31.672739934 Actual:471.164561ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:40.014035733 -0500 EST m=+21.201574682 End:2023-02-14 15:58:40.485247383 -0500 EST m=+21.672786397 Actual:471.211715ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:42.014066621 -0500 EST m=+23.201605565 End:2023-02-14 15:58:42.485391197 -0500 EST m=+23.672930199 Actual:471.324634ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:46.014037924 -0500 EST m=+27.201576868 End:2023-02-14 15:58:46.485420931 -0500 EST m=+27.672959951 Actual:471.383083ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:47.014097749 -0500 EST m=+28.201636694 End:2023-02-14 15:58:47.486229431 -0500 EST m=+28.673768483 Actual:472.131789ms Expected:100ms}

Here the actual intervals run a little lower, since the CPU quota is a little less oversubscribed. The low end of the box stat may also seem surprising, but it is an artifact of how a constant-interval Go ticker behaves after encountering runtime lag: coming out of a pause, it delivers a couple of ticks in rapid succession.

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 65863014
user_usec 60727707
system_usec 5135306
core_sched.force_idle_usec 0
nr_periods 32
nr_throttled 32
throttled_usec 32261496
nr_bursts 0
burst_usec 0

Example Excerpt: Go burning 8 CPU cores within a 400ms/1_000ms quota

For a final extreme example, here's an even more over-subscribed Go example:

2023/02/14 16:03:14 [lag report] min:17.276279ms max:982.717708ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:15 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:16 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:17 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:18 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:19 [lag report] min:17.276279ms max:982.906476ms box:[ 17.482207ms 100.660104ms 982.523742ms ] no outliers withing threshold:2ms

Here there aren't any "outliers" under a classic boxplot analysis, because the 75th percentile is so heavily skewed, sitting up around 980ms.

dmah42 commented 1 year ago

this was completely my decision to hide the max, thinking it would make it simpler to reason about for end users, but you're right that it also limits flexibility and more throttling comes at a significant cost.

so let's just expose the max :)

bnm3k commented 1 year ago

Just a heads up, the links to the Linux docs, LinkedIn 2016, Indeed, and Dan Luu articles are missing.

dmah42 commented 3 months ago

in my previous comment i meant expose the period.