canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

VM CPU auto pinning causes slowdowns and stealtime #14133

Closed: pkramme closed this issue 2 weeks ago

pkramme commented 1 month ago

Required information

Issue description

The introduction of automatic core scheduling has led to a significant decrease in performance in our infrastructure, with weird problems that make no sense unless you are aware of this change, such as general slowdowns and high steal time inside the VMs.

LXD's current placement scheduler doesn't seem to understand hardware topology, which is really surprising considering that many new CPUs are asymmetric and that, on the kernel side, a lot of work goes into putting workloads on "the best core for the job", with features like AMD Preferred Core (and its equivalents) and new schedulers like EEVDF.

It seems odd to put these placement decisions into LXD and turn them on by default, without an off switch, when LXD cannot tell whether this will cause significant problems. LXD simply does not have enough data, and static round-robin placement is far too simplistic.

From our perspective this is a significant design error in this feature, and we ask that it either be

  1. reworked so that hardware topology is accurately picked up, including L3 cache differences, CCD layouts, preferred-core data, etc.,
  2. enhanced with an option to turn it off completely,
  3. turned off by default, or
  4. removed.

Additionally, snap's auto-update mechanism is what introduced this new feature into our infrastructure (which by itself is fine), and we'd ask you to consider that features like this are continuously applied to real workloads and, even outside the LTS channel, should at least not be harmful.

Information to attach

Our current hardware topology has two L3 domains of different sizes. Our VMs run typical web application workloads. The core load balancing has pinned multiple CPU-bound vCPUs onto the same physical core, leading to the steal time described above.

# lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 125GB)
    L3 L#0 (96MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    L3 L#1 (32MB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#24)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#25)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#26)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#27)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)
pkramme commented 1 month ago

We threw together a quick script to visualize the problem; it lists, for each host CPU, the VMs whose vCPU threads are pinned to it (a simplified sketch of the approach follows the output):

0:  seg18-app1, seg18-mysql1
1:  seg17-app1, seg18-lb1
2:  seg19-app1, seg19-redis1
3:  seg19-mysql1
4:  seg17-app1
5:  seg17-mysql1, seg18-redis1
6:  seg18-redis1, seg19-mysql1
7:  seg19-app1, seg19-mysql1
8:  seg18-app1, seg19-app1
9:  seg18-app1, seg19-app1
10: seg17-lb1, seg18-mysql1
11: seg19-mysql1
12: seg19-app1, seg19-redis1
13: seg18-app1, seg18-mysql1
14: seg17-app1, seg17-redis1
15: seg18-mysql1
16: seg19-mysql1
17: seg19-app1, seg19-lb1
18: seg17-app1, seg17-redis1
19: seg17-mysql1, seg19-app1
20: seg18-app1, seg18-mysql1
21: seg19-mysql1
22: seg18-app1, seg18-mysql1
23: seg19-mysql1
24: seg19-mysql1
25: seg17-lb1, seg18-mysql1
26: seg18-mysql1
27: seg17-app1, seg18-app1
28: seg17-app1, seg18-app1
29: seg17-app1, seg19-app1
30: seg17-app1, seg18-lb1
31: seg19-lb1
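
A simplified sketch of the approach (not the exact script; it assumes the instance name follows QEMU's -name argument on the command line and treats any thread confined to a subset of the host CPUs as pinned):

#!/bin/bash
# Run on the LXD host: for every QEMU process, read the CPU affinity of each
# of its threads and build a host CPU -> VMs map from the pinned ones.
total=$(nproc --all)
declare -A cpu_to_vms

for pid in $(pgrep -f qemu-system); do
    # Assumption: the instance name is the argument after QEMU's -name flag.
    name=$(tr '\0' '\n' < "/proc/$pid/cmdline" | grep -x -A1 -- '-name' | tail -n1)
    [ -n "$name" ] || continue

    for status in /proc/"$pid"/task/*/status; do
        allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' "$status") || continue

        # Expand lists/ranges such as "3,16-17" into individual CPU numbers.
        cpus=()
        for part in ${allowed//,/ }; do
            if [[ $part == *-* ]]; then
                cpus+=($(seq "${part%-*}" "${part#*-}"))
            else
                cpus+=("$part")
            fi
        done

        # Threads allowed on every host CPU are not pinned; skip them.
        (( ${#cpus[@]} < total )) || continue

        for cpu in "${cpus[@]}"; do
            cpu_to_vms[$cpu]+="$name "
        done
    done
done

# Print one line per host CPU with the (deduplicated) VMs pinned to it.
for cpu in $(printf '%s\n' "${!cpu_to_vms[@]}" | sort -n); do
    printf '%-3s %s\n' "$cpu:" "$(printf '%s\n' ${cpu_to_vms[$cpu]} | sort -u | paste -sd, -)"
done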

The general rule with this system is that the current placement puts very latency-critical systems on the same (hyper)core while leaving systems with no real load on cores of their own. Even if this were a completely symmetrical CPU, and even if none of these cores were hyperthread siblings, it would still waste resources whenever those VMs are not equally loaded.

tomponline commented 1 month ago

Thanks for your detailed report!

Yeah this was an area of concern originally:

Note: On systems that have mixed performance and efficiency cores (P+E) you may find that VM performance is decreased due to the way LXD now pins some of the VM’s vCPUs to efficiency cores rather than letting the Linux scheduler dynamically schedule them. You can use the explicit CPU pinning feature if needed to avoid this.

https://discourse.ubuntu.com/t/lxd-6-1-has-been-released/46259#vm-automatic-core-pinning-load-balancing
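
For reference, the explicit pinning mentioned in that note is done by setting limits.cpu to a CPU set or range rather than a count; a minimal sketch, using a hypothetical VM named v1 and the first four cores of your larger L3 domain:

# lxc config set v1 limits.cpu 0-3

Depending on the version, a running VM may need a restart for the new pinning to take effect.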

But we are considering options 2 and 3 of your suggestions.

pkramme commented 1 month ago

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like stealtime or cpu pressure. We'd much rather just let the kernel handle it.

tomponline commented 1 month ago

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like stealtime or cpu pressure. We'd much rather just let the kernel handle it.

The latest 5.21/stable LTS series does not have this feature (deliberately, because it changes the default behaviour), so you could try that. It's more suitable for production use anyway, as latest/stable is the moving feature-release channel and doesn't support downgrades.

See https://documentation.ubuntu.com/lxd/en/latest/installing/#installing-release

The 6.1 release won't get patches now; it will be replaced by 6.2, 6.3, etc. Hopefully we can land the new setting in one of those releases, but 6.2 is imminent, so it might not make it into that one.
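
For reference, the channel switch itself is a single snap refresh, though per the note above, going from latest/stable back to 5.21 is a downgrade and isn't supported in place, so in practice it may mean redeploying hosts or migrating instances:

# snap refresh lxd --channel=5.21/stable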

tomponline commented 1 month ago

I've been chatting with @morphis and he proposes adding a new setting:

limits.cpu.pin_strategy=[none|auto]

Where none would disable auto pinning (and become the new default), and auto would keep the current 6.1 default behaviour.
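
If it lands as proposed, opting a VM out of automatic pinning would presumably look something like this (hypothetical until the setting actually ships, with v1 as a placeholder instance name):

# lxc config set v1 limits.cpu.pin_strategy=none

Presumably it could also be set through a profile to cover all VMs at once.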

pkramme commented 1 month ago

This would be great. We've begun reverting to 5.21, but having the option to disable this would still be valuable, especially once we eventually move to the next LTS release or if we stay on the feature releases. Thanks a lot for your work so far @tomponline and @kadinsayani!