ayufan / pve-helpers

A set of Proxmox VE scripts that aid with suspend/resume and CPU pinning

SMT pinning is broken/wrong #9

Open gnif opened 2 years ago

gnif commented 2 years ago

Hi, I do not use your scripts, but we are seeing users in the Looking Glass Discord who are having latency-related issues due to how your script assigns CPUs to the VM.

The issue is that you are not replicating the host topology into the guest. If done properly, the guest can know that the extra vCPU is sharing a core, and even the L1/L2/L3 cache arrangement.

Here is how a guest sees a properly configured VM on an SMT host (using Coreinfo): [Coreinfo screenshot]

Doing this, the guest scheduler can make wise decisions about where to run each thread. Obviously you need to pin each vCPU properly to the threads of each core to make this work well. If done correctly, your cache mapping will also align with the physical hardware; see below.
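For reference, the topology itself is described to the guest with QEMU's -smp option. A minimal fragment for a 16-vCPU guest spanning 8 physical cores might look like the following (plain QEMU syntax, not something these scripts emit today):

-smp 16,sockets=1,cores=8,threads=2

With threads=2 the guest sees each adjacent pair of vCPUs as SMT siblings of one core, which is what the Coreinfo output above reflects.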

Here is my host topology (AMD EPYC 7343): [lstopo screenshot]

My guest is pinned to CPU cores 8-15, which means

vCPU  0 &  1 = CPU  8 & 24
vCPU  2 &  3 = CPU  9 & 25
vCPU  4 &  5 = CPU 10 & 26
vCPU  6 &  7 = CPU 11 & 27
vCPU  8 &  9 = CPU 12 & 28
vCPU 10 & 11 = CPU 13 & 29
vCPU 12 & 13 = CPU 14 & 30
vCPU 14 & 15 = CPU 15 & 31

When done correctly, you can see that my pinning aligns with the cache map and allows the guest to make proper use of SMT. [Coreinfo screenshot]

Note: AMD processors require the QEMU CPU flag topoext so they can use SMT.
Note 2: To get the cache to align, you also have to set the QEMU CPU flags l3-cache=on,host-cache-info=on.
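On Proxmox, one place these can go (an assumption on my part, not something from the pve-helpers docs) is a raw -cpu option appended via the VM's args: line in /etc/pve/qemu-server/<vmid>.conf, e.g.:

args: -cpu host,topoext=on,l3-cache=on,host-cache-info=on

Since Proxmox generates its own -cpu option as well, verify the resulting command line with qm showcmd <vmid> to confirm the intended flags are actually in effect.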

Onepamopa commented 2 years ago

How do you output the CPU-to-cache map?

gnif commented 2 years ago

I used lstopo on Linux for this graphic, and on Windows, Coreinfo from Sysinternals: https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo
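If you just want the text form on Linux, something like the following should do (assuming the hwloc and util-linux packages are installed; exact column support may vary by version):

lstopo-no-graphics                        # text rendering of the package/core/cache tree
lscpu --extended=CPU,CORE,SOCKET,CACHE    # per-CPU core, socket and L1d:L1i:L2:L3 cache IDs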

Onepamopa commented 2 years ago

btw, where do I set topoext and l3-cache=on,host-cache-info=on?

gnif commented 2 years ago

Issues are to direct the author of this project to a problem with their software, not to provide you with support.

ayufan commented 2 years ago

Thank you @gnif. This is known. However, as the docs say, you should only pass physical threads, not virtual ones: https://github.com/ayufan/pve-helpers#21-cpu_taskset. And depending on the CPU, the mapping is different.

Maybe one thing that is missing is documenting how to handle the L3, as when this was written there was no need to support a NUMA/multi-complex scenario.

Technically it is possible to replicate the full SMT topology, but at least I did not find it useful, or required, to do a physical-to-virtual CPU pinning of everything. Doing that is theoretically possible, but only libvirt supports it well.
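For completeness, here is a minimal sketch of what that per-vCPU pinning plus topology looks like in libvirt domain XML, using @gnif's mapping above (abbreviated; this is not something pve-helpers generates):

<vcpu placement='static'>16</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='8'/>
  <vcpupin vcpu='1' cpuset='24'/>
  <vcpupin vcpu='2' cpuset='9'/>
  <vcpupin vcpu='3' cpuset='25'/>
  <!-- ...continue the pattern up to vcpu='15' cpuset='31' -->
</cputune>
<cpu mode='host-passthrough'>
  <topology sockets='1' cores='8' threads='2'/>
  <cache mode='passthrough'/>
  <feature policy='require' name='topoext'/>
</cpu>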

gnif commented 2 years ago

@ayufan if I am understanding you correctly, you're saying to put two VMs on the same set of cores, but on separate threads? If so, this is a very, very bad idea: the VMs will stall each other and invalidate each other's caches.

According to your own documentation:

VM 1:
cpu_taskset 1-5

VM 2:
cpu_taskset 7-11

Based on that configuration, VM 1 would be on thread 1 of cores 1-5, and VM 2 would be on thread 2 of cores 1-5.

There is no such thing as a "virtual core" on the host system; both threads of a core are equal in every way. They are two identical pipelines running through, and sharing, some hardware that can cause them to stall each other. There is no "primary" thread, or "real" vs "virtual" thread.
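You can see this on the host itself: Linux simply enumerates a core's two hardware threads as sibling CPUs. For example (the IDs shown are hypothetical and depend entirely on how your CPU enumerates):

cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,7    # e.g. on a 6-core/12-thread part where the sibling of CPU N is N+6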

If the guest OS knows about the SMT model, the guest scheduler can ensure that high-priority threads, like those that service interrupts for GPUs, are put onto cores that can guarantee the best possible latency.

Note I am not stating this because I think it's a problem; I am stating this because it is a problem. We have people coming into our Discord reporting issues with Looking Glass that are the result of very poor configurations produced by this script. Looking Glass relies on low-latency servicing of its threads, as does the GPU's driver, whose goal is to be as low latency as possible.

but at least I did not find it useful

This is just it: you did not, due to your use case, but I am stating for a fact that it makes a huge difference under certain workloads, and you need to fix your scripts for those using such workloads, or stop promoting them.

ayufan commented 2 years ago

If so, this is a very, very bad idea: the VMs will stall each other and invalidate each other's caches.

You are fully correct; of course they will. I can imagine this being a problem in the case of Looking Glass, which effectively requires two systems to have low latency.

In my case, where I don't use Looking Glass and rather use a single VM at a time (but have all of them running), latency was not a problem, since the other VM is mostly idle.

How do you advise users to handle many VMs? Probably in this setup you expect VMs not to share physical cores, but rather to pass full SMT cores to them.

Anyway, I see this being a problem and am happy to document those caveats. Do you have a link where it's best to redirect people?

gnif commented 2 years ago

In my case, where I don't use Looking Glass and rather use a single VM at a time (but have all of them running), latency was not a problem, since the other VM is mostly idle.

In this case I would suggest you halve the number of cores you give to your VMs and give them both threads of each core; you will see a general performance uplift due to better management of your hardware.
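Concretely, sticking with the documentation's example above and assuming the host behind it is a 6-core/12-thread part whose siblings enumerate as N and N+6 (check thread_siblings_list as shown earlier), and that cpu_taskset passes its argument straight through to taskset -c, that would look something like:

VM 1 (cores 1-2, both threads):
cpu_taskset 1-2,7-8

VM 2 (cores 3-4, both threads):
cpu_taskset 3-4,9-10

Combined with threads=2 in the guest topology, each guest then knows which of its vCPUs share a core.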

Do you have a link where it's best to redirect people?

Not really, as we are just supporting people reporting issues with LG. Perhaps the VFIO Discord/subreddit?