Fork of QEMU, with support for pinning virtual CPUs/cores/threads to their physical counterparts.
This project is "semi-discontinued", due to lack of resources (QEMU developers often refactor the codebase, which sometimes makes updating the patch non-trivial for a dev without domain knowledge).
I will sync master with new QEMU versions, except in cases where the patch has conflicts that I can't easily solve; in such situations, I will open an issue tagged with help wanted. Users can open issues for general problems, but I can't guarantee solving them (I'll also label such issues with help wanted).
I maintain branches with all the previous patched versions, so if, at any point in time, the latest QEMU hasn't been patched, an older version can be used.
Thanks to the user arcnx, who made the patch compatible with v6.2.
The user igrekster has kindly contributed a Python script for pinning the QEMU processes without patching QEMU. Users wishing to improve it can open a (phony) PR or contact me directly (find the contacts in my profile page).
The code was originally a QEMU 2.4.1 patch published on the QEMU mailing list; I've made significant fixes/cleanups/improvements, and I periodically rebase/update/extend it on top of the release QEMU versions.
Pinning is accomplished using the Linux interfaces cpu_set_t (and related macros) and pthread_setaffinity_np.
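As a rough host-side illustration of what this interface does (not code from this fork), the shell-level equivalent can be reproduced with taskset from util-linux; the PID in the sample output is just a placeholder:

# Start an arbitrary workload pinned to host CPU 3; this is the shell analogue
# of what the patch does per vCPU thread via pthread_setaffinity_np.
taskset -c 3 sleep 60 &

# Inspect the affinity mask actually applied by the kernel.
taskset -cp "$!"
# pid 12345's current affinity list: 3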
There is one addition (separate from pinning) in this fork, along with an option to enable easy(er)-to-debug builds; see the build script help (build_pinning_qemu_binary.sh --help) for the details.
A script is provided to build the project on common Linux distros and to enable additional features.
From the project root, run:
# Add `--help` to display the help.
#
./build_pinning_qemu_binary.sh
The project will build, and display the locations of the binaries (system/user emulators).
A few important notes:
- the set of build targets is restricted (see the ./configure command in the build script);
- only the required data files are used (the pc-bios directory for the x86-64 architecture);
- it's trivial to adjust the script to run it on other distributions.
Pinning is exposed through the -vcpu command-line option; this is the generic structure of a pinned execution:
qemu-system-x86_64 \
-smp $total_vcpus,cores=$vcores,sockets=$vsockets,threads=$vthreads \
-vcpu vcpunum=$vcpu_number,affinity=$host_processor_number \
-vcpu vcpunum=$vcpu_number,affinity=$host_processor_number \
...
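For instance, a complete (purely illustrative) invocation could look like the following; the KVM flag and the disk image path are placeholders for a real setup, and the affinity values must match the actual host topology, which can be determined as shown in the next sections:

qemu-system-x86_64 \
-enable-kvm \
-smp 4,cores=2,sockets=1,threads=2 \
-vcpu vcpunum=0,affinity=1 -vcpu vcpunum=1,affinity=9 \
-vcpu vcpunum=2,affinity=2 -vcpu vcpunum=3,affinity=10 \
-hda /path/to/disk.img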
The following section will show common pinning scenarios, and convenient scripts that prepare them.
The scenarios in this section (except where specified) apply to a processor with the following topology (a Ryzen 7 3800x):
$ lstopo --of console --no-io --no-caches
Machine (31GB total) + Package L#0
NUMANode L#0 (P#0 31GB)
Core L#0
PU L#0 (P#0)
PU L#1 (P#8)
Core L#1
PU L#2 (P#1)
PU L#3 (P#9)
Core L#2
PU L#4 (P#2)
PU L#5 (P#10)
Core L#3
PU L#6 (P#3)
PU L#7 (P#11)
Core L#4
PU L#8 (P#4)
PU L#9 (P#12)
Core L#5
PU L#10 (P#5)
PU L#11 (P#13)
Core L#6
PU L#12 (P#6)
PU L#13 (P#14)
Core L#7
PU L#14 (P#7)
PU L#15 (P#15)
Empirically, based on the scripts presented here, QEMU exposes the threads (vCPUs) sequentially, and Windows interprets physical processors as contiguous blocks.
Therefore, if one wants to, say, pin core 0 and its two threads, they need to pass:
-smp 2,cores=1,sockets=1,threads=2 -vcpu vcpunum=0,affinity=0 -vcpu vcpunum=1,affinity=8
Assign one virtual CPU per host core (not per thread):
CORES_NUMBER=$(lscpu --all -p=CORE | grep -v '^#' | sort | uniq | wc -l)
SMP_PARAMS=" -smp $CORES_NUMBER,cores=$CORES_NUMBER,sockets=1,threads=1"
for core_number in $(seq 1 $CORES_NUMBER); do
SMP_PARAMS+=" \\
-vcpu vcpunum=$((core_number - 1)),affinity=$((core_number - 1))"
done
echo "$SMP_PARAMS"
# -smp 8,cores=8,sockets=1,threads=1 \
# -vcpu vcpunum=0,affinity=0 \
# -vcpu vcpunum=1,affinity=1 \
# -vcpu vcpunum=2,affinity=2 \
# -vcpu vcpunum=3,affinity=3 \
# -vcpu vcpunum=4,affinity=4 \
# -vcpu vcpunum=5,affinity=5 \
# -vcpu vcpunum=6,affinity=6 \
# -vcpu vcpunum=7,affinity=7
A typical configuration is to give all hardware cores/threads to the guest, with the exception of one core.
First, one needs to obtain the host CPU layout; a simple way is:
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 4200,0000 800,0000
1 0 0 1 1:1:1:0 yes 4200,0000 800,0000
2 0 0 2 2:2:2:0 yes 4200,0000 800,0000
3 0 0 3 3:3:3:0 yes 4200,0000 800,0000
4 0 0 0 0:0:0:0 yes 4200,0000 800,0000
5 0 0 1 1:1:1:0 yes 4200,0000 800,0000
6 0 0 2 2:2:2:0 yes 4200,0000 800,0000
7 0 0 3 3:3:3:0 yes 4200,0000 800,0000
CPU represents a CPU from the Linux perspective; therefore, if the CPU supports SMT ("Hyper-Threading" on Intel), each CPU is a thread.
In this case, for example, CORE 0 has two threads, represented by CPU 0 and CPU 4.
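To cross-check which CPUs are the threads of a given core, the kernel's sysfs topology files can also be read directly; the sample output below is what the layout above would produce for CORE 0:

# Sibling threads of host core 0 (i.e. the CPUs sharing that physical core).
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# 0,4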
With the configuration above, and the objective of passing all except one core, the pinning parameters are:
-smp 6,cores=3,sockets=1,threads=2
-vcpu vcpunum=0,affinity=1 -vcpu vcpunum=1,affinity=5
-vcpu vcpunum=2,affinity=2 -vcpu vcpunum=3,affinity=6
-vcpu vcpunum=4,affinity=3 -vcpu vcpunum=5,affinity=7
This will yield, in a Windows guest, 3 physical processors with 2 logical processors each, mapped to the host CPUs (1,5), (2,6) and (3,7).
The configuration above can be automated with:
# Exclude core 0, and cluster the threads, sorted by (socket, core)
CPUS_DATA=$(lscpu --all --parse=SOCKET,CORE,CPU | grep -vP '^(#|0,0)' | sort -t ',' -n)
THREADS=$(echo "$CPUS_DATA" | wc -l)
CORES=$(echo "$CPUS_DATA" | cut -d ',' -f 2 | sort | uniq | wc -l)
SOCKETS=$(echo "$CPUS_DATA" | cut -d ',' -f 1 | sort | uniq | wc -l)
SMP_PARAMS=" -smp $THREADS,cores=$CORES,threads=$(($THREADS / $CORES))"
vcpu=0; while read cpu_entry; do
affinity=$(echo $cpu_entry | cut -d ',' -f 3)
SMP_PARAMS+=" \\
-vcpu vcpunum=$vcpu,affinity=$affinity"
vcpu=$(($vcpu + 1))
done <<< "$CPUS_DATA"
echo "$SMP_PARAMS"
# -smp 14,cores=7,threads=2 \
# -vcpu vcpunum=0,affinity=1 \
# -vcpu vcpunum=1,affinity=9 \
# -vcpu vcpunum=2,affinity=10 \
# -vcpu vcpunum=3,affinity=2 \
# -vcpu vcpunum=4,affinity=11 \
# -vcpu vcpunum=5,affinity=3 \
# -vcpu vcpunum=6,affinity=12 \
# -vcpu vcpunum=7,affinity=4 \
# -vcpu vcpunum=8,affinity=13 \
# -vcpu vcpunum=9,affinity=5 \
# -vcpu vcpunum=10,affinity=14 \
# -vcpu vcpunum=11,affinity=6 \
# -vcpu vcpunum=12,affinity=15 \
# -vcpu vcpunum=13,affinity=7
The following script, courtesy of Frédéric Pétrot (University of Grenoble), prepares a pinning configuration for an arbitrary number of vCPUs, passed as the script's argument:
#!/bin/bash
if test -z "$1"; then
echo "$0: Please provide the number of virtual cpus"
exit 1
fi
nvcpus=$1
# Choosing to sort in such a way that it is easier to see if there is a bug in the program :)
CPUS_DATA=$(lscpu --all --parse=SOCKET,CORE,CPU | grep -vP '^(#)' | sort -t ',' -k 1,1n -k 2,2n -k 3,3n)
declare CPUS_ENTRY
i=0; while read cpu_entry; do
CPUS_ENTRY[$i]=$cpu_entry
i=$(($i + 1))
done <<< "$CPUS_DATA"
cpus=$(nproc)
THREADS=$(echo "$CPUS_DATA" | wc -l)
CORES=$(echo "$CPUS_DATA" | cut -d ',' -f 2 | sort | uniq | wc -l)
SOCKETS=$(echo "$CPUS_DATA" | cut -d ',' -f 1 | sort | uniq | wc -l)
# A bit of a wild guess, ...
threads=$(($THREADS/$CORES))
cores=$(($(($nvcpus + 1))/$threads))
QEMU_SMP=" -smp $nvcpus,cores=$cores,threads=$threads"
for vcpu in $(seq 0 $(($nvcpus - 1))); do
affinity=$(echo ${CPUS_ENTRY[$(($vcpu%$cpus))]} | cut -d ',' -f 3)
QEMU_AFFINITIES="$QEMU_AFFINITIES \\
-vcpu vcpunum=$vcpu,affinity=$affinity"
done
echo "$QEMU_SMP $QEMU_AFFINITIES"
Result of vcpu-assign.sh 10:
-smp 10,cores=5,threads=2 \
-vcpu vcpunum=0,affinity=0 \
-vcpu vcpunum=1,affinity=8 \
-vcpu vcpunum=2,affinity=1 \
-vcpu vcpunum=3,affinity=9 \
-vcpu vcpunum=4,affinity=2 \
-vcpu vcpunum=5,affinity=10 \
-vcpu vcpunum=6,affinity=3 \
-vcpu vcpunum=7,affinity=11 \
-vcpu vcpunum=8,affinity=4 \
-vcpu vcpunum=9,affinity=12
Pinning can be verified in many ways.
An easy one is to use htop on the host, and the Microsoft-advised CPU Stress tool (CPUSTRES.EXE) on the guest, then rotate the affinity via the Task Manager.
The procedure is:
- run htop on the host;
- on the guest, run CPUSTRES.EXE, setting one or more threads to Maximum activity;
- from the Task Manager's Details tab, change the affinity of CPUSTRES.EXE, and watch in htop how the load moves across the host threads.
Don't forget that the Windows CPU (thread) indexes won't match the Linux ones!
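Another host-side check, for those who prefer the command line, is to read the affinity of each vCPU thread directly; the sketch below assumes a KVM guest whose vCPU threads are named "CPU n/KVM" (the usual naming), and matches the first qemu-system-x86_64 process found, so adjust it to your setup:

# Illustrative check: print the affinity of every vCPU thread of the first
# QEMU process found; thread names and PIDs depend on the local setup.
qemu_pid=$(pgrep -f qemu-system-x86_64 | head -n 1)
for tid in $(ls "/proc/$qemu_pid/task"); do
  name=$(cat "/proc/$qemu_pid/task/$tid/comm")
  case "$name" in
    CPU*) echo "$name -> $(taskset -cp "$tid" | awk -F': ' '{print $2}')" ;;
  esac
done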
This patch should also support multi-socket systems.
Pinning does not necessarily improve performance; in some cases, it may actually worsen it. Therefore, when enabling it, users should carefully test the impact on their use case.
In general terms, the pinned processors should be excluded from kernel scheduling (the isolcpus boot option). Under normal conditions, at least in terms of throughput, the Linux kernel does a good job at scheduling.
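As a concrete (and purely illustrative) example, with the all-but-core-0 layout shown earlier (guest pinned to host CPUs 1-7 and 9-15), the kernel parameter on a GRUB-based distro could be set along these lines; the values below are assumptions tied to that example, not something the build script configures:

# /etc/default/grub (illustrative values for the example layout above)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=1-7,9-15"

# Then regenerate the GRUB configuration and reboot; the command varies by
# distro, e.g. on Debian/Ubuntu:
sudo update-grub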
Performance testing with pinning has been the subject of the research paper "To Pin or Not to Pin: Asserting the Scalability of QEMU Parallel Implementation".
Since this fork needs to strictly adhere to the upstream git structure, a system of branches has been established, in order to keep the changes compartmentalized and easy to inspect/track.
The master branch is always the latest QEMU stable version (generally, with a lag of a few days, unless there are specific issues), with the latest versions of the patches and the build script.
The fork commits sit on top of the upstream history; in order from HEAD, they include the README.md and gitignore updates.
The repository contains a line of historical branches, vX.Y.Z-pinning, which are snapshots of previous stable versions, along with the corresponding vX.Y.Z-pinning-changes branches.
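For reference, using one of the historical branches is just a normal git checkout; the version number below is only an example of the vX.Y.Z-pinning naming, not a statement about which versions exist:

# Illustrative example: build a previously patched release (replace the
# version with an existing vX.Y.Z-pinning branch).
git fetch --all
git checkout v8.2.0-pinning
./build_pinning_qemu_binary.sh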
I've found libvirt to be a very interesting idea, but ultimately, a leaky abstraction:
- the GUI support (virt-manager) is poor: many entries must be edited via virsh edit;
- complex QEMU configurations don't always translate cleanly.
The latter point may be caused by my poor libvirt knowledge, but the fact that libvirt's functionality is built on top of QEMU always stands, and complex QEMU configurations are bound to have translation challenges.
I'm sure, of course, that for simple setups, libvirt + virt-manager may work very well.