Open DemiMarie opened 2 years ago
I can confirm that this problem is clearly noticeable. My Fedora AppVM, with up to 8 GB of memory and all cores assigned to it, runs much slower than a native Fedora install on bare metal. Compiling code takes roughly twice as long. I first blamed the lack of hyperthreading, but passing smt=on sched-gran=core to Xen makes no difference in my benchmarks (i7-3632qm):
Benchmark | Slowdown compared to native Fedora
---|---
sysbench cpu run | 60%
sysbench memory run | 1680%
Building some random C++ projects | 58%
Watching 1080p@60fps YouTube videos on Qubes OS boils my laptop and doesn't feel like 60fps. Native Fedora handles this much better, despite using only CPU-based video decoders[1].
Startup latency of apps on Qubes OS is much worse. qvm-run personal echo "hello world" takes almost an entire second.
[1] I'm not 100% sure it's CPU-only. But according to htop it hits all my cores really hard, while still being smoother and much cooler than Qubes OS.
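For anyone wanting to reproduce the latency observation, one rough way to measure the qvm-run round trip from dom0 is simply to time it against an already-running qube (a sketch; "personal" is just the qube name used in this report, and the qube must already be started so VM boot time isn't included):

```shell
# Rough round-trip latency of a qrexec call from dom0 to a running qube.
# --pass-io waits for the command to complete and relay its output.
time qvm-run --pass-io personal true
```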
Startup latency of apps on Qubes OS is much worse. qvm-run personal echo "hello world" takes almost an entire second.
This is definitely a problem and we’re working on it. It won’t ever be as fast as on bare hardware, but our goal is to go from the qvm-run command to the VM spawning init in under 100ms. Runtime CPU performance should be within a few percent of bare silicon, so the fact that it is not is definitely a bug.
I can confirm that this problem is clearly noticeable. My Fedora AppVM, with up to 8 GB of memory and all cores assigned to it, runs much slower than a native Fedora install on bare metal. Compiling code takes roughly twice as long. I first blamed the lack of hyperthreading, but passing smt=on sched-gran=core to Xen makes no difference in my benchmarks (i7-3632qm):
I suggest reverting this, as it is not security supported upstream. The fact that it did not help your benchmarks indicates that it is not likely to be the culprit.
Benchmark | Slowdown compared to native Fedora
---|---
sysbench cpu run | 60%
sysbench memory run | 1680%
Building some random C++ projects | 58%
Yeah that’s not good. For clarification: if native Fedora takes time X to build C++ projects, does this mean Qubes OS takes (X / (1 - 0.58)) time? If you could post the raw benchmark data, that would be very helpful.
sysbench memory run
is particularly concerning. Does turning off memory balancing for the qube help?
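For concreteness, plugging the reported 58% figure into that formula gives roughly a 2.4x build time. A one-liner to check the arithmetic:

```shell
# A 58% slowdown means Qubes time = native time / (1 - 0.58).
awk 'BEGIN { printf "%.2f\n", 1 / (1 - 0.58) }'   # prints 2.38
```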
I can't exactly reproduce this. More information on the workload is needed, I think. In my test, I used Linpack Xtreme. On Qubes, I have SMT enabled (12 vCPUs in my case) and 12 GB of RAM assigned to the VM (although Linpack only uses about 2 GB). The VM is PVH with memory balancing enabled. Everything else is default for Xen and kernel-latest. The benchmark VM is the only VM running while testing. CPU frequency started at about 3.5GHz and went down to ~2.1GHz over the duration of the test. My result was 100 GFlops.
Then I started a Fedora 36 ISO. However, CPU frequency started at 3.9GHz and went down to about 2.4 or 2.5GHz. My result was 112 GFlops.
Perhaps your CPU is not boosting correctly the way mine does?
sysbench memory run
is particularly concerning. Does turning off memory balancing for the qube help?
I turned off memory balancing via Qubes Manager and assigned fixed 10 GB of memory to the Qube. No improvement :confused:
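For reference, the same thing can be done from the dom0 CLI instead of Qubes Manager (a sketch; "personal" is a placeholder qube name, and this assumes the R4.x convention that maxmem=0 opts a qube out of dynamic memory balancing):

```shell
# Give the qube a fixed 10 GB and disable memory balancing (dom0 shell).
qvm-prefs personal memory 10240
qvm-prefs personal maxmem 0   # maxmem=0 turns off dynamic balancing
```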
For clarification: if native Fedora takes time X to build C++ projects, does this mean Qubes OS takes (X / (1 - 0.58)) time?
Yes! I don't have the raw benchmark data anymore, but I'm pretty sure it's the same problem that causes the sysbench deviations.
The result of sysbench inside an AppVM heavily depends on how many other AppVMs are running. Running only a single AppVM brings the results of sysbench cpu run pretty damn close to the values I get on native Fedora. I assume it's some form of scheduling issue. Running only a single AppVM also improves sysbench memory run results, but they're still way off compared to native Fedora.
I tried running the memory benchmark in dom0, where it is significantly faster. Here are the results of sysbench memory run:
Environment | MiB/sec |
---|---|
Native Fedora | 5290 |
dom0 | 4525 |
domU (only 1 AppVM) | 401 |
domU (+7 other AppVMs) | 267 |
I'm not sure if there is anything special about my system. My entire Xen and kernel setup is pretty vanilla. The only deviation is qubes.enable_insecure_pv_passthrough, because I don't have an IOMMU. Enabling/disabling this flag makes no difference.
@AlxHnr Can you try sysbench memory run in a PV VM (not dom0, sys-net, or sys-usb)?
Is the domU a PVH (the default on Qubes, if no PCI devices)? What CPU is that?
Is the domU a PVH (the default on Qubes, if no PCI devices)?
Yes. All my domU's are PVH (default), except sys-usb and sys-net.
@AlxHnr Can you try sysbench memory run in a PV VM (not dom0, sys-net, or sys-usb)?
VM Type | MiB/sec |
---|---|
PVH | 243.81 |
HVM | 216.34 |
PV | 54.41 |
Giving the PV VM more cores and memory makes no difference. PV VMs are slow and laggy to the point of being unusable.
What CPU is that?
i7-3632qm. It supports VT-d, but my motherboard/BIOS/whatever does not.
I hope this problem is not specific to my setup. My goal here is to get to a point where others can reproduce these problems. I don't have much time and care less about temporary fixes for myself. I care more about achieving sane defaults that work for everybody.
I'm seeing about an 8x slowdown in sysbench memory run on a domU PVH vs. dom0 on my ancient quad Sandy Bridge.
B
Is the domU a PVH (the default on Qubes, if no PCI devices)?
Yes. All my domU's are PVH (default), except sys-usb and sys-net.
@AlxHnr Can you try sysbench memory run in a PV VM (not dom0, sys-net, or sys-usb)?
VM Type | MiB/sec
---|---
PVH | 243.81
HVM | 216.34
PV | 54.41

Giving the PV VM more cores and memory makes no difference. PV VMs are slow and laggy to the point of being unusable.
dom0 is itself a PV VM, so that is strange.
@andyhhp: Do you know what could cause such a huge difference between PV dom0 and PV domU? Are superpages only allowed to be used by dom0?
What CPU is that?
i7-3632qm. It supports VT-d, but my motherboard/BIOS/whatever does not.
I hope this problem is not specific to my setup. My goal here is to get to a point where others can reproduce these problems. I don't have much time and care less about temporary fixes for myself. I care more about achieving sane defaults that work for everybody.
Me too.
Are superpages only allowed to be used by dom0?
PV guests cannot use superpages at all. dom0 doesn't get them either.
Do you know what could cause such a huge difference between PV dom0 and PV domU?
Numbers this bad are usually PV-L1TF, and IvyBridge is affected, but Qubes has SHADOW compiled out, so it's not that. Do you have xl dmesg output from the system? I'm rather lost for ideas.
Are superpages only allowed to be used by dom0?
PV guests cannot use superpages at all. dom0 doesn't get them either.
Makes sense, I see that superpage support on PV got ripped out in 2017. Not surprising in retrospect, considering that at least two of the fatal flaws in PV were due to it.
Do you know what could cause such a huge difference between PV dom0 and PV domU?
Numbers this bad are usually PV-L1TF, and IvyBridge is affected, but Qubes has SHADOW compiled out, so it's not that. Do you have xl dmesg output from the system? I'm rather lost for ideas.
@AlxHnr Can you provide xl dmesg output? That should give the Xen log. Please be sure to redact any sensitive information before posting it.
Numbers this bad are usually PV-L1TF, and IvyBridge is affected, but Qubes has SHADOW compiled out, so it's not that. Do you have xl dmesg from the system? I'm rather lost for ideas.
Just as an aside, under R4.0 on Sandy Bridge xl dmesg says:
PV L1TF shadowing: Dom0 disabled, DomU enabled
Just checked R4.1 on i7-8850H and same result.
B
PV L1TF shadowing: Dom0 disabled, DomU enabled
That’s normal. The L1TF mitigation code enables shadow paging if the hypervisor was built with it, or calls domain_crash() otherwise.
@fepitre can you provide an xl dmesg from a machine that has performance problems under Xen?
Just checked R4.1 on i7-8850H and same result.
i7-8750H here, about the same result. xl dmesg
Just checked R4.1 on i7-8850H and same result.
i7-8750H here, about the same result. xl dmesg
Thanks! Would you mind posting sysbench results?
Hmm - sadly nothing helpful there. Not terribly surprising as it's a release hypervisor, but that's no guarantee that a debug Xen would be any more helpful.
As an unrelated observation, @marmarek you can work around:
(XEN) parameter "no-real-mode" unknown!
by backporting xen-project/xen@e44d986084760 and xen-project/xen@e5046fc6e99db which will silence the spurious warning.
I've tried Xen with spec-ctrl=no but nothing changed (7128.94 MiB/sec in dom0, 769.83 MiB/sec in domU).
Relevant Xen messages:
(XEN) Speculative mitigation facilities:
(XEN) Hardware hints: RSBA
(XEN) Hardware features: IBPB IBRS STIBP SSBD L1D_FLUSH MD_CLEAR SRBDS_CTRL
(XEN) Compiled-in support: INDIRECT_THUNK
(XEN) Xen settings: BTI-Thunk JMP, SPEC_CTRL: IBRS- STIBP- SSBD-, Other: SRB_LOCK-
(XEN) L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 39, Safe address 8000000000
(XEN) Support for HVM VMs: MD_CLEAR
(XEN) Support for PV VMs: MD_CLEAR
(XEN) XPTI (64-bit PV only): Dom0 disabled, DomU disabled (with PCID)
(XEN) PV L1TF shadowing: Dom0 disabled, DomU disabled
When calling sysbench memory run --memory-block-size=16K I get significantly closer numbers (20813.22 MiB/sec in dom0 vs 8135.43 MiB/sec in PVH domU). PV domU performs even worse (5140.72 MiB/sec). The difference between PV dom0 and PV domU surprises me.
When calling sysbench memory run --memory-block-size=16K I get significantly closer numbers (20813.22 MiB/sec in dom0 vs 8135.43 MiB/sec in PVH domU). PV domU performs even worse (5140.72 MiB/sec). The difference between PV dom0 and PV domU surprises me.
It surprises me too. @andyhhp do you have suggestions for debugging this? Is there a way to get stats on TLB misses? I wonder if CPU pinning would help.
[Summary: sysbench's event timing interacts poorly with the high-overhead xen clocksource in PV and some PVH VMs.]
I think we may be seeing a mirage, or rather, a side effect of other system calls being made in parallel with the memory ones.
I played around a bit with strace -f sysbench... and noticed that under domU PV, but not under dom0 PV, I saw an additional 75K lines of strace output with this pattern:
[pid 2717] clock_gettime(CLOCK_MONOTONIC, {tv_sec=10331, tv_nsec=199069018}) = 0
After some additional experimenting and googling, I found that I can get "terrible sysbench results" from PV dom0 by performing the following (as root):
echo "xen" > /sys/devices/system/clocksource/clocksource0/current_clocksource # change to what domU uses
And I can then "restore good sysbench results" from PV dom0 by performing the following (as root):
echo "tsc" > /sys/devices/system/clocksource/clocksource0/current_clocksource # the default for dom0
Here's where it gets even stranger (caveat: testing on two different pieces of hardware):
Under R4.0 (Xen 4.8), PVH domU uses "xen" as the clocksource by default, but it does not have as severe an impact, with performance closer to dom0. Under R4.1 (Xen 4.14), PVH domU uses "xen" as the clocksource by default and appears to be as severely impacted as PV domU, at least on this particular system.
Even more fun: under R4.0, PVH domU only has "xen" available as a clocksource, so I can't reverse the experiment on R4.0. Under R4.1, PVH domU defaults to "xen" but DOES have an available_clocksource of "tsc xen". If I run the echo "tsc" command above inside an R4.1 PVH domU, I suddenly "see good sysbench results".
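The experiments above can be wrapped into one sweep over whatever clocksources a guest offers (a sketch; run as root inside the VM being tested, assumes sysbench is installed, and restores the original clocksource afterwards):

```shell
#!/bin/sh
# Benchmark sysbench memory under each available clocksource.
cs=/sys/devices/system/clocksource/clocksource0
orig=$(cat "$cs/current_clocksource")
for src in $(cat "$cs/available_clocksource"); do
    echo "$src" > "$cs/current_clocksource"
    printf '== clocksource: %s ==\n' "$src"
    sysbench memory run | grep -i 'MiB/sec'
done
echo "$orig" > "$cs/current_clocksource"   # put things back
```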
To reiterate: I don't think this is a memory performance problem.
B
...and noticed that under domU PV, but not under dom0 PV, I saw an additional 75K lines of strace output with this pattern:
[pid 2717] clock_gettime(CLOCK_MONOTONIC, {tv_sec=10331, tv_nsec=199069018}) = 0
Yeah, that is definitely not going to be fast :laughing:. Can you provide concrete numbers? @fepitre would you be willing to see if this helps your problems?
R4.1, invoking sysbench memory run:

dom | VM Type | Clocksource | Measurement
---|---|---|---
dom0 | PV | tsc (default) | 7276.61 MiB/sec
dom0 | PV | xen | 4.79 MiB/sec
domU | PV | xen (default) | 4.56 MiB/sec
domU | PV | tsc | 7012.89 MiB/sec
domU | PVH | xen (default) | 5.19 MiB/sec
domU | PVH | tsc | 7158.07 MiB/sec
Again, working theory is that it's not an actual memory allocation speed issue, but an issue with how sysbench does timing paired with the relatively high-overhead "xen" clocksource.
EDIT: correction to the chart above: tsc is available in R4.1 domU PV (at least on this hardware).
Unsurprisingly, it looks like someone else ran into the same issue nine years ago using sysbench under the kvm-clock clocksource... https://blog.siphos.be/2013/04/comparing-performance-with-sysbench-part-3/
...nothing new under the sun.
After some additional googling, I also want to note that quite a few folks found that for certain workloads in AWS's XEN-based instances over the past 5-10 years (e.g. linux database performance tracking), particularly where timers are used heavily, switching from "xen" to "tsc", when available, had a material impact on performance.
Brendan
Ah. clocksources. An unmitigated set of disasters on native as well as under virt.
For Qubes, you're not migrating VMs, so it's safe to expose the Invariant TSC CPU feature to all guests, which is almost certainly what is triggering the different default between dom0 and domU. Set itsc=1 in the VM config file.
Thanks! That's Xen 4.14; it's nomigrate=1 there. Now it's just a matter of setting it via libvirt...
Urgh. I've been trying to kill nomigrate, and it has a habit of segfaulting libvirt for reasons we never got to the bottom of.
It might be easier to pass cpuid="host:invtsc=1"
May I ask a couple of questions about this?
1. Does the selection of clocksource (xen or tsc) affect only the accuracy of timing, or does it affect actual performance when running sysbench?
2. Will normal users need to manually configure the clocksource after the update, or will the update set the clocksource of all domUs to tsc by default?
@logoerthiner1
1. Accuracy is not affected, as far as I'm aware.
2. No manual configuration required, since Qubes dynamically creates the libvirt XML for every VM from a template XML file. See the above commit for more info.
Accuracy is fine. The performance difference is between gettimeofday() completing in the vDSO without a system call, vs needing a system call.
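One way to observe this from userspace (a sketch; needs strace and python3 inside the guest) is to count how many clock_gettime calls actually reach the kernel. When the vDSO fast path works (tsc), the strace summary shows few or none; when the clocksource forces real syscalls (xen), every read shows up:

```shell
# Count real clock_gettime system calls made by a tight timing loop.
# -c prints a syscall summary instead of each call.
strace -f -c -e trace=clock_gettime \
    python3 -c 'import time
for _ in range(100000):
    time.clock_gettime(time.CLOCK_MONOTONIC)'
```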
It might be easier to pass cpuid="host:invtsc=1"
I tried, looks to be ignored (https://github.com/xen-project/xen/blob/stable-4.14/xen/arch/x86/cpuid.c#L661-L667)
It might be easier to pass cpuid="host:invtsc=1"
I tried, looks to be ignored (https://github.com/xen-project/xen/blob/stable-4.14/xen/arch/x86/cpuid.c#L661-L667)
:disappointed: This is bringing back the scars of trying to fix the mess. Begrudgingly, yes, use nomigrate on 4.14. You will have to change it when you move to a newer Xen.
Ok, I've set nomigrate and confirmed that guest sees INVTSC bit set. But Linux still chooses "xen" clocksource by default :/
I can't find any part of Linux that would use INVTSC to affect clocksource choice. All I see is "rating" - "tsc" has 400, "xen" has 500. And the highest available wins.
I can't find any part of Linux that would use INVTSC to affect clocksource choice. All I see is "rating" - "tsc" has 400, "xen" has 500. And the highest available wins.
Ad-hoc kernel patch time? (yes, yuck)
Nope, clocksource=tsc on the kernel cmdline.
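For a qube booted with a dom0-provided kernel, that could be done per-VM via the kernelopts property (a sketch; "personal" is a placeholder qube name, and this assumes the qvm-prefs property syntax of R4.x rather than the exact change Qubes ultimately shipped):

```shell
# Append clocksource=tsc to an existing qube's kernel options (dom0 shell).
qvm-prefs personal kernelopts "$(qvm-prefs personal kernelopts) clocksource=tsc"
```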
I stumbled on this painful thread from ~15 years ago w/r/t the xen vs tsc timers in VMs, granted from an era where it appears the tsc clocksource was only beginning to be reliable across cores.
https://sourceforge.net/p/kvm/mailman/kvm-devel/thread/47267832.1060003%40zytor.com/?page=0
My empathy for @andyhhp (well, all involved in Xen and the Xen<->Linux border) just increased 10-fold.
B
@marmarek - I see you're applying the clocksource=tsc change to stubdoms as well. I'm curious about what you found in testing?
B
That's mostly in the hope of improving audio quality (pulseaudio reads the clock very often) when using an emulated sound card for an HVM (with Windows, for example).
That's mostly in the hope of improving audio quality (pulseaudio reads the clock very often) when using an emulated sound card for an HVM (with Windows, for example).
Ah, yeah, that'd be a nice unexpected win.
B
Automated announcement from builder-github
The component linux-kernel-latest (including package kernel-latest-5.17.4-2.fc25.qubes) has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component linux-kernel-latest (including package kernel-latest-5.17.4-2.fc32.qubes) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component linux-kernel (including package kernel-5.10.112-1.fc32.qubes) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
That's mostly in the hope of improving audio quality (pulseaudio reads the clock very often) when using an emulated sound card for an HVM (with Windows, for example).
Ah, yeah, that'd be a nice unexpected win.
Hmm, Windows audio is still crackly after this morning's kernel updates, which seem to have moved Qubes template-based VMs to current_clocksource=tsc.
Ah wait... the stub domain change isn't yet pushed to dom0; I still see current_clocksource=xen in the stubdomain for Windows.
B
In my environment the sound became perfect :)
I was mainly hoping it would help the USB webcam, but alas, no real progress there on Win10.
Automated announcement from builder-github
The component vmm-xen-stubdom-linux (including package xen-hvm-stubdom-linux-1.2.4-1.fc32) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Hmm. R4.1 Win 10 audio is still crackly after the just-released stubdomain update as well, and in the stubdomain I see current_clocksource=tsc.
It gets less crackly (but not perfect) if I busybox renice -n -30 $pulseaudio_pid in the stubdomain.
I haven't rebuilt QWT since the December 2021 tabit-pro repo content, so perhaps I need to do so for improved audio as well?
B
I've just installed kernel-5.10.112-1.fc32.qubes and xen-hvm-stubdom-linux-1.2.4-1.fc32. This significantly improves the results of synthetic benchmarks like sysbench memory run, but it doesn't fix the performance problems we are facing:
make defconfig for x86_64:

Environment | Minutes
---|---
Fedora 35 (native) | 6:42
Fedora 35 (Qubes AppVM) | 13:09
Please reopen this ticket.
Qubes OS release
R4.1
Brief summary
The Xen hypervisor has performance problems on certain compute-intensive workloads
Steps to reproduce
See @fepitre for details
Expected behavior
Same (or almost the same) performance as on bare hardware
Actual behavior
Worse performance than on bare hardware