HVM is significantly slower than PVH (Xen 4.14) - Githubissues

QubesOS / qubes-issues

The Qubes OS Project issue tracker

https://www.qubes-os.org/doc/issue-tracking/

532 stars 46 forks source link

HVM is significantly slower than PVH (Xen 4.14) #6174

Open marmarek opened 3 years ago

marmarek commented 3 years ago

Observation

openQA test in scenario qubesos-4.1-pull-requests-x86_64-system_tests_pvgrub_salt_storage@64bit fails in TC_41_HVMGrub_fedora-32

qubes.exc.QubesVMError: Cannot connect to qrexec agent for 90 seconds, see /var/log/xen/console/guest-test-inst-vm1.log for details

Possibly related VM console entries:

[2020-11-01 05:08:54] [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.8.16-200.fc32.x86_64 root=/dev/xvda3 ro root=/dev/mapper/dmroot console=tty0 console=hvc0 swiotlb=8192 noresume xen_scrub_pages=0
...
[2020-11-01 05:09:52] [[0;32m  OK  [0m] Found device [0;1;39m/dev/mapper/dmroot[0m.
[2020-11-01 05:09:52] [[0;32m  OK  [0m] Reached target [0;1;39mInitrd Root Device[0m.
[2020-11-01 05:09:53] [[0;32m  OK  [0m] Finished [0;1;39mdracut initqueue hook[0m.
[2020-11-01 05:09:53] [[0;32m  OK  [0m] Reached target [0;1;39mRemote File Systems (Pre)[0m.
[2020-11-01 05:09:53] [[0;32m  OK  [0m] Reached target [0;1;39mRemote File Systems[0m.
[2020-11-01 05:09:53]          Starting [0;1;39mdracut pre-mount hook[0m...
[2020-11-01 05:09:53] [   56.786209] audit: type=1130 audit(1604207392.968:15): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-initqueue comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2020-11-01 05:09:53] [[0;32m  OK  [0m] Finished [0;1;39mdracut pre-mount hook[0m.
[2020-11-01 05:09:53]          Starting [0;1;39mFile System Check on /dev/mapper/dmroot[0m...
[2020-11-01 05:09:53] [   57.021420] audit: type=1130 audit(1604207393.190:16): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-mount comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2020-11-01 05:09:53] [[0;32m  OK  [0m] Finished [0;1;39mFile System Check on /dev/mapper/dmroot[0m.
[2020-11-01 05:09:53] [   57.516444] audit: type=1130 audit(1604207393.708:17): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-fsck-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[2020-11-01 05:09:54]          Mounting [0;1;39m/sysroot[0m...
[2020-11-01 05:09:54] [   57.852351] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
[2020-11-01 05:09:54] [[0;32m  OK  [0m] Mounted [0;1;39m/sysroot[0m.
[2020-11-01 05:09:54] [[0;32m  OK  [0m] Reached target [0;1;39mInitrd Root File System[0m.
[2020-11-01 05:09:54]          Starting [0;1;39mReload Configuration from the Real Root[0m...
[2020-11-01 05:09:54] [   58.138307] audit: type=1334 audit(1604207394.330:18): prog-id=5 op=UNLOAD
[2020-11-01 05:09:54] [   58.176370] audit: type=1334 audit(1604207394.346:19): prog-id=4 op=UNLOAD
[2020-11-01 05:09:54] [   58.200801] audit: type=1334 audit(1604207394.346:20): prog-id=3 op=UNLOAD
[2020-11-01 05:09:54] [   58.228330] audit: type=1334 audit(1604207394.360:21): prog-id=7 op=UNLOAD
[2020-11-01 05:09:54] [   58.252140] audit: type=1334 audit(1604207394.361:22): prog-id=6 op=UNLOAD

Note double root=. That isn't necessary the root cause.

The same test works on debian-10.

Test suite description

Setup fedora-32 StandaloneVM HVM with 'kernel' set to none.

Reproducible

Fails since (at least) Build 2020103122-4.1 (current job)

Expected result

Last good: 2020103116-4.1 (or more recent)

Further details

Always latest result in this scenario: latest

marmarek commented 3 years ago

I cannot reproduce it locally, but it seems HVM is significantly slower than PVH (at least its boot time). This may be slow enough on openQA (thanks to nested virt) to hit the timeout.

icequbes1 commented 3 years ago

Attempted to test this, just providing observations:

The very first boot, the VM did not start, but the errors observed on the console involved systemd-udevd invoking oom-killer 7 seconds after boot.

Unsure if related as the logs do not line up with what is originally posted. Attached: fed32-hvm-oom.txt

Could not reproduce after the very first boot, even with changing memory values, keeping in-vm kernel:

64MB: locked memory results in Triple fault in hypervisor.log
128MB: kernel page faults 2 seconds after kernel boot
256MB: successful boot (though 100MB of swap in use in-vm right after boot)
other values: no issues booting, though cosmetic issues with QubesOS/updates-status#2166 observed only for in-kernel HVM.

marmarek commented 3 years ago

Ok, so this is OOM in initramfs, where the swap is not enabled yet. And apparently now Fedora is running fsck there... Perhaps we should adjust our dracut module to enable swap in initramfs already? AFAIR @DemiMarie suggested that already in a private email. While I don't want to mess too much with the system setup from dom0-provided kernel (as it should work with any distribution, so there is a great risk for incompatibilities), but for initramfs generated by Fedora's dracut inside that VM it should be fine.

That said, I now did once got oom-killer killing systemd-udevd in fedora-32 based VM (after initramfs phase, but still before enabling swap), with dom0-provided kernel.

Where are the times when Linux was happy with 128MB RAM in total...

DemiMarie commented 3 years ago

I suspect it largely depends on config options. There are Linux devices (Azure Sphere) running happily with 4MiB of RAM. And fsck is known to be somewhat RAM intensive, especially on large partitions. That’s why OpenBSD turns on swap first, and I think we should do the same.

DemiMarie commented 3 years ago

Ideally, we should enable swap before doing just about anything else.

marmarek commented 3 years ago

Swap may be some factor here, but definitely not the only one. HVM seems to boot significantly slower even with the same kernel from dom0. Here are two boot logs, one PVH, the other one HVM:

PVH took 6s to boot
HVM took 24s to boot

Part of it is because of stubdomain startup (10s between Logfile Opened line and the first log from the kernel), but it still doesn't explain the whole difference.

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package linux-utils has been pushed to the r4.1 testing repository for the CentOS centos8 template. To test this update, please install it with the following command:

sudo yum update --enablerepo=qubes-vm-r4.1-current-testing

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package python3-qubesimgconverter-4.1.12-1.fc32 has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package qubes-utils_4.1.12 has been pushed to the r4.1 testing repository for the Debian template. To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The component linux-utils (including package python3-qubesimgconverter-4.1.12-1.fc32) has been pushed to the r4.1 testing repository for the Fedora template. To test this update, please install it with the following command:

sudo yum update --enablerepo=qubes-vm-r4.1-current-testing

Changes included in this update

DemiMarie commented 3 years ago

I seem to remember a bug that prevented Linux from recognizing more than 1 vCPU in HVM mode, if over a certain amount of memory was supplied.

icequbes1 commented 3 years ago

Very crude timings observed starting an HVM Fedora 32 qube (400MB memory, not memory balanced, no PCI devices):

Run: qvm-run hvm

t=0, Xen starts emitting stuff about the new HVM domain
t=0 through t=6: xenstored in dom0 observed to be taking up most of the CPU %
t=6, Xen HVM Loader logs start
t=6, stubdom dm kernel logs/init start
t=6, qemu executed in stubdom
t=13: in stubdom log
- random: crng init done
- {"timestamp": {..., "event":"DEVICE_DELETED", "data": {"path": "/machine/peripheral-anon/device[']"}}
- IFS= read -r line
t=14: kernel for target domain starts
t=15: kernel starts executing /init, Qubes initramfs script
t=20: console of guest/[press ENTER to login]

For the last 2 items, same amount of time was observed for PVH, so most of the delays are in:

6 second delay while xenstored and lvm stuff gets setup when first starting
7 second delay from qemu invoke til non-stub domain kernel logs are seen

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package linux-utils has been pushed to the r4.1 stable repository for the CentOS centos8 template. To install this update, please use the standard update command:

sudo yum update

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package qubes-utils_4.1.12+deb10u1 has been pushed to the r4.1 stable repository for the Debian template. To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The component linux-utils (including package python3-qubesimgconverter-4.1.12-1.fc32) has been pushed to the r4.1 stable repository for the Fedora template. To install this update, please use the standard update command:

sudo yum update

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package python3-qubesimgconverter-4.1.12-1.fc32 has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

marmarek commented 3 years ago

@easydozen the easiest way to notice it, is to compare start time of the same VM, with virt_mode=pvh and virt_mode=hvm. As you can see in comments above, the difference is rather drastic. In my case, I can blame stubdomain startup for at most 10s (which also is bad), but the difference is much bigger.

easydozen commented 3 years ago

@marmarek right now it's 13s for hvm and 5s for pvh in my case, but my environment a bit different due to tests and experiments. I'll try to reproduce regression.

marmarek commented 3 years ago

Another data point, from R4.0 (don't compare directly with the other one, this is different hardware):

PVH: 13s
HVM: 29s

So, the difference is significant here too, but not 4x.

easydozen commented 3 years ago

On Qubes-20210327-x86_64.iso it seems like cpu frequency scaling issue. I saw something similar on 4.0 in a context of sleep resume. Since that I'm using echo "xen-acpi-processor" > /etc/modules-load.d/xenacpi.conf.

modprobe xen-acpi-processor should be helpful as a fastfix but I'm not sure if it's a proper way - maybe this https://wiki.xenproject.org/wiki/Talk:Xen_power_management old reasoning could give us a clue.

marmarek commented 3 years ago

Not loading it is a bug: https://github.com/QubesOS/qubes-issues/issues/6465

marmarek commented 3 years ago

Major debugging progress: https://lore.kernel.org/xen-devel/YHjv4DGpEmyJjpei@mail-itl/T/#u

easydozen commented 3 years ago

I'm willing to help here but can't reproduce the problem (if its not a xen-acpi-processor bug). Even on a rather old hardware (T450s) boot time doesn't seem unusual for me: PVH: 15s HVM: 33s

X1 Carbon (gen7): PVH: 5s HVM: 13s

I'm not sure if its worth mentioning, but increasing vm' initial memory can save several seconds of boot time.

easydozen commented 3 years ago

New patch provides an impressive improvement. It takes only 7s to boot HVM with 2Gb initial memory on X1 Carbon (gen7).

marmarek commented 3 years ago

Yes, that's the biggest speedup in the series. That said, I still have two more to do - see the summary in the above linked email. One part is posted already at https://lore.kernel.org/xen-devel/cover.3a5d506462133586bd805b72a226916af6a33799.1619482896.git-series.marmarek@invisiblethingslab.com/T/#t (but requires changes in libvirt xml template to really be used)

marmarek commented 3 years ago

This is still a problem with a kernel from inside of the VM: https://gist.github.com/marmarek/32cd8dc2486b75b410f37238245c1110

DemiMarie commented 3 years ago

This is still a problem with a kernel from inside of the VM: https://gist.github.com/marmarek/32cd8dc2486b75b410f37238245c1110

Looks like Linux is spending 8 seconds generating entropy. Can we provide an external (virtio-random?) device for this? It would be seeded directly from dom0, perhaps over a dedicated vchan.

icequbes1 commented 3 years ago

I wonder if this is a separate issue (or another qemu artifact?); I am not seeing similar logs with respect to crng in my R4.1 environment.

In all of my guest kernel logs in a Fedora 33 VM, I observe:

In-VM kernel (5.12.7-200.fc33.x86_64):

[0.027xxx] random: get_random_u64 called from kmem_cache_open+0x2a/0x560 with crng_init=0
[0.048xxx] random: crng done (trusting CPU's manufacturer)

dom0 kernel-latest (5.12.5-1.fc32):

[0.059xxx] random: get_random_u64 called from kmem_cache_open+0x1e/0x220 with crng_init=0
[0.083xxx] random: crng done (trusting CPU's manufacturer)

The only time I have observed a "long" delay with respect to crng things is in the stubdomain with dom0 kernel, prior to the DMA fix - this was 6-8 seconds. With the DMA fix, this delay was reduced to 2 seconds, so I'm curious how changing the loading of the guest kernel affected RNG things.

With (QubesOS/qubes-vmm-xen-stubdom-linux@d868386aa7f63) "DMA fix", the startup delay has reduced significantly!

Crude timings of time qvm-start vm:

20s: R4.1, without DMA fix, HVM, dom0 kernel, 400MB/no balancing
14s: R4.1, with DMA fix, HVM, dom0 kernel, 400MB/no balancing
22s: R4.1, without DMA fix, HVM, in-vm kernel, 400MB/no balancing
22s: R4.1, with DMA fix, HVM, in-vm kernel, 400MB/no balancing
32s: R4.1, without DMA fix, HVM, dom0 kernel, 400MB/with balancing
25s: R4.1, with DMA fix, HVM, dom0 kernel, 400MB/with balancing
24s: R4.1, without DMA fix, HVM, in-vm kernel, 400MB/with balancing
25s: R4.1, with DMA fix, HVM, in-vm kernel, 400MB/with balancing
09s: R4.1, PVH, dom0 kernel, 400MB/no balancing
13s: R4.1, PVH, dom0 kernel, 400MB/with balancing

My R4.0 is on different hardware, so numbers should not be compared directly:

18s: R4.0, without DMA fix, HVM, dom0 kernel, 400MB/no balancing
16s: R4.0, without DMA fix, HVM, in-vm kernel, 400MB/no balancing
23s: R4.0, without DMA fix, HVM, dom0 kernel, 400MB/with balancing
21s: R4.0, without DMA fix, HVM, in-vm kernel, 400MB/with balancing
07s: R4.0, PVH, dom0 kernel, 400MB/no balancing
11s: R4.0, PVH, dom0 kernel, 400MB/with balancing

Weak conclusions that can be made for HVMs:

The DMA fix provides a good improvement for dom0 kernels
But with memory balancing, performance is similar to in-vm kernels
Even with PVH, the memory balancing impact is significant

I avoid making many conclusions as there are likely other variables and my data collection was very crude - this serves primarily as one data point in evaluating the updates.

marmarek commented 3 years ago

[0.048xxx] random: crng done (trusting CPU's manufacturer)

This! I've just rechecked the failed log, and I don't see trusting CPU's manufacturer part there. And indeed that CPU does not support RDRAND. This means, the extreme issue I see, applies only to quite old systems (and hopefully does not affect majority of our users - even good old x230 already has RDRAND). So, I'm lowering the priority. But it's still worth improving the situation.

marmarek commented 3 years ago

One more thing worth testing: iommu=noforce on the VM's cmdline. Linux tries to reserve 64MB for a bounce buffer for DMA, based on some unclear to me heuristic. If a VM has PCI device assigned, we add swiotlb=8192 argument, which lowers it to 16MB. But in fact, I think, this should not be needed at all, since we do require working IOMMU (I think it was needed in practice only for PV domains with PCI devices). So, iommu=noforce should (in theory) free significant (compared to 400MB) part of VM's memory.

icequbes1 commented 3 years ago

From quick testing iommu=noforce didn't appear to make a difference in VM startup time. I believe some reference indicated iommu=noforce was the default on Intel machines, but I don't know enough about what is being tested/modified.

What does appear to be constant is a ~4 second pause during Waiting for /dev/xvda* devices (with dom0-provided kernel) however I see that on R4.0 too, but I don't want to lose focus on the R4.1 HVM issue.

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package xen-hvm-stubdom-linux-1.2.0-1.fc32 has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot commented 3 years ago

Automated announcement from builder-github

The package xen-hvm-stubdom-linux-1.2.0-1.fc32 has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

adrelanos commented 2 years ago

[0.048xxx] random: crng done (trusting CPU's manufacturer)

This! I've just rechecked the failed log, and I don't see trusting CPU's manufacturer part there. And indeed that CPU does not support RDRAND. This means, the extreme issue I see, applies only to quite old systems (and hopefully does not affect majority of our users - even good old x230 already has RDRAND). So, I'm lowering the priority. But it's still worth improving the situation.

Strongly discouraged to rely on RDRAND for security / entropy quality anyhow as per: https://www.whonix.org/wiki/Dev/Entropy#RDRAND

marmarek commented 2 years ago

Strongly discouraged to rely on RDRAND for security / entropy quality anyhow as per:

In context of this issue, it is not a problem, because stubdomain does not use RNG for any security critical task. There is not crypto involved etc. One could argue it may make ASLR for qemu less effective, but we don't consider qemu trusted, so it is not a huge deal (and remember the RDRAND issues are still very hypothetical - see below).

In a broader context of RDRAND, I don't think we should worry about backdoors there. Or rather: if you consider intentional backdoors in your CPU a valid threat, throw away that CPU. There is no really a difference how such hypothetical backdoor could work - whether that would be predictable RDRAND, reacting to some magic values to any other instruction, or anything else. We could worry about its effectiveness - not intentional bugs, which indeed is hard to reason about, since its being opaque.