[Open] marmarek opened this issue 3 years ago
I cannot reproduce it locally, but it seems HVM is significantly slower than PVH (at least in boot time). This may be slow enough on openQA (thanks to nested virt) to hit the timeout.
Attempted to test this, just providing observations:
On the very first boot, the VM did not start; the errors observed on the console showed systemd-udevd invoking the oom-killer 7 seconds after boot.
Unsure if this is related, as the logs do not line up with what was originally posted. Attached: fed32-hvm-oom.txt
I could not reproduce it after the very first boot, even when changing memory values and keeping the in-VM kernel:
Ok, so this is OOM in the initramfs, where swap is not enabled yet. And apparently Fedora now runs fsck there... Perhaps we should adjust our dracut module to enable swap already in the initramfs? AFAIR @DemiMarie suggested that already in a private email. While I don't want to mess too much with the system setup from the dom0-provided kernel (it should work with any distribution, so there is a significant risk of incompatibilities), for an initramfs generated by Fedora's dracut inside that VM it should be fine.
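A rough sketch of what such a dracut module could look like (the module name, hook priority, and exact swap device are illustrative assumptions - in Qubes VMs swap normally lives on /dev/xvdc1):

```
# /usr/lib/dracut/modules.d/91qubes-early-swap/module-setup.sh  (hypothetical module)
check()   { return 0; }
depends() { echo ""; }
install() {
    inst_multiple swapon
    # run our hook before the root filesystem is fsck'ed and mounted
    inst_hook pre-mount 01 "$moddir/enable-swap.sh"
}

# $moddir/enable-swap.sh - enable the VM's swap early so fsck has room to work
#!/bin/sh
[ -b /dev/xvdc1 ] && swapon /dev/xvdc1 || true
```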
That said, I did now once get the oom-killer killing systemd-udevd in a fedora-32-based VM (after the initramfs phase, but still before enabling swap), with the dom0-provided kernel.
Where are the times when Linux was happy with 128MB RAM in total...
I suspect it largely depends on config options. There are Linux devices (Azure Sphere) running happily with 4MiB of RAM. And fsck is known to be somewhat RAM intensive, especially on large partitions. That’s why OpenBSD turns on swap first, and I think we should do the same.
Ideally, we should enable swap before doing just about anything else.
Swap may be a factor here, but it's definitely not the only one. HVM seems to boot significantly slower even with the same kernel from dom0. Here are two boot logs, one PVH, the other HVM:
Part of it is due to stubdomain startup (10s between the Logfile Opened line and the first log message from the kernel), but that still doesn't explain the whole difference.
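(For reference, the two timestamps can be compared in the console logs kept in dom0 - the paths below are the usual Qubes locations, with the VM name as a placeholder:)

```
# stubdomain (qemu) console log - the "Logfile Opened" timestamp comes from here
sudo less /var/log/xen/console/guest-<vmname>-dm.log
# guest kernel console log - the first kernel message marks the end of stubdomain startup
sudo less /var/log/xen/console/guest-<vmname>.log
```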
Automated announcement from builder-github
The package linux-utils has been pushed to the r4.1 testing repository for the CentOS centos8 template.
To test this update, please install it with the following command:
sudo yum update --enablerepo=qubes-vm-r4.1-current-testing
Automated announcement from builder-github
The package python3-qubesimgconverter-4.1.12-1.fc32 has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The package qubes-utils_4.1.12 has been pushed to the r4.1 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or the appropriate equivalent for your template version), then use the standard update command:
sudo apt-get update && sudo apt-get dist-upgrade
Automated announcement from builder-github
The component linux-utils (including package python3-qubesimgconverter-4.1.12-1.fc32) has been pushed to the r4.1 testing repository for the Fedora template.
To test this update, please install it with the following command:
sudo yum update --enablerepo=qubes-vm-r4.1-current-testing
I seem to remember a bug that prevented Linux from recognizing more than 1 vCPU in HVM mode when more than a certain amount of memory was assigned.
Very crude timings observed starting an HVM Fedora 32 qube (400MB memory, not memory balanced, no PCI devices):
Run: qvm-run hvm
random: crng init done
{"timestamp": {..., "event":"DEVICE_DELETED", "data": {"path": "/machine/peripheral-anon/device[']"}}
IFS= read -r line
Qubes initramfs script
[press ENTER to login]
For the last 2 items, the same amount of time was observed for PVH, so most of the delay is in:
Automated announcement from builder-github
The package linux-utils has been pushed to the r4.1 stable repository for the CentOS centos8 template.
To install this update, please use the standard update command:
sudo yum update
Automated announcement from builder-github
The package qubes-utils_4.1.12+deb10u1 has been pushed to the r4.1 stable repository for the Debian template.
To install this update, please use the standard update command:
sudo apt-get update && sudo apt-get dist-upgrade
Automated announcement from builder-github
The component linux-utils (including package python3-qubesimgconverter-4.1.12-1.fc32) has been pushed to the r4.1 stable repository for the Fedora template.
To install this update, please use the standard update command:
sudo yum update
Automated announcement from builder-github
The package python3-qubesimgconverter-4.1.12-1.fc32 has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
@easydozen the easiest way to notice it is to compare the start time of the same VM with virt_mode=pvh and with virt_mode=hvm. As you can see in the comments above, the difference is rather drastic. In my case, I can blame stubdomain startup for at most 10s (which is also bad), but the difference is much bigger.
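A minimal way to run that comparison (the qube name is a placeholder, and the qube must be halted before switching modes):

```
vm=fedora-32-test                      # any non-running test qube
qvm-prefs $vm virt_mode pvh && time qvm-start $vm
qvm-shutdown --wait $vm
qvm-prefs $vm virt_mode hvm && time qvm-start $vm
```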
@marmarek right now it's 13s for HVM and 5s for PVH in my case, but my environment is a bit different due to tests and experiments. I'll try to reproduce the regression.
Another data point, from R4.0 (don't compare directly with the other one, this is different hardware):
So, the difference is significant here too, but not 4x.
On Qubes-20210327-x86_64.iso it looks like a CPU frequency scaling issue. I saw something similar on 4.0 in the context of sleep/resume. Since then I'm using echo "xen-acpi-processor" > /etc/modules-load.d/xenacpi.conf.
modprobe xen-acpi-processor should help as a quick fix, but I'm not sure if it's the proper way - maybe the old reasoning at https://wiki.xenproject.org/wiki/Talk:Xen_power_management could give us a clue.
Not loading it is a bug: https://github.com/QubesOS/qubes-issues/issues/6465
Major debugging progress: https://lore.kernel.org/xen-devel/YHjv4DGpEmyJjpei@mail-itl/T/#u
I'm willing to help here but can't reproduce the problem (if it's not a xen-acpi-processor bug). Even on rather old hardware (T450s) the boot time doesn't seem unusual to me: PVH: 15s, HVM: 33s
X1 Carbon (gen7): PVH: 5s, HVM: 13s
I'm not sure if it's worth mentioning, but increasing the VM's initial memory can save several seconds of boot time.
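(In case it helps anyone reproducing this: the initial memory can be raised with qvm-prefs - the VM name below is a placeholder:)

```
qvm-prefs <vmname> memory 2000   # initial memory in MB; takes effect on the next start
```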
The new patch provides an impressive improvement. It takes only 7s to boot an HVM with 2GB of initial memory on the X1 Carbon (gen7).
Yes, that's the biggest speedup in the series. That said, I still have two more to do - see the summary in the above-linked email. One part is already posted at https://lore.kernel.org/xen-devel/cover.3a5d506462133586bd805b72a226916af6a33799.1619482896.git-series.marmarek@invisiblethingslab.com/T/#t (but it requires changes in the libvirt XML template to really be used).
This is still a problem with a kernel from inside of the VM: https://gist.github.com/marmarek/32cd8dc2486b75b410f37238245c1110
Looks like Linux is spending 8 seconds generating entropy. Can we provide an external (virtio-random?) device for this? It would be seeded directly from dom0, perhaps over a dedicated vchan.
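For reference, QEMU already ships a virtio RNG device that could act as such an external source; whether and how to plumb it through the Qubes stubdomain (or a dedicated vchan instead) is the open question. The standard QEMU wiring looks like this:

```
# rng-random backend fed from the host's /dev/urandom, exposed as a virtio-rng PCI device
qemu-system-x86_64 ... \
    -object rng-random,id=rng0,filename=/dev/urandom \
    -device virtio-rng-pci,rng=rng0
```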
I wonder if this is a separate issue (or another qemu artifact?); I am not seeing similar logs with respect to crng in my R4.1 environment.
In all of my guest kernel logs in a Fedora 33 VM, I observe:
In-VM kernel (5.12.7-200.fc33.x86_64):
[0.027xxx] random: get_random_u64 called from kmem_cache_open+0x2a/0x560 with crng_init=0
[0.048xxx] random: crng done (trusting CPU's manufacturer)
dom0 kernel-latest (5.12.5-1.fc32):
[0.059xxx] random: get_random_u64 called from kmem_cache_open+0x1e/0x220 with crng_init=0
[0.083xxx] random: crng done (trusting CPU's manufacturer)
The only time I have observed a "long" delay with respect to crng is in the stubdomain with the dom0 kernel, prior to the DMA fix - this was 6-8 seconds. With the DMA fix, this delay was reduced to 2 seconds, so I'm curious how changing the way the guest kernel is loaded affected the RNG behavior.
With the "DMA fix" (QubesOS/qubes-vmm-xen-stubdom-linux@d868386aa7f63), the startup delay has been reduced significantly!
Crude timings of time qvm-start vm:
20s: R4.1, without DMA fix, HVM, dom0 kernel, 400MB/no balancing
14s: R4.1, with DMA fix, HVM, dom0 kernel, 400MB/no balancing
22s: R4.1, without DMA fix, HVM, in-vm kernel, 400MB/no balancing
22s: R4.1, with DMA fix, HVM, in-vm kernel, 400MB/no balancing
32s: R4.1, without DMA fix, HVM, dom0 kernel, 400MB/with balancing
25s: R4.1, with DMA fix, HVM, dom0 kernel, 400MB/with balancing
24s: R4.1, without DMA fix, HVM, in-vm kernel, 400MB/with balancing
25s: R4.1, with DMA fix, HVM, in-vm kernel, 400MB/with balancing
09s: R4.1, PVH, dom0 kernel, 400MB/no balancing
13s: R4.1, PVH, dom0 kernel, 400MB/with balancing
My R4.0 is on different hardware, so numbers should not be compared directly:
Weak conclusions that can be made for HVMs:
I avoid making many conclusions as there are likely other variables and my data collection was very crude - this serves primarily as one data point in evaluating the updates.
[0.048xxx] random: crng done (trusting CPU's manufacturer)
This! I've just rechecked the failed log, and I don't see the trusting CPU's manufacturer part there. And indeed, that CPU does not support RDRAND. This means the extreme issue I see applies only to quite old systems (and hopefully does not affect the majority of our users - even the good old x230 already has RDRAND). So, I'm lowering the priority. But it's still worth improving the situation.
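(For anyone wondering whether their CPU is affected: RDRAND support shows up as a CPU flag, e.g.:)

```
grep -qw rdrand /proc/cpuinfo && echo "RDRAND supported" || echo "no RDRAND"
```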
One more thing worth testing: iommu=noforce on the VM's cmdline. Linux tries to reserve 64MB for a DMA bounce buffer, based on some heuristic that is unclear to me. If a VM has a PCI device assigned, we add the swiotlb=8192 argument, which lowers it to 16MB. But in fact, I think this should not be needed at all, since we require a working IOMMU anyway (I think it was needed in practice only for PV domains with PCI devices). So, iommu=noforce should (in theory) free a significant (compared to 400MB) part of the VM's memory.
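A quick way to try that on a VM booted with the dom0-provided kernel (a sketch; the VM name and existing options are placeholders):

```
qvm-prefs <vmname> kernelopts                                   # note the current options first
qvm-prefs <vmname> kernelopts "<existing options> iommu=noforce"  # append iommu=noforce to them
qvm-start <vmname>
```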
From quick testing, iommu=noforce didn't appear to make a difference in VM startup time. I believe some reference indicated iommu=noforce was the default on Intel machines, but I don't know enough about what is being tested/modified.
What does appear to be constant is a ~4 second pause during Waiting for /dev/xvda* devices (with the dom0-provided kernel); however, I see that on R4.0 too, and I don't want to lose focus on the R4.1 HVM issue.
Automated announcement from builder-github
The package xen-hvm-stubdom-linux-1.2.0-1.fc32 has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The package xen-hvm-stubdom-linux-1.2.0-1.fc32 has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
It is strongly discouraged to rely on RDRAND for security / entropy quality anyhow, as per: https://www.whonix.org/wiki/Dev/Entropy#RDRAND
In the context of this issue, it is not a problem, because the stubdomain does not use the RNG for any security-critical task. There is no crypto involved, etc. One could argue it may make ASLR for qemu less effective, but we don't consider qemu trusted, so it is not a huge deal (and remember, the RDRAND issues are still very hypothetical - see below).
In the broader context of RDRAND, I don't think we should worry about backdoors there. Or rather: if you consider intentional backdoors in your CPU a valid threat, throw away that CPU. There is really no difference in how such a hypothetical backdoor could work - whether that would be a predictable RDRAND, reacting to some magic values in any other instruction, or anything else. We could worry about its effectiveness - non-intentional bugs - which indeed is hard to reason about, since the hardware is opaque.
Observation
openQA test in scenario qubesos-4.1-pull-requests-x86_64-system_tests_pvgrub_salt_storage@64bit fails in TC_41_HVMGrub_fedora-32
Possibly related VM console entries:
Note the double root=. That isn't necessarily the root cause. The same test works on debian-10.
Test suite description
Setup fedora-32 StandaloneVM HVM with 'kernel' set to none.
Reproducible
Fails since (at least) Build 2020103122-4.1 (current job)
Expected result
Last good: 2020103116-4.1 (or more recent)
Further details
Always latest result in this scenario: latest