QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Xen w/ IOMMU is broken on some AMD Ryzen CPUs unless SMT is disabled in BIOS, causing Dom0 boot hanging with NVMe timeouts #8136

Open biergaizi opened 1 year ago

biergaizi commented 1 year ago

Qubes OS release

Qubes release 4.1.1 (R4.1)

Brief summary

Due to an upstream Xen issue [1] - currently with no documentation or even a proper upstream bug report - on some AMD Ryzen CPUs / motherboards, IOMMU malfunctions on Xen. One symptom of a broken IOMMU is a system hang during boot at initramfs's splash screen, with "nvme0: I/O 0 QID 0 timeout, completion polled" messages. Other users have also reported boot hanging when using the Qubes installation disc.

One workaround is disabling SMT (hyperthreading) in BIOS. This is harmless in Qubes since Qubes does not use SMT, but without documentation, it's extremely difficult to find this workaround. I spent half an hour searching for this error message before finding a forum post mentioning SMT. The question was also raised on a Xen mailing list but received no response, which suggests the problem needs to be worked on upstream first.

This is likely a duplicate of #7620, #7570 or other previously reported issues that I'm not familiar with. However, the disable-SMT workaround has not appeared in any of the existing bug reports that I'm aware of. All the existing reports were also hardware-specific, but now it's clear that it's a systematic issue. Thus, I propose that it should be treated as a separate lack-of-documentation bug report. Other workarounds, like the Xen options dom0_max_vcpus=1 dom0_vcpus_pin, should also be documented.

Affected Hardware

Some examples include:

  1. Ryzen 9 6900HS mobile CPU (2022 G14 GA402RK laptop). [2]
  2. AMD 5700X desktop CPU, multiple cases on multiple motherboards. [2]
  3. Ryzen 7 6800U (GPD Win Max 2 laptop). [3]
  4. Unspecified Zen 3 CPU with Asus Pro WS 565-ACE motherboard (X570 chipset), official Xen mailing list report. [1]

Steps to reproduce

  1. Install QubesOS onto an NVMe SSD on an Intel motherboard.

  2. Move QubesOS to an AMD AM4 motherboard with an X399 or X570 chipset and a Ryzen 5000 series CPU (Zen 3) installed.

  3. Boot to NVMe. To make the error messages visible, disable the plymouth splash screen by running the following commands as root:

    echo 'omit_dracutmodules+=" plymouth "' > /etc/dracut.conf.d/disable-plymouth.conf
    cd /boot
    dracut --force

  4. Enable IOMMU in BIOS.

  5. Reboot to NVMe.

OR

  1. Boot QubesOS installer on the same AMD hardware (I didn't test it, but it was mentioned in a forum post).

Expected behavior

Boot should continue without hanging, the LUKS passphrase prompt should appear, and one should be able to enter QubesOS after typing the passphrase.

Actual behavior

initramfs hangs at splash screen. If plymouth is disabled, after waiting for 3 to 5 minutes, NVMe timeout messages will appear in dmesg and be printed on the screen, similar to:

nvme nvme0: I/O 0 QID 0 timeout, completion polled
nvme nvme1: I/O 8 QID 0 timeout, completion polled
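To check a captured kernel log for this symptom, the messages above can be matched with a simple filter. This is a minimal sketch, assuming the log text is supplied on standard input; the function name `count_nvme_timeouts` is hypothetical and the pattern is derived only from the two lines quoted in this report.

```shell
#!/bin/sh
# Count NVMe "timeout, completion polled" lines in log text read from stdin.
# The pattern mirrors the messages quoted in this report; other kernel
# versions may word them differently (assumption).
count_nvme_timeouts() {
    # grep -c prints the match count; "|| true" keeps a zero count from
    # being treated as a failure.
    grep -E -c 'nvme nvme[0-9]+: I/O [0-9]+ QID [0-9]+ timeout, completion polled' || true
}

# Example usage (as root, e.g. from a rescue shell):
#   dmesg | count_nvme_timeouts
```

A non-zero count during a hung boot suggests the same symptom described here, though it does not by itself prove the IOMMU/SMT cause.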

Workaround

Disable Simultaneous Multi-Threading (SMT) in firmware, via the UEFI BIOS setup screen (SMT is more commonly known by users as Intel's trademark "Hyperthreading", and it's worth mentioning it in the documentation).

Other workarounds include the Xen options dom0_max_vcpus=1 dom0_vcpus_pin, previously described in other bug reports.
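Xen options like these are usually made persistent through the GRUB defaults file. The following is a minimal sketch, assuming dom0 reads `GRUB_CMDLINE_XEN_DEFAULT` from `/etc/default/grub` and that the file is sourced as a shell script by `grub2-mkconfig`; the helper name `add_xen_options` is hypothetical, and the `grub.cfg` output paths in the comments are assumptions that depend on whether the system boots in legacy or UEFI mode.

```shell
#!/bin/sh
# Append Xen options to GRUB_CMDLINE_XEN_DEFAULT in a GRUB defaults file.
# Usage: add_xen_options FILE "opt1 opt2"
add_xen_options() {
    file=$1
    opts=$2
    # The defaults file is sourced as shell, so a new assignment can extend
    # the existing value instead of rewriting the original line.
    line="GRUB_CMDLINE_XEN_DEFAULT=\"\$GRUB_CMDLINE_XEN_DEFAULT $opts\""
    # Do nothing if this exact line was already appended (safe to rerun).
    grep -qxF -- "$line" "$file" && return 0
    printf '%s\n' "$line" >> "$file"
}

# Example usage (as root in dom0; back up the file first):
#   add_xen_options /etc/default/grub "dom0_max_vcpus=1 dom0_vcpus_pin"
# Then regenerate the GRUB configuration (paths are assumptions):
#   grub2-mkconfig -o /boot/grub2/grub.cfg            # legacy boot
#   grub2-mkconfig -o /boot/efi/EFI/qubes/grub.cfg    # UEFI boot
```

Appending a second assignment rather than editing the original line keeps the change easy to spot and revert.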

References

[1] Hang booting Dom0: nvme timeout, completion polled

https://lists.xenproject.org/archives/html/xen-users/2023-03/msg00001.html

[2] Installer does not boot - nvme timeout completion polled

https://forum.qubes-os.org/t/installer-does-not-boot-nvme-timeout-completion-polled/13639/2

[3] GPD Win Max 2 - Unable to boot installer

https://forum.qubes-os.org/t/gpd-win-max-2-unable-to-boot-installer/14466

marmarek commented 1 year ago

Does adding x2apic_phys=true to Xen options help?

axGit234 commented 1 year ago

I recently updated the BIOS of an asrock p570 pro4 and had the same problems you describe. Adding the option x2apic_phys=true resolved the issue, and SMT can still be used. The original BIOS version 3.20 was not affected by this issue.

v6ak commented 1 year ago

So, IIUC, you were able to install the system, just it didn't boot?

I've just experienced a potentially related issue, but my story is a bit different. I haven't done a full installation, just changed the motherboard: same CPU (Ryzen 7 5800X), same SSD, even the same chipset (B550), but a different MoBo (ASRock B550 Phantom Gaming 4/AC to ASUS TUF Gaming B550-plus). After changing the MoBo, I needed to run efibootmgr in order to make it bootable, so I tried to boot rescue from the USB installer, and it failed.

Inspired by this issue, I've disabled SMT and it booted quickly then.

I can do some further experiments with the current MoBo. I could theoretically also experiment with the old MoBo, but I don't wish to change MoBos back and forth.

EDIT: Also, the SMT off in BIOS prevents suspend, or at least the BIOS GUI mentions it.

marmarek commented 1 year ago

> IIUC, the system rescue doesn't use Xen (xentop just hangs), so it might not be Xen-related.

It does. But xenstored is not running, so most Xen tools won't work (but xl info and xl dmesg do, and this is what we care about in the installer).
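Because xl info still works in that environment, it can be used to confirm whether a Xen option such as x2apic_phys=true is actually active for the current boot. This is a minimal sketch, assuming `xl info` prints a `xen_commandline : ...` line; the helper name `xen_cmdline_has` is hypothetical.

```shell
#!/bin/sh
# Read `xl info` output on stdin and succeed if OPTION appears as a whole
# word on the Xen command line.
# Usage: xl info | xen_cmdline_has OPTION
xen_cmdline_has() {
    option=$1
    # Keep only the value after "xen_commandline :", split it into one
    # option per line, then look for an exact match.
    sed -n 's/^xen_commandline[[:space:]]*:[[:space:]]*//p' \
        | tr ' ' '\n' \
        | grep -qxF -- "$option"
}

# Example usage (as root):
#   xl info | xen_cmdline_has x2apic_phys=true && echo "option is active"
```

Matching whole words avoids false positives from options that merely share a prefix with the one being checked.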