QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

rd.qubes.hide_pci doesn't work anymore after upgrade to Qubes 4.1 #7976

Open yojoe opened 1 year ago

yojoe commented 1 year ago

Qubes OS release

4.1

Brief summary

Hiding a secondary GPU (AMD RX 580) from dom0 via the GRUB kernel command line no longer works in Qubes 4.1. It worked on the same system with Qubes 4.0.

Steps to reproduce

Set /etc/default/grub to hide the AMD Radeon RX 580 VGA and Audio devices from dom0 and regenerate grub.cfg.
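A minimal sketch of this step, assuming /etc/default/grub is sourced as a shell file so that a later assignment can extend the earlier value. The helper name append_hide_pci is hypothetical and not part of Qubes OS, and the grub.cfg output path in the usage comment depends on the installation (BIOS vs. EFI).

```shell
# Hypothetical helper: append a rd.qubes.hide_pci argument to a GRUB defaults file.
append_hide_pci() {
    grub_defaults="$1"   # e.g. /etc/default/grub
    bdf_list="$2"        # comma-separated bus:device.function list, e.g. 01:00.0,01:00.1
    # Single quotes keep $GRUB_CMDLINE_LINUX literal; it is expanded when GRUB sources the file.
    printf 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX rd.qubes.hide_pci=%s"\n' "$bdf_list" >> "$grub_defaults"
}

# Usage in dom0 as root, then regenerate grub.cfg:
#   append_hide_pci /etc/default/grub 01:00.0,01:00.1
#   grub2-mkconfig -o /boot/grub2/grub.cfg   # assumed path; EFI installs may use another one
```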

After rebooting, verify via cat /proc/cmdline that the argument is present and free of typos.

$ cat /proc/cmdline 
placeholder root=/dev/mapper/qubes_dom0-root ro rd.luks.uuid=... rd.lvm.lv=qubes_dom0/root rd.lvm.lv=qubes_dom0/swap 
rd.qubes.hide_pci=01:00.0,01:00.1 xen-pciback.passthrough=1 i915.alpha_support=1 rhgb quiet plymouth.ignore-serial-consoles
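To rule out typos mechanically, the argument can be extracted from the command line and printed one device per line. This is only a sketch; hide_pci_from_cmdline is a hypothetical helper name, and it assumes arguments are separated by single spaces.

```shell
# Hypothetical helper: print each device listed in rd.qubes.hide_pci, one per line.
hide_pci_from_cmdline() {
    cmdline_file="${1:-/proc/cmdline}"   # defaults to the running kernel's command line
    tr ' ' '\n' < "$cmdline_file" \
        | sed -n 's/^rd\.qubes\.hide_pci=//p' \
        | tr ',' '\n'
}

# Usage:
#   hide_pci_from_cmdline
```

For the command line shown above this should list 01:00.0 and 01:00.1 on separate lines.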

Expected behavior

After reboot the following two PCI devices should not be visible to dom0 and lspci shouldn't enumerate them anymore:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]

Actual behavior

After reboot, lspci in dom0 still enumerates the two PCI devices. The amdgpu kernel module is also loaded (shown in lsmod) and bound to the secondary GPU. Although no display is connected to the secondary GPU and it is "idle", I can hear the fans of the RX 580. If I then try to pass the RX 580 through to an HVM domU, the domU tries to initialize the card, the fans stop spinning, and after a delay of about 10 seconds dom0 crashes/freezes because its active amdgpu module is still bound to the VGA device of the RX 580. AFAIK a dom0 crash is more or less expected when passing through a PCI device that is still bound to a dom0 driver.
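Since lspci lists devices regardless of which driver holds them, it can help to check the bound driver directly (lspci -k reports the same information). A minimal sketch reading the sysfs driver link follows; bound_driver is a hypothetical helper, and the full address with the 0000: domain prefix is assumed.

```shell
# Hypothetical helper: print the name of the driver bound to a PCI device, or "(none)".
bound_driver() {
    bdf="$1"                     # full address, e.g. 0000:01:00.0
    sysfs_root="${2:-/sys}"      # overridable so the function can be tried on a fake tree
    link="$sysfs_root/bus/pci/devices/$bdf/driver"
    if [ -L "$link" ]; then
        # The driver entry is a symlink whose last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}

# Usage:
#   bound_driver 0000:01:00.0
#   bound_driver 0000:01:00.1
```

If the device reports pciback here, dom0 drivers such as amdgpu are not holding it; if it reports amdgpu, the situation described above applies.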

However, if I blacklist the amdgpu module in dom0 via /etc/modprobe.d/, the passthrough to domU works, although the RX 580 PCI devices are still visible to dom0. I thought that maybe amdgpu grabs the VGA device before dracut runs the 90qubes-pciback/qubes-pciback.sh script, which evaluates the rd.qubes.hide_pci GRUB command line argument. But this doesn't seem to be the root cause of the hiding not working. In any case, blacklisting amdgpu fixes the symptom (passthrough not working) but does not restore proper hiding from dom0.
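The blacklisting workaround can be sketched as follows. The helper write_blacklist and the file name blacklist-<module>.conf are assumptions for illustration; any .conf file under /etc/modprobe.d/ should do. If the module is also packed into the initramfs, regenerating it (for example with dracut -f) may be needed as well, which is not confirmed in this report.

```shell
# Hypothetical helper: write a modprobe blacklist entry for one module.
write_blacklist() {
    module="$1"                           # e.g. amdgpu
    conf_dir="${2:-/etc/modprobe.d}"      # overridable for a dry run in a scratch directory
    printf 'blacklist %s\n' "$module" > "$conf_dir/blacklist-$module.conf"
}

# Usage in dom0 as root:
#   write_blacklist amdgpu
```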

dmesg -k | grep "01:00.0" -B10 -A5 doesn't show any obvious errors regarding pciback hiding:

...
[    1.222929] pci 0000:01:00.0: [1002:67df] type 00 class 0x030000
[    1.222972] pci 0000:01:00.0: reg 0x10: [mem 0xe0000000-0xefffffff 64bit pref]
[    1.222996] pci 0000:01:00.0: reg 0x18: [mem 0xf0000000-0xf01fffff 64bit pref]
[    1.223010] pci 0000:01:00.0: reg 0x20: [io  0xe000-0xe0ff]
[    1.223024] pci 0000:01:00.0: reg 0x24: [mem 0xf7e00000-0xf7e3ffff]
[    1.223038] pci 0000:01:00.0: reg 0x30: [mem 0xf7e40000-0xf7e5ffff pref]
[    1.223195] pci 0000:01:00.0: supports D1 D2
[    1.223196] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
...
[    1.234427] pci 0000:01:00.0: vgaarb: bridge control possible
[    1.234428] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    1.234429] vgaarb: loaded
...
[    1.248117] xen: registering gsi 16 triggering 0 polarity 1
[    1.248132] xen: --> pirq=16 -> irq=16 (gsi=16)
[    1.248320] xen: registering gsi 16 triggering 0 polarity 1
[    1.248323] Already setup the GSI :16
...
[    1.282552] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
[    1.282567] PCI: CLS 64 bytes, default 64
[    1.282573] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    1.282574] software IO TLB: mapped [mem 0x000000013e600000-0x0000000142600000] (64MB)
[    1.282622] Trying to unpack rootfs image as initramfs...
...
[    4.901048] pciback 0000:01:00.0: xen_pciback: seizing device
[    4.901128] pciback 0000:01:00.0: enabling device (0000 -> 0003)
[    4.901161] xen: registering gsi 16 triggering 0 polarity 1
[    4.901167] Already setup the GSI :16
...
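The "seizing device" lines in the log above indicate which devices xen-pciback claimed. A small filter can pull those addresses out of the kernel log; seized_by_pciback is a hypothetical helper name, and the pattern is based only on the message format shown in this report.

```shell
# Hypothetical helper: read kernel log text on stdin, print addresses pciback seized.
seized_by_pciback() {
    sed -n 's/.*pciback \([0-9a-fA-F:.]*\): xen_pciback: seizing device.*/\1/p'
}

# Usage:
#   dmesg -k | seized_by_pciback
```

In the excerpt above only 0000:01:00.0 appears in such a line; whether 0000:01:00.1 was seized is hidden behind the elided parts of the log.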

I tried multiple kernel versions in Qubes 4.1 from the kernel and kernel-latest packages, and even the old 5.4 kernel left over from the previous Qubes 4.0 install before the upgrade to 4.1. It makes no difference: hiding the RX 580 from dom0 doesn't work with any of these kernel versions under 4.1, but it worked on 4.0.

Seems like I'm not the only one with this issue/bug: https://forum.qubes-os.org/t/gpu-passthrough-again/14019

3hhh commented 1 year ago

Well, apparently the pciback hook runs too late and there is a race condition with the kernel loading the GPU driver module.

If one checks /usr/lib/dracut/modules.d/, one will see that both 90kernel-modules and 90qubes-pciback exist. If the numbering has any relevance, the race condition is no surprise.
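To see which dracut modules share a numeric prefix, the directory can be listed and filtered. This sketch only shows the tie in numbering; whether the prefix decides the runtime order of the hooks is not established in this thread. modules_with_prefix is a hypothetical helper name.

```shell
# Hypothetical helper: list dracut module directories that start with a given prefix.
modules_with_prefix() {
    prefix="$1"                                      # e.g. 90
    modules_dir="${2:-/usr/lib/dracut/modules.d}"
    ls -1 "$modules_dir" | grep "^$prefix" | LC_ALL=C sort
}

# Usage:
#   modules_with_prefix 90
```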

Anyway, this is pretty bad indeed wrt security, as devices meant for VMs shouldn't get access to dom0.

A bit related: #7886

neowutran commented 1 year ago

On my side, rd.qubes.hide_pci works as expected (R4.1 and development tree). I didn't see a difference from R4.0.

DemiMarie commented 1 year ago

@yojoe can you still reproduce this?

OwOday commented 7 months ago

This is happening to me now, after repairing my GRUB following a period where I could not boot.

DemiMarie commented 7 months ago

Interesting.

OwOday commented 7 months ago

I was able to fix it by unloading the nouveau module (modprobe -r nouveau), for what it's worth.
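A sketch of that workaround with a guard, assuming the module name is nouveau and that nothing else in dom0 is using it. module_loaded is a hypothetical helper; modprobe -r is the standard way to remove a module.

```shell
# Hypothetical helper: succeed if the named module appears in the loaded-module list.
module_loaded() {
    module="$1"
    modules_file="${2:-/proc/modules}"   # overridable for testing
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^$module " "$modules_file"
}

# Usage in dom0 as root:
#   module_loaded nouveau && modprobe -r nouveau
```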

DemiMarie commented 7 months ago

@OwOday what does lspci in dom0 show?

OwOday commented 6 months ago

VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090]