QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

QubesOS 4.2 suspend and pcie device assignment (also in case of latest kernel, boot) BROKEN on GPD WinMax 2 devices (G1619-03 G1619-04) culprit device narrowed down #9584

Open LindaFerum opened 5 days ago

LindaFerum commented 5 days ago

Qubes OS release

Qubes OS 4.2 (fully updated)

Xen 4.17.5

kernel 6.6.54-1.qubes.fc37.x86_64 (sort of works, but with problem, see below)

kernel 6.11.2-1.qubes.fc37.x86_64 (same problem but arises immediately upon entering user's password at desktop screen, making system completely unusable)

Hardware as identified by Qubes OS itself (HCL)

Brand: GPD Model: G1619-04

CPU: AMD Ryzen 7 6800U with Radeon Graphics Chipset: Advanced Micro Devices, Inc. [AMD] Family 17h-19h PCIe Root Complex [1022:14b5] (rev 01) Graphics: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] [1002:1681] (rev c1) (prog-if 00 [VGA controller])

RAM: 29437 Mb

QubesOS version: R4.2.3 BIOS: 1.05 Kernel: 6.6.54-1 Xen: unknown (this is weird ? )

Hardware, according to device model ID written on unit's bottom cover: Model G1619-03 (the mismatch between model ID on the physical unit and the one reported by Qubes is peculiar) AMD Ryzen 7 6800U with Radeon Graphics 32 GB RAM Radeon 680M

Brief summary

The initial problem was that the system was unable to wake up from sleep (even after disabling every wakeup source except the keyboard, since spurious wakeups commonly keep this laptop from sleeping).

The device would go to sleep normally, but during wakeup it would enter "maximum performance" mode (loud fan, hot chassis) and the screen would go black.

Later, similar behavior manifested when assigning a particular PCIe device to a VM

The culprit device is identified as AMD Rembrandt USB4 XHCI controller #4 (XHC4 in acpitool), aka pci:0000:74:00.4

If this device is assigned to a VM and the VM starts, the laptop exhibits the same sudden "extremely hot and noisy" burst, followed by the screen rapidly becoming "laggy" and finally going black. After the screen goes black, the device can only be recovered by a forced reboot.

Assignment options (permissive, strict reset) do not help.

Later I decided to try it with kernel-latest (6.11.2-1.qubes.fc37.x86_64) and situation is much, much worse there.

The "burst of heat and noise followed by black screen" arises immediately after the system boots to desktop, making the device unusable.

Steps to reproduce

1) Have a Ryzen 7 6800U computer with an AMD Rembrandt USB4 XHCI controller #4 device, ideally a GPD Win Max 2 model number G1619-03 (though G1619-04 may also be affected)

2) Assign that AMD Rembrandt USB4 XHCI controller #4 to any VM

3) Start the aforementioned VM

4) observe immediate degradation of performance, weird noise behavior, screen going black and complete lockup
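For anyone trying to reproduce this, steps 2 and 3 can be scripted roughly as below. This is only a sketch: it builds the `qvm-pci` and `qvm-start` command lines without running them, the VM name `worst-usb` is a placeholder, and the helper names `qubes_pci_ident` and `repro_commands` are made up for illustration. It assumes the `dom0:BB_DD.F` device identifier form that `qvm-pci` uses in Qubes 4.2.

```python
import re

# Accepts 'pci:0000:74:00.4', '0000:74:00.4' or '74:00.4'.
_BDF = re.compile(
    r"(?:pci:)?(?:[0-9a-fA-F]{4}:)?"
    r"([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def qubes_pci_ident(address: str, backend: str = "dom0") -> str:
    """Convert a PCI address into the 'dom0:BB_DD.F' form (assumed qvm-pci syntax)."""
    match = _BDF.fullmatch(address)
    if match is None:
        raise ValueError(f"not a PCI address: {address!r}")
    bus, dev, fn = match.groups()
    return f"{backend}:{bus}_{dev}.{fn}"


def repro_commands(vm: str, address: str) -> list[list[str]]:
    """Return the command lines for 'assign the device, then start the VM'."""
    ident = qubes_pci_ident(address)
    return [
        ["qvm-pci", "attach", "--persistent", vm, ident],
        ["qvm-start", vm],
    ]


if __name__ == "__main__":
    # 'worst-usb' is a placeholder VM name; the address is the one reported above.
    for cmd in repro_commands("worst-usb", "pci:0000:74:00.4"):
        print(" ".join(cmd))
```

The commands are only printed, so the script is safe to run in dom0 before deciding to trigger the lockup on purpose.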

ALTERNATIVELY

1) Same as above

2) Without assigning the culprit device to any VM, try suspending to RAM

3) Attempt wake-up

4) Observe a black screen from which the system cannot recover

ALTERNATIVELY

1) Same as above

2) Just run QubesOS with the 6.11.2-1.qubes.fc37.x86_64 kernel

3) Observe immediate degradation and a black screen upon reaching the user's password entry

All variants reproduce reliably

Expected behavior

Suspend and resume working, at the very least (ideally assignment of the controller to a VM would work too, but I can sort of live without that, since some other USB controllers can be assigned without problems)

6.11.2-1.qubes.fc37.x86_64 kernel working too

Actual behavior

Suspend and resume cause immediate disaster (black screen), presumably due to AMD Rembrandt USB4 XHCI controller shenanigans

6.11.2-1.qubes.fc37.x86_64 fails the same way regardless of suspend/resume/VM assignment, presumably due to the same device

I don't know which logs would be appropriate here and how to best catch them, but given that I can reliably reproduce the behavior, please let me know how to grab the logs that are most likely to be useful and I will do my best to grab them

LindaFerum commented 4 days ago

Okay ! I did catch some logs (and also figured out how to make it boot up with kernel 6.11.2-1.qubes.fc37.x86_64 - kinda - apparently the trick seems to be not to autostart any VMs with PCI passthrough)

First, a log of trying to pass the annoying AMD Rembrandt USB4 XHCI controller #4 to a VM (named "worst-usb"), causing an eventual hang and black screen.

journalctl: journalctl-bad-usb-launch.txt

and the VM itself: guest-worst-usb.log

Now, other USB controllers also have problems if you start the VMs they are assigned to too early (same apparent symptoms: rapid degradation, black screen and lockup)

Journalctl:

sudden-fail-other-USB-journalctl.txt

And finally, the one that annoys and vexes me the most: the journalctl from the situation where the machine is made to go to sleep and then awoken. Behaviorally, the keyboard lights up, the fan spins up wildly and the LED goes to the "normal operation" (non-blinking) signal, BUT the log looks like it never even tried to wake up. A terribly unpleasant conundrum, help would be very appreciated.

sleep-wakeup-failure.txt

marmarek commented 4 days ago

Based on the PCI device address (73:00.4), this USB controller seems to be part of your GPU (73:00.0). I guess the GPU (or its driver) is not happy about having part of it taken away. Theoretically they should work separately, but in practice some devices do assume that different functions of the same device are handled by the same kernel/VM. Looks like you got such a case here.

In practice, it means you need to keep all the 73:00.* devices in the same place and not assign some of them to sys-usb - if that's dom0, be it dom0. It isn't ideal for security, but well, looks like your hardware doesn't allow any better.

To limit the impact of those USB controllers that stay in dom0, add rd.qubes.hide_pci=73:00.3,73:00.4 to the kernel cmdline so that normal drivers do not attach to them. That of course assumes you don't need to use them (a monitor on the USB-C port should still work).
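To illustrate the "functions of the same device" point, here is a minimal Python sketch that groups PCI addresses by bus:device, so sibling functions such as 73:00.0 and 73:00.4 end up together, and then formats the kernel cmdline option for the ones to hide. The helper names are made up for this example, the address list is typed in by hand rather than read from the system, and the option spelling `rd.qubes.hide_pci` is the one I believe the Qubes initramfs parses, so treat it as an assumption and check it against your installation.

```python
from collections import defaultdict

# Assumed spelling of the Qubes initramfs option; verify before relying on it.
HIDE_PCI_OPTION = "rd.qubes.hide_pci"


def slot_of(bdf: str) -> str:
    """Return the 'bus:device' part of a 'bus:device.function' address."""
    return bdf.rsplit(".", 1)[0]


def group_by_slot(bdfs: list[str]) -> dict[str, list[str]]:
    """Group PCI functions that belong to the same physical device."""
    groups: dict[str, list[str]] = defaultdict(list)
    for bdf in bdfs:
        groups[slot_of(bdf)].append(bdf)
    return dict(groups)


def hide_pci_param(bdfs: list[str]) -> str:
    """Format the kernel cmdline fragment that hides the given functions."""
    return f"{HIDE_PCI_OPTION}=" + ",".join(bdfs)


if __name__ == "__main__":
    # Addresses taken from this thread; extend the list with your own lspci output.
    devices = ["73:00.0", "73:00.3", "73:00.4"]
    for slot, functions in group_by_slot(devices).items():
        print(slot, "->", ", ".join(functions))
    # Hide only the USB controller functions, not the GPU function itself.
    print(hide_pci_param(["73:00.3", "73:00.4"]))
```

Any group that contains the GPU function is one whose members should all stay in the same domain.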