QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

sys-gui-gpu setup causes a hard system reset (and never works) #8657

Status: Open. emanruse opened this issue 10 months ago

emanruse commented 10 months ago

Qubes OS release

4.1.2

Brief summary

Setting up sys-gui-gpu does not result in a working GPU GUI domain but creates the issues described below.

Steps to reproduce

Follow the documentation to set up a GPU GUI domain (sys-gui-gpu). (The computer has an Intel VGA adapter integrated in the CPU.)

Expected behavior

As per documentation.

Actual behavior

After rebooting, I was unable to use sys-gui-gpu: after login, it is not running. The moment I attempt to start it manually, the whole computer undergoes a hard reset. I retried the whole procedure, but that didn't change anything, except that the second time the system reset happened instantly at the last step:

sudo qubesctl state.sls qvm.sys-gui-gpu-attach-gpu

After booting, subsequent attempts to start sys-gui-gpu resulted in the same hard system reset. I looked at the /var/log/qubes/*sys-gui-gpu.log files but saw nothing informative:

/var/log/qubes/qubesdb.disp-mgmt-sys-gui-gpu.log
vchan closed
reconnecting
vchan closed
/var/log/qubes/qrexec.disp-mgmt-sys-gui-gpu.log
/var/log/qubes/qubesdb.sys-gui-gpu.log
vchan closed
reconnecting
vchan closed
/var/log/qubes/mgmt-sys-gui-gpu.log
2023-09-18 14:11:58,819 calling 'state.highstate'...
2023-09-18 14:12:47,405 output: sys-gui-gpu:
2023-09-18 14:12:47,406 output: ----------
[...] 
2023-09-18 14:12:47,408 output: Summary for sys-gui-gpu
2023-09-18 14:12:47,408 output: ------------
2023-09-18 14:12:47,409 output: Succeeded: 4 (changed=4)
2023-09-18 14:12:47,409 output: Failed:    0
2023-09-18 14:12:47,409 output: ------------
2023-09-18 14:12:47,409 output: Total states run:     4
2023-09-18 14:12:47,409 output: Total run time:  66.089 ms
2023-09-18 14:12:47,409 exit code: 0
2023-09-18 15:10:16,171 calling 'state.highstate'...
/var/log/qubes/qrexec.sys-gui-gpu.log
2023-09-18 14:12:47.509 qrexec-daemon[46478]: qrexec-daemon.c:1264:main: qrexec-agent has disconnected
domain dead
2023-09-18 14:12:48.469 qrexec-daemon[46478]: qrexec-daemon.c:1149:handle_agent_restart: cannot connect to qrexec agent: No such process
2023-09-18 14:12:48.469 qrexec-daemon[46478]: qrexec-daemon.c:1266:main: Failed to reconnect to qrexec-agent, terminating

(Potentially) related issue:

https://github.com/QubesOS/qubes-issues/issues/8655

DemiMarie commented 10 months ago

You will need to ensure that the relevant kernel driver (such as i915) is not loaded in dom0. GPU drivers generally cannot be unloaded or unbound from the respective devices.
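(As a quick check, something along these lines in a dom0 terminal should show whether a GPU driver is currently loaded and bound; the module names below are just the usual Intel/NVIDIA/AMD ones:)

# dom0: which driver, if any, currently owns the GPU?
lspci -k | grep -EA3 'VGA|3D|Display'    # look at the "Kernel driver in use" line
lsmod | grep -E 'i915|nouveau|amdgpu'    # is the module loaded at all?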

emanruse commented 10 months ago

What does this mean? What do drivers in dom0 have to do with the GPU if it is supposed to be handled by sys-gui-gpu (and its kernel)?

Also, how do I ensure that? If it must be done, why is it not explained in the docs? How do others (supposedly following the docs) run sys-gui-gpu? Do they all handle this (by following some unofficial doc), or has nobody had success so far?

This is quite confusing. Could you please clarify?

lkubb commented 10 months ago

Some comments follow. Note that I have not set up either sys-gui variant so far, but I have some experience with GPU passthrough on Linux in general; most of it should apply to Qubes OS as well.

What does this mean? What do drivers in dom0 have to do with the GPU if it is supposed to be handled by sys-gui-gpu (and its kernel)?

If a driver in dom0 attaches to the GPU first, the device is bound to dom0 and cannot be reset and unbound in order to attach it to any other VM. Thus you need to tell dom0 not to load the driver for your iGPU, which prevents dom0 from binding to it.

Also, how do I ensure that?

In general, you can prevent drivers from being loaded by blacklisting them, either in a modprobe config file (if the driver is loaded after the root filesystem is mounted) or by passing a kernel parameter (if the driver is loaded from the initramfs; this works for the other case as well). For an integrated GPU, I would opt for the latter. For general instructions on where/how to add this, see https://github.com/Qubes-Community/Contents/blob/master/docs/troubleshooting/nvidia-troubleshooting.md#disabling-nouveau. The only kernel parameter you will likely need is rd.driver.blacklist=i915, as per the link above and the Proxmox docs. To find exactly which driver(s) you need to blacklist, run lspci -k | grep -EA3 'VGA|3D|Display' in dom0 and look for your iGPU; the "Kernel driver in use" and "Kernel modules" fields will tell you the name.
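(For completeness, a sketch of the modprobe-config route mentioned above; the file name is arbitrary, and on its own this does not cover drivers loaded from the initramfs, which is why the kernel parameter is preferable for an iGPU:)

# dom0: /etc/modprobe.d/blacklist-igpu.conf  (file name is just an example)
blacklist i915
# "blacklist" only stops alias-based autoloading; to refuse explicit loads as well:
install i915 /bin/false

# rebuild the dom0 initramfs so the change is (hopefully) picked up there too:
sudo dracut -f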

Note that this is an advanced topic and you might end up with an unusable system if you do not know what you're doing.

The other questions I cannot comment on.

emanruse commented 10 months ago

Thank you for the information.

I have blacklisted nouveau on Linux systems before (on different hardware). However, considering the specifics of Qubes OS, I wonder whether a failure would leave me in a text-only mode (where one can undo things) or with a system that persistently fails to boot. I am afraid of ending up in the latter state, as I would not know how to restore the system to a working condition.

I wonder if it would be safer (and possible on Qubes OS) to do this instead:

https://access.redhat.com/solutions/41278#EarlyBootStageModuleUnloading

as it will be a volatile change.
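(For reference, a one-boot test along those lines is possible without touching any config files; roughly, assuming the GRUB2 boot menu that Qubes 4.1 uses by default:)

# 1. At the GRUB menu, highlight the Qubes entry and press 'e'.
# 2. Find the line that loads the dom0 kernel (it starts with "module2" or "module"
#    and references vmlinuz) and append at its end:
#      rd.driver.blacklist=i915 module_blacklist=i915
# 3. Press Ctrl+X (or F10) to boot. The change lasts for this boot only.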

Before trying anything, I hope someone who has done all that can clarify that part too.

patchMonkey156 commented 5 months ago

Same issue. Selecting the sys-gui-gpu login option causes my machine to hang and reboot.

I have to block i915 and nouveau in the grub2 config, with those items [rd.blacklist=i915,nouveau and module.blacklist=i915,nouveau] also added in /etc/default/grub so the change survives kernel upgrades. I can clean up after my heavy-handedness later.
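(For reference, a sketch of the persistent variant; the parameter names here follow the dracut/kernel docs, which spell them slightly differently from the items quoted above, and the grub.cfg path depends on whether dom0 boots via legacy BIOS or EFI:)

# dom0: /etc/default/grub ("..." stands for whatever options were already there)
GRUB_CMDLINE_LINUX="... rd.driver.blacklist=i915,nouveau module_blacklist=i915,nouveau"

# regenerate the config so it survives kernel upgrades:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg            # legacy BIOS boot
sudo grub2-mkconfig -o /boot/efi/EFI/qubes/grub.cfg    # EFI boot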

I am allowed to enter my disk password as normal.

Before blacklisting those modules:

I got this far by creating and deleting sys-gui according to the guide. sys-gui worked (with errors) after additionally running 'qubes-prefs default_guivm sys-gui'.

I deleted sys-gui following those directions and then followed them again for sys-gui-gpu. I then log in in dom0 mode and rename sys-gui-gpu to sys-gui so that the lightdm login option becomes available to try.

At that point I run into the specific error described in this bug report.

After blacklisting those modules, but before auto-starting sys-gui(-gpu):

However, now the machine fails to hand over to the Plymouth login screen. With and without qubes.skip_autostart, the result is the same.

Switching to another virtual terminal (Alt-F2) gives me a chance to log in to dom0 and start sys-gui, but the mouse and keyboard are then occupied by dom0, so I am locked out. Note that sys-gui is also on a terminal.

I see a strange artifact of the blinking _ from sys-gui when I Alt+F2 back to dom0. My screen is also shifted to the right, with the cut-off right portion appearing on the left. sys-gui has taken hold, and sharing the display is awkward.

Auto-starting sys-gui(-gpu) at boot time yields a black screen, which, in its own way, might be progress.

Is there a way to add sys-gui-gpu to dracut and hand over the mouse, keyboard, and lightdm manually from the terminal?

Because if I can get that, we can add it to dracut and perhaps add an rd.revertGUI flag to pass to dracut in GRUB for maintenance in a dom0 GUI.

I got vanilla sys-gui to load applications manually by following the directions from this link, running "dom0$: qvm-sync-appmenus " and starting with the templates and standalone VMs. I fully expect to do the same once this issue is solved.
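(A minimal sketch of that step, assuming it is run in a dom0 terminal; the VM names are placeholders for your own templates and standalones:)

# dom0: regenerate application menus for each TemplateVM / StandaloneVM
for vm in fedora-38 debian-12 my-standalone; do    # placeholder names
    qvm-sync-appmenus "$vm"
done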

Time to keep hacking at my laptop. I'm sure the /srv/formulas/base/virtual-machines-formula/qvm directory can tell me what I need to know.

In related news, the biggest reason we cannot update these scripts to fedora-39-xfce is that the package "sys-gui-xfce" is missing in 39. Disappointing, but nothing overwhelming.

preland commented 2 months ago

Here is what I've been trying for the last week or so, to no avail.

(System: AMD Ryzen 5700U, using iGPU)

It does not seem that the GPU is actually handed off to sys-gui-gpu at all (sort of; explained further below). Blacklisting the amdgpu kernel module results in the device being bound to pciback in dom0 on the next boot. sys-gui-gpu does see the device, and it is even bound to amdgpu within the guest. However, the guest doesn't actually seem to be able to display anything.
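(A sketch of the checks behind that observation, in case someone wants to reproduce it; the BDF 05:00.0 is the one from this comment, and all commands are run from a dom0 terminal:)

# dom0: what owns the device locally, and is it attached to the GUI VM?
lspci -ks 05:00.0          # expect "Kernel driver in use: pciback" once hidden from dom0
qvm-pci                    # lists PCI devices known to dom0 and their VM assignments

# inside sys-gui-gpu: did the guest driver actually bind?
qvm-run -p sys-gui-gpu 'lspci -k'
qvm-run -p sys-gui-gpu 'sudo dmesg | grep -i amdgpu | tail -n 20'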

A lot of my confusion comes partly from the fact that the screen I'm using to debug is also driven by the iGPU, so something is in fact rendering to it. But neither dom0 nor sys-gui-gpu is running an X server. Killing sys-gui-gpu does not stop the display, and starting it back up doesn't cause any visible change on the display (which is odd, as starting it reattaches the GPU to sys-gui-gpu). Running anything in dom0 that needs to display fails.

I think the reason could be that FLR is not enabled on my GPU; at least, that's the best explanation I can come up with. Here are the extra GRUB kernel options I used to reach the results above:

iommu=on iommu_amd=on rd.driver.blacklist=amdgpu rd.qubes.hide_pci=05:00.0 xen-pciback.hide(05:00.0) xen-pciback.passthrough=1 xen-pciback.permissive

I've tried a few different permutations of the above with no success. It should also be noted that this was run with sys-gui-gpu set to auto-start with the iGPU passed through in permissive mode. If auto-start was disabled, starting sys-gui-gpu after boot would hard-reset the system. Blacklisting pciback also hard-resets the system.
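(One detail that may be worth double-checking, going by the Xen pciback and dracut parameter documentation rather than anything Qubes-specific: the hide option is normally written with an equals sign and, for xen-pciback, the full BDF, with the other options left as above, e.g.:)

xen-pciback.hide=(0000:05:00.0)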

Also, sometimes blacklisted modules would just... not be blacklisted? I don't know; this whole experience has been really weird.

DemiMarie commented 2 months ago

@preland What happens if you boot with nomodeset on the Linux kernel command line?

AMD does not support PCI passthrough for their client GPUs so I am not surprised that you are having problems.

preland commented 2 months ago

Running with that option seems to get things further; it now fails closer to the heart of the issue: X can't find any displays. (Edit: to be more precise, lightdm attempts to start and then promptly exits after attempting to launch X. There is also a message implying that Plymouth doesn't exit, even though it seems to stop running anyway; likely unrelated.) This occurs both when attempting to start lightdm in sys-gui-gpu (using amdgpu as the module) and in dom0 (which uses pciback). The only notable difference between the two is that running lightdm/X in dom0 actually "moves" the screen from the "F2" terminal to the "F1" screen; sys-gui-gpu gives no feedback at all.
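(Assuming the standard Fedora/Xorg locations, the exact reason X gives up is usually visible in one of these, either in dom0 or inside sys-gui-gpu depending on where lightdm was started:)

journalctl -b -u lightdm                 # lightdm service messages for the current boot
less /var/log/lightdm/lightdm.log
less /var/log/Xorg.0.log                 # the "no screens found" cause is usually near the end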

By "does not support", do you mean that it isn't something they have set up driver-wise, or that their devices don't support PCI passthrough at all? (Checking, because this is the first I'm hearing of either.)

DemiMarie commented 2 months ago

IIRC the GPU driver relies on the PSP for firmware loading, and the PSP cannot be assigned to a VM because it is not behind an IOMMU. So AMD iGPUs cannot be assigned to a VM. In the future, this will be solved by having a PSP driver built into Xen. I think.

preland commented 2 months ago

Ah, that's a shame. I'll look forward to that being implemented in Xen (as well as GPU acceleration, when that happens :)

I'm rather new to Xen, so some things about it confuse me. How is the way dom0 handles PCI devices different from how guests handle them? dom0 doesn't have any of the same issues the guests do, even though dom0 is just a VM with administrative control over the other VMs (I think?). Is it simply a matter of dom0 receiving the PCI devices first? I'm just trying to wrap my head around how this all works 😂

marmarek commented 2 months ago

dom0 doesn't have any of the same issues the guests do, even though dom0 is just a VM with administrative control over the other VMs

Usually it's because dom0 sees the full(er) picture: it sees the other PCI devices (which is likely relevant for the PSP), it sees the full host ACPI tables, it sees actual physical addresses, and so on. Ideally, each device would have everything it needs on its own, but in practice several devices (especially those built into the CPU) rely on extra information or helpers from other subsystems. And finally, there are also driver bugs due to various assumptions about how a physical system usually looks (like which devices appear together, or which address space they use) that are simply not true in a virtual machine that sees only a subset of such a system.