QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Add warning re PCI numbering changes to "Devices" tab in QM #8127

Open unman opened 1 year ago

unman commented 1 year ago

The problem you're addressing (if any)

PCI numbering may change if devices are added or altered (cf. #7792). This means that the device actually allocated to a qube may change.

The solution you'd like

Absent a full solution, at least provide a warning when a device allocation is made. A warning could also be added to the docs.

The value to a user, and who that user might be

Any user who may make hardware changes will have been warned of possible issues when they first make device allocations. Whether they remember that warning is a separate issue.

Cf. forum discussion: https://forum.qubes-os.org/t/usability-issues-with-hardware-changes-on-system-qubes/17676

marmarta commented 1 year ago

Hm, I get the idea, but also: people will 100% forget this warning. Nobody remembers warnings from - possibly - months before.

DemiMarie commented 1 year ago

@marmarek @HW42 Can assignment be based on PCIe path rather than bus-slot-function?
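
To illustrate the distinction raised here: a bus-slot-function address embeds the bus number, which is assigned during enumeration and can shift when hardware changes, whereas the chain of slot.function hops from the root bus typically stays the same as long as the device remains in the same physical slot. A minimal Python sketch, assuming the device's resolved sysfs path is available; the function name `pci_path` and the output format are hypothetical and not what Qubes OS uses:

```python
import re

# Matches one sysfs path component of the form dddd:bb:ss.f (domain:bus:slot.function).
_BDF = re.compile(r"^([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$")


def pci_path(sysfs_path: str) -> str:
    """Build a bus-number-independent identifier from a resolved sysfs device path.

    Only the root (domain:bus of the first hop) and the slot.function of each
    hop are kept; the bus numbers of intermediate and final hops are dropped.
    """
    root = None
    hops = []
    for part in sysfs_path.strip("/").split("/"):
        match = _BDF.match(part)
        if not match:
            continue  # skip non-device components such as "sys" or "pci0000:00"
        domain, bus, slot, function = match.groups()
        if root is None:
            root = f"{domain}:{bus}"
        hops.append(f"{slot}.{function}")
    if root is None:
        raise ValueError(f"no PCI address found in {sysfs_path!r}")
    return root + "/" + "/".join(hops)


# The same physical slot, enumerated once as bus 03 and once as bus 04.
before = "/sys/devices/pci0000:00/0000:00:1c.0/0000:03:00.0"
after = "/sys/devices/pci0000:00/0000:00:1c.0/0000:04:00.0"
print(pci_path(before))  # 0000:00/1c.0/00.0
print(pci_path(before) == pci_path(after))  # True
```

Both inputs yield the same identifier even though the bus-slot-function address changed from 03:00.0 to 04:00.0.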

deeplow commented 1 year ago

Yes, a warning is probably not the way to go about this. A technical solution is much better than documentation or notes in the user interface.

I also see that PCIe device numbering (within the VM) is sometimes randomized when the host boots. As a result, some of my HVMs boot on some occasions and fail to boot on others (the passed-through graphics card is assigned to one slot).

@neowutran provides a way to circumvent that (changing the Xorg config on boot) here, but that's cumbersome. In my case I just connect the qube to a netVM (instead of none), which adds a virtual PCIe network card ahead of the GPU, so the GPU ends up in the right place.

A permanent solution to this (having some sort of PCIe numbering attribution) would make it easier to use. Maybe the slots could be user-assignable? Here's a (terrible) mockup of what I mean:

[image: penpot mockup]

(for those who can't see the picture, it's basically a list of devices on the left and numbered slots on the right, where the user can drag and drop the devices).

v6ak commented 11 months ago

I was hit by this issue.

My story behind that

I am trying to resolve cooling issues with my SSD. It sat under the GPU, so I could not add a heatsink. So I decided to move the GPU.

What happened

Qubes OS started booting as usual. However, after I entered the password, the system always rebooted after a few seconds.

After some trial and error (including reinserting the GPU), I added the nomodeset parameter to the kernel command line in GRUB. Qubes OS was then able to boot. This also helped me identify the root issue: devices had been renumbered, and the GPU was assigned to my NetVM instead of one of my network cards. (I am not sure how nomodeset could have helped.)

Further investigations

I looked at the qvm-pci output to find out whether there were any other unwanted assignments.

Proposed solution

  1. Store some additional metadata about the assigned PCI device, so we can detect when a different PCI device has taken its place. (It might not be perfect: there may be two PCI devices of the same type. If they get swapped, this might not be detected.)
  2. Add an extra check, so that when a qube with PCI devices boots, Qubes OS detects that a different PCI device is present. In this case, Qubes OS shouldn't assign the device and should show an alert. (I don't have a strong opinion on whether it should prevent the whole qube from starting, or whether the qube should start without that one PCI device. Maybe it shouldn't start at all, because of in-qube PCI device renumbering.)
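
The two steps above could be sketched as follows. This is only an illustration in Python under stated assumptions: the `PciFingerprint` fields (vendor ID, device ID, class code), the `find_mismatches` name, and the idea of recording a fingerprint at assignment time are all hypothetical, not existing Qubes OS behavior. The ID values in the example are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PciFingerprint:
    """Hypothetical metadata recorded when a device is assigned to a qube."""
    vendor_id: str
    device_id: str
    class_code: str


def find_mismatches(
    recorded: dict[str, PciFingerprint],
    present: dict[str, PciFingerprint],
) -> dict[str, str]:
    """Compare recorded assignments with what is currently at each address.

    Returns a mapping of PCI address to a short reason for every assignment
    that should not be honored without user confirmation.
    """
    problems = {}
    for address, expected in recorded.items():
        actual = present.get(address)
        if actual is None:
            problems[address] = "device missing"
        elif actual != expected:
            problems[address] = "different device"
    return problems


# A network card was recorded at 03:00.0, but a GPU now sits at that address.
recorded = {"03:00.0": PciFingerprint("8086", "15f3", "020000")}
present = {"03:00.0": PciFingerprint("10de", "1c82", "030000")}
print(find_mismatches(recorded, present))  # {'03:00.0': 'different device'}
```

As step 1 already notes, this kind of check cannot tell two identical devices apart if they swap addresses, since their fingerprints would be equal.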

Workaround

  1. Before changing the HW configuration, disable all autostarts.
  2. Adjust the HW configuration and boot.
  3. Check the output of qvm-pci to verify that all devices are assigned correctly.
  4. Restore autostarts.

Note that this workaround might not be applicable if a PCI device dies, as the user cannot do step 1 in advance (assuming the user cannot predict the device's death). EDIT: It seems that you can disable autostarts even if you haven't done step 1 in advance. (I haven't tested that, though.)
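
Steps 1 and 4 of the workaround could be scripted. The Python sketch below only builds the `qvm-prefs` command lines for disabling and later restoring autostart; it does not run them. The input mapping of qube names to their current autostart values is assumed to have been collected beforehand, and the helper name `autostart_commands` and the qube names in the example are hypothetical.

```python
def autostart_commands(
    autostart: dict[str, bool],
) -> tuple[list[list[str]], list[list[str]]]:
    """Return (disable, restore) command lists for qubes that currently autostart.

    Qubes with autostart already off are left alone, so restoring does not
    switch on anything that was off before the hardware change.
    """
    enabled = sorted(name for name, on in autostart.items() if on)
    disable = [["qvm-prefs", name, "autostart", "False"] for name in enabled]
    restore = [["qvm-prefs", name, "autostart", "True"] for name in enabled]
    return disable, restore


# Example input: current autostart setting per qube (names are made up).
disable, restore = autostart_commands(
    {"sys-net": True, "sys-usb": True, "work": False}
)
for command in disable:
    print(" ".join(command))
```

The `restore` list would be saved and run in dom0 only after step 3 confirms that the qvm-pci output looks correct.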