Open Eric678 opened 1 year ago
I suspect this will need a hardware quirk in the installer.
A simpler workaround turns out to be leaving IOMMU disabled during installation (the above MB defaults to auto and does not know about Qubes). Installation then exits seconds before the point where the hardware reset would occur, with a missing-IOMMU error when starting sys-firewall (presumably it was actually sys-net). It exits cleanly, and one can immediately log in and remove the USB controller from sys-usb. I have no idea what I am missing out on in the install; technically this invalidates all further testing of R4.2. It does seem to work rather well, actually...
Quick check on rc3: still there. However, a clean install can be made by adding the "qubes.skip_autostart" option to the vmlinuz line on the 2nd pass of installation. The installer does take notice, but oddly sys-usb is not started while sys-firewall and sys-net are, which is probably a bug. Just take the last USB controller out of sys-usb, start it, and proceed as normal. The only problem I am having with rc3 is USB devices being a bit flaky, which may be related to whatever this problem is.
How would one add the needed quirk to Anaconda?
Is there a phase during installation where the installer boots sys-usb after assigning all usb devices to it?
How would one add the needed quirk to Anaconda?
I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while the user would have the impression they are all isolated in sys-usb (since that was selected during install). The proper solution is ofc to make it not crash. But as a workaround, the user can choose not to create sys-usb during install, and later create it by hand and remove the device from there. This way they will know some device is excluded, and there is no risk of leaving it in dom0 without the user's knowledge. Such instructions should also explain the risk.
But if the device really should stay in dom0, not as a workaround for a crash but as genuinely intended behavior, then we have a mechanism for that: the rd.qubes.dom0_usb=37:00.0 (example value) kernel option. It will leave this controller in dom0, and salt will also respect this setting when creating sys-usb. It can be added to the kernel command line at the start of installation in the GRUB menu (anaconda will carry the kernel option over to the final system too), or maybe set somewhere within anaconda automatically (which I'm very much not convinced is the right thing to do).
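To check which of the two kernel options mentioned in this thread actually reached a booted system, something like the following could be used. It is only a sketch: the helper names `parse_cmdline` and `qubes_usb_options` are made up for illustration, and it treats the value of rd.qubes.dom0_usb as an opaque string rather than assuming any particular format.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value}.

    Bare flags (no '=') map to None.
    """
    opts = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        opts[key] = value if sep else None
    return opts


def qubes_usb_options(cmdline):
    """Report the two Qubes options discussed in this thread (hypothetical helper)."""
    opts = parse_cmdline(cmdline)
    return {
        # Raw value as given on the command line, or None if absent.
        "dom0_usb": opts.get("rd.qubes.dom0_usb"),
        # True if the flag is present at all.
        "skip_autostart": "qubes.skip_autostart" in opts,
    }


# Example string only; on a real system the input would be the
# contents of /proc/cmdline.
sample = "ro rd.qubes.dom0_usb=37:00.0 qubes.skip_autostart"
print(qubes_usb_options(sample))
```

On an installed system the same function can be fed the text of /proc/cmdline, which is a quick way to confirm that anaconda really carried the option over.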
Has this been reported to Gigabyte? I wonder if SMM is getting an interrupt it did not expect to get and crashes as a result.
What if the device was attached to nothing? Don’t assign it to sys-usb, but don’t assign it to any other qube (including dom0) either. Assign it to Xen’s quarantine domain. That might avoid the crash without the security consequences.
Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.
That's highly unlikely. A much more likely cause is either dom0 or xen panic...
And still, I don't want to waste time on elaborate workarounds (there are already a few simple ones in this thread) until we know for sure that a proper fix is not achievable.
Is “assign to quarantine domain” simple or elaborate?
This is reproducible on my 7950X with an Asus Strix X670E-F, so I don't think it's Gigabyte-specific. I also have a 7900XTX, which may not be helping things.
Also happens to me on 7950X with Asrock X670E Steel Legend. I have two USB controllers that cause a reboot -- 16:00.4 and 17:00.0
I have the same issue with my Asus Strix X670E-F. I have one "USB controller" that always causes a reboot: 12:00.0
However, I am not sure what it really is. I tried every USB port on my setup, and everything works without this "USB controller".
(I have two unused internal USB 2.0 ports on my motherboard. I have one USB controller that I can pass through in Qubes OS, but this controller never receives any USB device; I suspect it is the USB controller for my two unused internal USB 2.0 ports.)
For the people having this issue, are you missing any USB port / functionality without the "USB controller" that you cannot pass through?
Result of "sudo lspci -vvs 12:00.0"
12:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8 (prog-if 30 [XHCI])
Subsystem: ASUSTeK Computer Inc. Device 8877
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 46
Region 0: Memory at fc000000 (64-bit, non-prefetchable) [size=1M]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s, Width x16
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
Vector table: BAR=0 offset=000fe000
PBA: BAR=0 offset=000ff000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [450 v1] Lane Margining at the Receiver <?>
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
The uncommon lines in this:
On mine, 17:00.0 is the motherboard LED controller. But since there is no problem when not using sys-usb, it should be a passthrough problem (i.e. IOMMU groups), right?
iommu groups or soft reset
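Both guesses can be looked at from Linux sysfs. The sketch below reads the per-device `reset_method` attribute (present on kernels 5.15 and later) and the `iommu_group` link for a given PCI address. The function name `describe_pci_device` is made up for this example. One assumption to flag: under Xen, dom0 usually does not see IOMMU groups at all, so the group part probably only gives useful output when booted into a plain Linux live system on the same machine.

```python
from pathlib import Path


def describe_pci_device(bdf, sysfs_root="/sys"):
    """Collect reset methods and IOMMU group members for one PCI device.

    bdf is the full address as it appears in sysfs, e.g. "0000:12:00.0".
    Missing attributes are reported as empty values rather than errors.
    """
    dev = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf
    info = {
        "bdf": bdf,
        "reset_methods": [],
        "iommu_group": None,
        "group_members": [],
    }

    # Space-separated list such as "flr bus" or "pm bus".
    reset_file = dev / "reset_method"
    if reset_file.is_file():
        info["reset_methods"] = reset_file.read_text().split()

    # Symlink to .../kernel/iommu_groups/<N>; absent if no IOMMU is visible.
    group_link = dev / "iommu_group"
    if group_link.exists():
        group_dir = group_link.resolve()
        info["iommu_group"] = group_dir.name
        devices = group_dir / "devices"
        if devices.is_dir():
            info["group_members"] = sorted(p.name for p in devices.iterdir())

    return info


if __name__ == "__main__":
    # Address taken from the lspci output earlier in this thread.
    print(describe_pci_device("0000:12:00.0"))
```

If the controller is the only member of its group, grouping is unlikely to be the explanation, and the list of reset methods shows what the kernel would try when the device is handed to a qube.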
4.2-rc4 6.5.6: still there. Behavior is different: normal install, I left the machine for the 2nd pass and when I returned much later it was shut down. Bringing it up with qubes.skip_autostart, there were 3 USB controllers in sys-usb that were unknown, and all had to be removed for it to start. Guessing not everything made it to disk before the reset. On the 2nd try, with qubes.skip_autostart passed to the 2nd pass, it completed the Anaconda progress bar, dropped back to the console, finished systemd-tmpfiles-clean.service, then got stuck at "Job initial-setup.service/start running" for a couple of hours before I reset the machine. Took the last USB controller out of sys-usb and all seemed OK. All USB ports appear to be working (13 exposed on the outside of the motherboard, including mouse and keyboard, + 1 I am using on the motherboard internally). There is definitely a problem with writing to USB storage devices, which I will post separately.
[ed] While writing up that issue I had a different event: an instant power off while typing here. I had been doing various testing on USB ports and had left a storage device plugged into one of the controllers on the 670 chipset. On trying to boot I got the same power off after entering the disk password. Suspecting sys-usb, I took a couple more devices out and could then get up and running. I then noticed the USB drive on the back panel, removed it, and could put those devices back in sys-usb and boot OK. So it looks like all it takes to cause a reset or power off on start is a device plugged into a port that is mapped to sys-usb. I did plug the mouse and keyboard into the only 2 ports that are USB 2.0/1.1, which are on a USB 2.0 hub directly on the CPU, hence my original suspicion.
rc5-latest test did not get very far: debian-12-xfce: qubes.PostInstall service failed. See attached. No other reports? Media OK. Installing encrypted on SATA SSD while another copy (current stable) encrypted on different drive. This worked above for rc4.
4.2.0 6.6.2 did not have the above installation problem. Still get a power off starting sys-usb if the last USB device is mapped. Not getting the power off/reset if a storage device is plugged into another controller when sys-usb is started; however, sys-usb does go into a loop of "device available" / "device removed" notifications every second, which is cleared by removing the storage device. Note that sys-net and sys-firewall are autostarted even if qubes.skip_autostart is passed to the kernel.
I can see the same on Supermicro M11SDV-4C-LN4F, here's log from serial from attempted boot that resulted in hard restart: xen.log
No panic, nothing unexpected in the last lines. I'm not sure why the first lines (5th and 6th) look as they do. I had an issue with another Supermicro board (X11-something) where the output was heavily modified by the BMC (lines printed out of order with heavy jumping with ANSI escape codes, \n without \r or \n after each character depending on BIOS settings etc.), but here everything seems to work reliably, except those two lines.
I can start the OS with qubes.skip_autostart, and sys-usb starts only with the USB controller disabled. Unfortunately, this platform has just one controller and most likely I'll need it at some point.
The same issue applies to Legion 5 Pro: The USB controller has its own IOMMU group without any other device. You can check the IOMMU grouping in the Legion 5 Pro HCL.
Just confirming with a quick test of a 4.2.3 installation with the latest kernel: problem still present. Passing qubes.skip_autostart allows the P2 install to complete (4.2.2 hung on the console after Anaconda). sys-net and sys-firewall still start when they should not. Once up and running, moving the last USB controller back into sys-usb caused a device not found error: two of the other controllers somehow changed their device numbers while sys-usb was restarted. Moving the updated devices in then caused the hardware reset on starting sys-usb.
There are multiple problems here:
Can confirm this issue is present on latest QubesOS release 4.2.3 with latest kernel on this system: MSI Tomahawk X670E WIFI AMD Ryzen 9 7950X
disabling autostart was required for a successful installation, otherwise system would hang during sys-usb start
I could install on the MSI Tomahawk X670E with BT+WiFi disabled in the firmware, without having to disable autostart.
Qubes OS release
R4.2.0-rc1 + Ryzen 9 7950X + Gigabyte X670E motherboard
Brief summary
Installation proceeds normally till just after "Configure networking" when hardware resets. Further system boots reset just after entering disk password.
Steps to reproduce
Run a default installation of R4.2.0-rc1.
Expected behavior
No hardware resets.
Actual behavior
As noted.
Problem appears to be caused by a single USB controller being mapped into sys-usb. There are 5 USB controllers on the CPU and 670 chipset, only one causes a problem. It is the last one in the devices list, address 37:00.0.
Workaround is to add the qubes.skip_autostart option to the Linux kernel boot parameters at any boot after installation, then unmap this controller from sys-usb once the system is up.
I suspect that it is the on-CPU controller used for the mouse and keyboard, as others running different VM systems on the same CPU have had hardware resets when mapping USB controllers with devices in active use.
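For the unmapping step in the workaround, the address shown by lspci has to be rewritten into the form that Qubes device tools expect. The sketch below assumes the R4.x convention of `dom0:BB_DD.F` (as listed by `qvm-pci`); the helper names `to_qubes_ident` and `detach_command` are invented for this example, and the command is only printed, not run.

```python
import re

# Accepts "37:00.0" or "0000:37:00.0" (optional PCI domain prefix).
_BDF = re.compile(
    r"^(?:[0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def to_qubes_ident(bdf, backend="dom0"):
    """Convert an lspci-style address to the assumed Qubes form, e.g. dom0:37_00.0."""
    match = _BDF.match(bdf.strip())
    if not match:
        raise ValueError(f"not a PCI address: {bdf!r}")
    bus, dev, fn = match.groups()
    return f"{backend}:{bus.lower()}_{dev.lower()}.{fn}"


def detach_command(bdf, qube="sys-usb"):
    """Build (but do not execute) the qvm-pci call that removes the controller."""
    return ["qvm-pci", "detach", qube, to_qubes_ident(bdf)]


# Address from the report above; prints the command to run in dom0.
print(" ".join(detach_command("37:00.0")))
```

It is worth comparing the printed identifier against the output of `qvm-pci` in dom0 before running anything, since device numbering was seen to change between boots in this thread.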