QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Hardware reset during installation and boot of R4.2 on Ryzen 9 7950X #8322

Open Eric678 opened 1 year ago

Eric678 commented 1 year ago

Qubes OS release

R4.2.0-rc1 + Ryzen 9 7950X + Gigabyte X670E motherboard

Brief summary

Installation proceeds normally until just after "Configure networking", when the hardware resets. Subsequent boots reset just after the disk password is entered.

Steps to reproduce

Run a default installation of R4.2.0-rc1.

Expected behavior

No hardware resets.

Actual behavior

As noted.

The problem appears to be caused by a single USB controller being mapped into sys-usb. There are 5 USB controllers on the CPU and 670 chipset; only one causes a problem. It is the last one in the devices list, address 37:00.0.

The workaround is to add the qubes.skip_autostart option to the Linux kernel boot parameters at any boot after installation, then unmap this controller from sys-usb once the system is up.

I suspect that it is the on-CPU controller used for the mouse and keyboard, as others running different VM systems on the same CPU have had problems where mapping in-use USB devices causes a hardware reset.
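
The unmapping step of the workaround can be sketched as follows. This is a minimal sketch, assuming the qvm-pci detach subcommand of R4.2 and the dom0:BB_DD.F device naming that qvm-pci list prints; the helper names qvm_pci_ident and detach_command are hypothetical, and 37:00.0 is the address on the board in this report. The sketch only builds the command so that it can be reviewed before being run in dom0; subcommand names may differ between releases, so check qvm-pci --help first.

```python
import re

# Matches lspci-style addresses: optional 4-digit domain, then bus:device.function.
_BDF = re.compile(
    r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$", re.IGNORECASE
)


def qvm_pci_ident(bdf, backend="dom0"):
    """Turn '37:00.0' (lspci notation) into 'dom0:37_00.0' (qvm-pci notation)."""
    match = _BDF.match(bdf.strip())
    if match is None:
        raise ValueError(f"not a PCI bus:device.function address: {bdf!r}")
    bus, device, function = match.groups()
    return f"{backend}:{bus.lower()}_{device.lower()}.{function}"


def detach_command(vm, bdf):
    """Build (but do not run) the command that removes a controller from a qube."""
    return ["qvm-pci", "detach", vm, qvm_pci_ident(bdf)]


if __name__ == "__main__":
    # Address taken from the report above; substitute the one on your system.
    print(" ".join(detach_command("sys-usb", "37:00.0")))
```

Building the command as a list rather than a string keeps the sketch usable with subprocess.run without any shell quoting.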

DemiMarie commented 1 year ago

I suspect this will need a hardware quirk in the installer.

Eric678 commented 1 year ago

A simpler workaround turns out to be leaving IOMMU disabled during installation (the above MB defaults to auto and does not know about Qubes). Installation then exits seconds before the point where the hardware reset would occur, with a missing IOMMU error starting sys-firewall (presumably it was actually sys-net). Installation exits cleanly and one can immediately log in and remove the USB controller from sys-usb. I have no idea what I am missing out on in the install; this technically invalidates all further testing of R4.2. It does seem to work rather well actually...

Eric678 commented 1 year ago

Quick check on rc3: still there. However, a clean install can be made by adding the "qubes.skip_autostart" option to vmlinuz on the 2nd pass of installation. The installer does take notice; oddly, sys-usb is not started but sys-firewall and sys-net are, which is probably a bug. Just take the last USB controller out of sys-usb, start it, and proceed as normal. The only problem I am having with rc3 is USB devices being a bit flaky, which may be related to whatever this problem is.

DemiMarie commented 1 year ago

How would one add the needed quirk to Anaconda?

0spinboson commented 1 year ago

Is there a phase during installation where the installer boots sys-usb after assigning all usb devices to it?

marmarek commented 1 year ago

> How would one add the needed quirk to Anaconda?

I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while the user would have the impression that they are all isolated in sys-usb (since that was selected during install). The proper solution is, of course, to make it not crash. But as a workaround, the user can choose not to create sys-usb during install, and later create it by hand and remove the device from it. This way they will know that some device is excluded, and there is no risk of leaving it in dom0 without the user's knowledge. Such instructions should also explain the risk.

But if the device really should stay in dom0, not as a workaround for a crash but as the intended behavior, then we have a mechanism for that: the rd.qubes.dom0_usb=37:00.0 (example value) kernel option. It will leave this controller in dom0, and salt will also respect this setting when creating sys-usb. It can be added to the kernel command line at the start of installation in the GRUB menu (Anaconda will carry the kernel option over to the final system too), or maybe somewhere within Anaconda automatically (though I'm very much not convinced that is the right thing to do).
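
As a rough illustration of the edit described above (appending the option to the kernel line), the following sketch adds a key=value option to a command-line string and replaces any earlier setting of the same key. The function name add_kernel_option and the existing command line are hypothetical; only the option rd.qubes.dom0_usb=37:00.0 comes from this thread, and 37:00.0 is an example value.

```python
def add_kernel_option(cmdline, option):
    """Return cmdline with option appended; an earlier setting of the same key is dropped."""
    key = option.split("=", 1)[0]
    kept = [tok for tok in cmdline.split() if tok.split("=", 1)[0] != key]
    return " ".join(kept + [option])


if __name__ == "__main__":
    # Hypothetical existing command line; the real one differs per install.
    existing = "ro rhgb quiet"
    print(add_kernel_option(existing, "rd.qubes.dom0_usb=37:00.0"))
```

Dropping an earlier setting of the same key means that applying the edit twice gives the same result as applying it once.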

DemiMarie commented 1 year ago

Has this been reported to Gigabyte? I wonder if SMM is getting an interrupt it did not expect to get and crashes as a result.

DemiMarie commented 1 year ago

> > How would one add the needed quirk to Anaconda?

> I don't think it's the right thing to do, at least with the current info here. It would potentially leave dom0 exposed to some USB devices, while user would have impression they are all isolated in sys-usb (since that was selected during install). The proper solution is ofc make it not crash. But as a workaround user can choose to not create sys-usb during install, and later create it by hand and remove the device from there. This way they will know some device is excluded and there is no risk of leaving it in dom0 without user knowledge. Such instruction should also explain the risk.

What if the device was attached to nothing? Don’t assign it to sys-usb, but don’t assign it to any other qube (including dom0) either. Assign it to Xen’s quarantine domain. That might avoid the crash without the security consequences.

Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.

marmarek commented 1 year ago

> Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.

That's highly unlikely. A much more likely cause is a dom0 or Xen panic...

And still, I don't want to waste time on elaborate workarounds (there are already a few simple ones in this thread) until we know for sure that a proper fix is not achievable.

DemiMarie commented 1 year ago

> > Alternatively, what if Linux is told to not reset the device? I wonder if Linux sees that a PM reset is available, but that PM reset winds up resetting the whole system.

> That's highly unlikely. A much more likely cause is either dom0 or xen panic...

> And still, I don't want wasting time on elaborate workarounds (there are already a few simple ones in this thread), until we know for sure proper fix is not achievable.

Is “assign to quarantine domain” simple or elaborate?

brxken128 commented 1 year ago

This is reproducible on my 7950X with an Asus Strix X670E-F, so I don't think it's Gigabyte-specific. I also have a 7900XTX, which may not be helping things.

Tehvan commented 1 year ago

Also happens to me on 7950X with Asrock X670E Steel Legend. I have two USB controllers that cause a reboot -- 16:00.4 and 17:00.0
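
To find the candidate controllers on a given board, the USB controller lines can be picked out of lspci output. This is a minimal sketch, assuming the default one-line-per-device lspci format (address, device class, colon, description); the function name usb_controllers is hypothetical, the first sample line is a made-up non-USB device, and the second is the header line from the lspci dump posted further down in this thread.

```python
import re

# Start of an lspci device line: [domain:]bus:device.function, class name, colon, description.
_LINE = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+([^:]+):\s*(.*)$",
    re.IGNORECASE,
)

SAMPLE = """\
00:00.0 Host bridge: Example host bridge (hypothetical line)
12:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8 (prog-if 30 [XHCI])
"""


def usb_controllers(lspci_text):
    """Return (address, description) pairs for lines whose class is 'USB controller'."""
    found = []
    for line in lspci_text.splitlines():
        match = _LINE.match(line)
        if match and match.group(2).strip() == "USB controller":
            found.append((match.group(1), match.group(3).strip()))
    return found


if __name__ == "__main__":
    for address, description in usb_controllers(SAMPLE):
        print(address, description)
```

Indented detail lines from lspci -vv are skipped because they do not start with an address, so the same function works on verbose output.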

neowutran commented 1 year ago

I have the same issue with my Asus Strix X670E-F. I have one "USB controller" that always causes a reboot: 12:00.0

However, I am not sure what it really is. I tried every USB port on my setup, and everything works without this "USB controller".

(I have two unused internal USB 2.0 ports on my motherboard. I have one USB controller that I can pass through in Qubes OS, but this controller never receives any USB device; I suspect it is the USB controller for my two unused internal USB 2.0 ports.)

For people having this issue: are you missing any USB port or functionality without the "USB controller" that you cannot pass through?

Result of "sudo lspci -vvs 12:00.0"

12:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8 (prog-if 30 [XHCI])
    Subsystem: ASUSTeK Computer Inc. Device 8877
    Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 46
    Region 0: Memory at fc000000 (64-bit, non-prefetchable) [size=1M]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
    Capabilities: [64] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s, Width x16
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
        Vector table: BAR=0 offset=000fe000
        PBA: BAR=0 offset=000ff000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [270 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [2a0 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [450 v1] Lane Margining at the Receiver <?>
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci

The uncommon lines in this:

Tehvan commented 1 year ago

On mine, 17:00.0 is the motherboard LED controller. But since there is no problem when not using sys-usb, it should be a passthrough problem (i.e. IOMMU groups), right?

0spinboson commented 1 year ago

iommu groups or soft reset
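
For the soft reset side of this, two flags in the lspci dump posted above are relevant. FLReset in the DevCap line says whether the function advertises Function Level Reset, and NoSoftRst in the Power Management status line says whether the device keeps its state across a D3hot to D0 transition; NoSoftRst- means it is expected to reset internally on that transition, which is what a power-management reset relies on. The sketch below reads both flags from lspci -vv text. The function name reset_flags is hypothetical, and the flags only show what the device advertises, not which reset method the kernel or Xen actually uses.

```python
import re

# Two lines copied from the lspci output posted above for 12:00.0.
SAMPLE = """\
        Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
"""


def reset_flags(lspci_text):
    """Return a dict of the FLReset / NoSoftRst flags found in lspci -vv text."""
    flags = {}
    for name in ("FLReset", "NoSoftRst"):
        # Only the first occurrence of each flag is used.
        match = re.search(rf"\b{name}([+-])", lspci_text)
        if match:
            flags[name] = match.group(1) == "+"
    return flags


if __name__ == "__main__":
    for name, is_set in reset_flags(SAMPLE).items():
        print(f"{name}: {'+' if is_set else '-'}")
```

For the controller at 12:00.0 both flags are cleared, i.e. no Function Level Reset is advertised while a power-management reset looks possible on paper.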

Eric678 commented 1 year ago

4.2-rc4 with 6.5.6: still there. The behavior is different: after a normal install, I left the machine for the 2nd pass, and when I returned much later it was shut down. Bringing it up with qubes.skip_autostart, there were 3 USB controllers in sys-usb that were unknown, and all had to be removed for it to start. I am guessing not everything made it to disk before the reset. On the 2nd try, with qubes.skip_autostart on the 2nd pass, it completed the Anaconda progress bar, dropped back to the console, finished systemd-tmpfiles-clean.service, then stuck at "Job initial-setup.service/start running" for a couple of hours before I reset the machine. I took the last USB controller out of sys-usb and all seemed OK. All USB ports appear to be working (13 exposed on the outside of the motherboard, including mouse and keyboard, plus 1 I am using on the motherboard internally). There is definitely a problem with writing to USB storage devices, which I will post separately.

[ed] While writing up that issue I had a different event: an instant power off while typing here. I had been doing various tests on USB ports and had left a storage device plugged into one of the controllers on the 670 chipset. On trying to boot, I got the same power off after entering the disk password. Suspecting sys-usb, I took a couple more devices out and could then get up and running. I then noticed the USB drive on the back panel, removed it, and could put those devices back in sys-usb and boot OK. So it looks like all it takes to cause a reset or power off on start is for a device to be plugged into a port that is mapped to sys-usb. I did plug the mouse and keyboard into the only 2 ports that are USB 2.0/1.1, which are on a USB 2.0 hub directly on the CPU, hence my original suspicion.

Eric678 commented 11 months ago

rc5-latest test did not get very far: debian-12-xfce: qubes.PostInstall service failed. See attached. No other reports? Media OK. Installing encrypted on SATA SSD while another copy (current stable) encrypted on different drive. This worked above for rc4. 20231203

Eric678 commented 10 months ago

4.2.0 with 6.6.2 did not have the above installation problem. I still get a power off starting sys-usb if the last USB device is mapped. I am not getting the power off/reset if a storage device is plugged into another controller when sys-usb is started; however, sys-usb does go into a loop of "device available" and "device removed" notifications every second, which is cleared by removing the storage device. Note that sys-net and sys-firewall are autostarted even if qubes.skip_autostart is passed to the kernel.

krystian-hebel commented 10 months ago

I can see the same on a Supermicro M11SDV-4C-LN4F. Here's the serial log from an attempted boot that resulted in a hard restart: xen.log

No panic, nothing unexpected in the last lines. I'm not sure why the first lines (5th and 6th) look as they do. I had an issue with another Supermicro board (X11-something) where the output was heavily modified by the BMC (lines printed out of order with heavy jumping with ANSI escape codes, \n without \r or \n after each character depending on BIOS settings, etc.), but here everything seems to work reliably, except those two lines.

I can start the OS with qubes.skip_autostart, and sys-usb starts only with the USB controller disabled. Unfortunately, this platform has just one controller and I'll most likely need it at some point.

mahakal commented 1 month ago

The same issue applies to Legion 5 Pro: The USB controller has its own IOMMU group without any other device. You can check the IOMMU grouping in the Legion 5 Pro HCL.
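
To check the grouping on other boards, the groups can be read from sysfs. This is a minimal sketch, assuming a Linux boot where the kernel itself manages the IOMMU and populates /sys/kernel/iommu_groups (for example a live system); under Xen the directory may be empty in dom0. The function names iommu_groups and is_isolated are hypothetical, and 0000:37:00.0 is only the example address from earlier in this thread.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # no IOMMU information exposed here
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def is_isolated(groups, address):
    """True if the device is the only member of its IOMMU group."""
    return any(devices == [address] for devices in groups.values())


if __name__ == "__main__":
    found = iommu_groups()
    for number in sorted(found):
        print(number, " ".join(found[number]))
    # Example address from this thread, in the domain-prefixed form sysfs uses.
    print("isolated:", is_isolated(found, "0000:37:00.0"))
```

A controller that sits alone in its group, as reported for the Legion 5 Pro, suggests that grouping by itself does not explain the reset.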

Eric678 commented 1 month ago

Just confirming with a quick test of a 4.2.3 installation with the latest kernel: the problem is still present. Passing qubes.skip_autostart allows the 2nd-pass install to complete (4.2.2 hung on the console after Anaconda). sys-net and sys-firewall still start when they should not. Once up and running, moving the last USB controller back into sys-usb caused a device not found error: two of the other controllers somehow changed their device numbers while sys-usb was restarted. Moving the updated devices in then caused the hardware reset on starting sys-usb.

DemiMarie commented 1 month ago

There are multiple problems here:

inao-cz commented 1 week ago

Can confirm this issue is present on the latest Qubes OS release, 4.2.3, with the latest kernel on this system: MSI Tomahawk X670E WIFI, AMD Ryzen 9 7950X

Disabling autostart was required for a successful installation; otherwise the system would hang during sys-usb start.

renehoj commented 1 week ago

I could install on the MSI Tomahawk X670E with BT+WiFi disabled in the firmware, without having to disable autostart.