QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

S0ix suspend issues on Novacustom V5xx #9372

Open marmarek opened 4 months ago

marmarek commented 4 months ago


Qubes OS release

R4.2 / R4.3

Brief summary

Using S0ix results in a broken system after resume.

Steps to reproduce

  1. Enable S0ix according to https://github.com/QubesOS/qubes-issues/issues/6411
  2. Suspend the system

Expected behavior

System correctly suspends, and is fully functional after resume

Actual behavior

Power LED blinks, but according to /sys/kernel/debug/pmc_core/substate_residencies the system didn't actually suspend. /sys/kernel/debug/pmc_core/substate_requirements also has an empty "status" column next to all requirements.
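
A minimal sketch of how the residency check could be automated: snapshot the counters before suspend and after resume, then see whether any of them grew. It assumes the file is a header row followed by "name counter" rows; the sample text below is hypothetical and was not captured on the affected machine.

```python
from pathlib import Path

RESIDENCY_FILE = Path("/sys/kernel/debug/pmc_core/substate_residencies")


def parse_residencies(text):
    """Map each substate name to its residency counter.

    Assumes a header row followed by "<name> <integer>" rows; lines that
    do not match that shape (including the header) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def substates_entered(before, after):
    """Return the substates whose counter grew between two snapshots."""
    return sorted(name for name, value in after.items()
                  if value > before.get(name, 0))


# Hypothetical snapshots, not taken from the affected machine.
SAMPLE_BEFORE = "Substate   Residency\nS0i2.0     0\nS0i3.0     0\n"
SAMPLE_AFTER = "Substate   Residency\nS0i2.0     0\nS0i3.0     0\n"

if __name__ == "__main__":
    # On a real system (root in dom0, debugfs mounted), read RESIDENCY_FILE
    # before suspending and again after resume instead of using the samples.
    before = parse_residencies(SAMPLE_BEFORE)
    after = parse_residencies(SAMPLE_AFTER)
    print(substates_entered(before, after) or "no S0ix substate was entered")
```

An empty result corresponds to the situation described here: the LED blinks, but no substate counter moved.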

After resume, the wired network is broken. The sys-net logs show:

sys-net logs

```
[2024-07-22 12:45:30] [ 237.842643] e1000e 0000:00:07.0 ens7: NIC Link is Down
[2024-07-22 12:45:30] [ 237.863042] Freezing user space processes
[2024-07-22 12:45:30] [ 237.864602] Freezing user space processes completed (elapsed 0.001 seconds)
[2024-07-22 12:45:30] [ 237.864626] OOM killer disabled.
[2024-07-22 12:45:30] [ 237.864637] Freezing remaining freezable tasks
[2024-07-22 12:45:30] [ 237.865584] Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
[2024-07-22 12:45:30] [ 237.865607] xen:manage: Using suspend/resume for sleep/wakeup
[2024-07-22 12:45:30] [ 237.868291] e1000e: EEE TX LPI TIMER: 00000011
[2024-07-22 12:46:34] [ 237.935960] xen:grant_table: Grant tables using version 1 layout
[2024-07-22 12:46:34] [ 237.983971] iwlwifi 0000:00:06.0: WRT: Invalid buffer destination
[2024-07-22 12:46:34] [ 238.141605] iwlwifi 0000:00:06.0: Not valid error log pointer 0x0024B5C0 for RT uCode
[2024-07-22 12:46:34] [ 238.141784] iwlwifi 0000:00:06.0: WFPM_UMAC_PD_NOTIFICATION: 0x1f
[2024-07-22 12:46:34] [ 238.141818] iwlwifi 0000:00:06.0: WFPM_LMAC2_PD_NOTIFICATION: 0x1f
[2024-07-22 12:46:34] [ 238.141849] iwlwifi 0000:00:06.0: WFPM_AUTH_KEY_0: 0x80
[2024-07-22 12:46:34] [ 238.141874] iwlwifi 0000:00:06.0: CNVI_SCU_SEQ_DATA_DW9: 0x0
[2024-07-22 12:46:34] [ 238.142487] iwlwifi 0000:00:06.0: RFIm is deactivated, reason = 4
[2024-07-22 12:46:37] [ 240.729199] e1000e 0000:00:07.0 ens7: Failed to disable ULP
[2024-07-22 12:48:46] [ 369.728111] e1000e 0000:00:07.0 ens7: Hardware Error
[2024-07-22 12:48:46] [ 369.728146] e1000e 0000:00:07.0 ens7: Timesync Tx Control register not set as expected
[2024-07-22 12:48:46] [ 369.829179] e1000e 0000:00:07.0: EEE advertisement - unable to acquire PHY
[2024-07-22 12:48:46] [ 369.832451] OOM killer enabled.
[2024-07-22 12:48:46] [ 369.832458] Restarting tasks ... done.
```

After resume, sys-net was semi-frozen for some time (over a minute), and the qubes.SuspendPost service failed (due to a vchan timeout). qvm-run --nogui appears to work, but I'm not 100% sure whether that's only because I tried it later.
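
The "over a minute" stall can be located in the log by looking for the largest jump in the kernel timestamp between consecutive lines. The sketch below does that for four lines copied from the sys-net log above; the regular expression and function name are illustrative choices, not part of any Qubes tooling.

```python
import re

# Matches "[wall-clock] [ kernel-seconds] message" as seen in the sys-net log.
LINE_RE = re.compile(r"^\[(?P<wall>[^\]]+)\] \[\s*(?P<mono>\d+\.\d+)\] (?P<msg>.*)$")

# Four consecutive lines copied from the sys-net log above.
LOG = """\
[2024-07-22 12:46:34] [ 238.142487] iwlwifi 0000:00:06.0: RFIm is deactivated, reason = 4
[2024-07-22 12:46:37] [ 240.729199] e1000e 0000:00:07.0 ens7: Failed to disable ULP
[2024-07-22 12:48:46] [ 369.728111] e1000e 0000:00:07.0 ens7: Hardware Error
[2024-07-22 12:48:46] [ 369.728146] e1000e 0000:00:07.0 ens7: Timesync Tx Control register not set as expected
"""


def largest_gap(log_text):
    """Return (gap_seconds, message_before, message_after) for the largest
    jump in the kernel timestamp between consecutive log lines."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            entries.append((float(match.group("mono")), match.group("msg")))
    best = (0.0, None, None)
    for (t0, msg0), (t1, msg1) in zip(entries, entries[1:]):
        if t1 - t0 > best[0]:
            best = (t1 - t0, msg0, msg1)
    return best


if __name__ == "__main__":
    gap, before, after = largest_gap(LOG)
    print(f"{gap:.3f} s between {before!r} and {after!r}")
```

For these lines the largest gap is about 129 seconds, between "Failed to disable ULP" and "Hardware Error", which points at the e1000e resume path as the place where sys-net was stuck.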

Wireless appears to be functional (at least listing available networks works).

sys-usb appears to be functional.

marmarek commented 4 months ago

Reloading the e1000e module in sys-net does not help.

marmarek commented 4 months ago

Ugh...

drivers/net/ethernet/intel/e1000e/ich8lan.c:

        /* It is not possible to be certain of the current state of ULP
         * so forcibly disable it.
         */
        hw->dev_spec.ich8lan.ulp_state = e1000_ulp_state_unknown;
        ret_val = e1000_disable_ulp_lpt_lp(hw, true);
        if (ret_val)
                e_warn("Failed to disable ULP\n");
...
/**     
 *  e1000_disable_ulp_lpt_lp - unconfigure Ultra Low Power mode for LynxPoint-LP
 *  @hw: pointer to the HW structure
 *  @force: boolean indicating whether or not to force disabling ULP
 *
 *  Un-configure ULP mode when link is up, the system is transitioned from
 *  Sx or the driver is unloaded.  If on a Manageability Engine (ME) enabled
 *  system, poll for an indication from ME that ULP has been un-configured.
 *  If not on an ME enabled system, un-configure the ULP mode by software.
 *      
 *  During nominal operation, this function is called when link is acquired
 *  to disable ULP mode (force=false); otherwise, for example when unloading
 *  the driver or during Sx->S0 transitions, this is called with force=true
 *  to forcibly disable ULP.
 */     
static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, bool force)
{       
...
                if (force) {
                        /* Request ME un-configure ULP mode in the PHY */
                        mac_reg = er32(H2ME);
                        mac_reg &= ~E1000_H2ME_ULP;
                        mac_reg |= E1000_H2ME_ENFORCE_SETTINGS;
                        ew32(H2ME, mac_reg);
                }

But ew32(H2ME, ...) actually writes to a register of the LAN device itself, not to a separate device - here, in BAR0:

#define E1000_H2ME              0x05B50 /* Host to ME */
#define E1000_H2ME_START_DPG    0x00000001      /* indicate the ME of DPG */
#define E1000_H2ME_EXIT_DPG     0x00000002      /* indicate the ME exit DPG */
#define E1000_H2ME_ULP          0x00000800      /* ULP Indication Bit */
#define E1000_H2ME_ENFORCE_SETTINGS     0x00001000      /* Enforce Settings */
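
As a worked example of the force=true branch quoted above, the read-modify-write on H2ME reduces to clearing one bit and setting another. The sketch below only reproduces that arithmetic with the constants from the header; the starting register value is made up, and nothing here touches hardware.

```python
# Bit definitions copied from the e1000e header quoted above.
E1000_H2ME_ULP = 0x00000800
E1000_H2ME_ENFORCE_SETTINGS = 0x00001000


def force_disable_ulp_value(mac_reg):
    """Mirror the force=true read-modify-write: clear ULP, set ENFORCE_SETTINGS.

    Python integers are unbounded, so the result is masked to 32 bits to
    match the width of the register.
    """
    mac_reg &= ~E1000_H2ME_ULP
    mac_reg |= E1000_H2ME_ENFORCE_SETTINGS
    return mac_reg & 0xFFFFFFFF


if __name__ == "__main__":
    # 0x00000801 is a made-up starting value with ULP and START_DPG set.
    print(hex(force_disable_ulp_value(0x00000801)))  # prints 0x1001
```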

It's not clear to me how they communicate, but maybe assigning the device to the VM breaks this communication?

Or maybe it's a more generic problem. When it happens, I see a mismatch in memory decoding (see Mem+ or Mem- in Control, and also [disabled] next to Region 0):

sys-net: lspci -vvs 7.0
00:07.0 Ethernet controller: Intel Corporation Device 550a (rev 20)
    Subsystem: CLEVO/KAPOK Computer Device a743
    Physical Slot: 7
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin D routed to IRQ 47
    Region 0: Memory at f2000000 (32-bit, non-prefetchable) [size=128K]
    Capabilities: [c8] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Kernel modules: e1000e

dom0: lspci -vvs 1f.6
00:1f.6 Ethernet controller: Intel Corporation Device 550a (rev 20)
    DeviceName: Ethernet controller
    Subsystem: CLEVO/KAPOK Computer Device a743
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin D routed to IRQ 21
    Region 0: Memory at b54a0000 (32-bit, non-prefetchable) [disabled] [size=128K]
    Capabilities: [c8] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 00000000fee01458  Data: 0000
    Capabilities: [e0] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: pciback
    Kernel modules: e1000e
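
The mismatch between the two views can be spotted mechanically by comparing the flags on the "Control:" lines. This is a small sketch using the two lines copied from the listings above; the function names are illustrative.

```python
# "Control:" lines copied from the two lspci listings above.
SYS_NET_CONTROL = ("Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- "
                   "ParErr- Stepping- SERR- FastB2B- DisINTx-")
DOM0_CONTROL = ("Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- "
                "ParErr- Stepping- SERR- FastB2B- DisINTx-")


def control_flags(lspci_text):
    """Return {flag: enabled} from the first 'Control:' line of lspci -vv output."""
    for line in lspci_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "Control:":
            return {t[:-1]: t.endswith("+") for t in tokens[1:] if t[-1] in "+-"}
    return {}


def mismatched_flags(guest_text, host_text):
    """List the Control flags on which the two views disagree."""
    guest, host = control_flags(guest_text), control_flags(host_text)
    return sorted(name for name in guest.keys() & host.keys()
                  if guest[name] != host[name])


if __name__ == "__main__":
    print(mismatched_flags(SYS_NET_CONTROL, DOM0_CONTROL))  # prints ['Mem']
```

Only Mem differs: sys-net believes memory decoding is on, while dom0 sees it disabled on the physical device.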

marmarek commented 4 months ago

> Or maybe it's a more generic problem. When it happens, I see a mismatch in memory decoding (see Mem+ or Mem- in Control, and also [disabled] next to Region 0):

That's it: re-enabling memory decoding in dom0 makes the device work again. It's worth checking whether https://github.com/QubesOS/qubes-issues/issues/6411#issuecomment-1970270582 is the same problem. FYI @HW42
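
The comment does not say how memory decoding was re-enabled. One possible way, shown here purely as an assumption and untested on this hardware, is to set the Memory Space Enable bit (bit 1 of the 16-bit Command register at config-space offset 0x04) through the device's sysfs config file in dom0. The helper names are hypothetical; the path uses the dom0 address from the lspci output above.

```python
import struct

PCI_COMMAND = 0x04           # offset of the 16-bit Command register in config space
PCI_COMMAND_MEMORY = 0x0002  # Memory Space Enable ("Mem+" in lspci)

# Standard sysfs location for the dom0 address shown by lspci above.
CONFIG_PATH = "/sys/bus/pci/devices/0000:00:1f.6/config"


def read_command(config):
    """Extract the little-endian Command register from raw config-space bytes."""
    return struct.unpack_from("<H", config, PCI_COMMAND)[0]


def memory_decoding_enabled(config):
    """True if the Memory Space Enable bit is set in the given config bytes."""
    return bool(read_command(config) & PCI_COMMAND_MEMORY)


def enable_memory_decoding(path=CONFIG_PATH):
    """Set Memory Space Enable if it is clear; return True if a write happened.

    Needs root in dom0. Untested on the affected hardware.
    """
    with open(path, "r+b", buffering=0) as cfg:
        header = cfg.read(64)
        if memory_decoding_enabled(header):
            return False
        cfg.seek(PCI_COMMAND)
        cfg.write(struct.pack("<H", read_command(header) | PCI_COMMAND_MEMORY))
        return True


if __name__ == "__main__":
    # Dry run on a fabricated header: all zeros except vendor/device IDs.
    fake = bytearray(64)
    struct.pack_into("<HH", fake, 0, 0x8086, 0x550A)
    print("Mem+" if memory_decoding_enabled(fake) else "Mem-")  # prints Mem-
```

The dry run never touches the real device; calling enable_memory_decoding() is the only part that writes to config space.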

wessel-novacustom commented 2 months ago

Is S3 working fine? If so, is any post-installation step needed?

marmarek commented 2 months ago

S3 works fine and should be active by default on the V5xx series; no manual steps are required. I'm keeping this issue open because I would like to make S0ix work too at some point, but that shouldn't affect users.

wessel-novacustom commented 2 months ago

> S3 works fine and should be active by default on the V5xx series; no manual steps are required. I'm keeping this issue open because I would like to make S0ix work too at some point, but that shouldn't affect users.

I'm positively surprised about that. Great!