QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Suspend to RAM randomly crashes VM in R4.1 #7283

Closed logoerthiner1 closed 2 years ago

logoerthiner1 commented 2 years ago


I installed R4.1 on my new ThinkPad L15 Gen 2 and figured out how to enable the "Linux S3" sleep option. With it enabled, S3 sleep works on Windows, but on Qubes OS it runs into many problems (although I was surprised to see that suspend and wake-up partially work at all).

Basically, suspend (1) immediately panics my sys-net because of an MT7921 driver NULL pointer dereference in kernel-latest (5.15.14-1.fc32), (2) randomly kills some of my running AppVMs, and (3) causes a long lag on wake-up (~15 s from pressing the power button to the xscreensaver screen, with the power button light blinking as if the machine were still suspending).

I would like to try handling (1) myself, and (3) does not seem easy to figure out, so this issue focuses on (2).

It might be related to #6411 since my CPU is i5-1135G7.

Qubes OS release

R4.1

Brief summary

Suspend randomly shuts down running VMs; Xen reports:

d?v0 Unexpected vmexit: reason 3
domain_crash called from vmx.c:4304

Steps to reproduce

Suspend, and wake up

Expected behavior

After waking up, all VMs that were running before suspension are still running.

Actual behavior

Some random VMs are shut down. It seems that the VMs are shut down on wake-up, but I cannot rule out that the crash happens while suspending.

Log files for reference

https://pastebin.mozilla.org/84hYCS5v

Edit

I tried disabling Hyper-Threading; it only cuts the lag in (3) in half, and VMs still exit randomly.

logoerthiner1 commented 2 years ago

Here is an update on what I have found.

I no-op'ed the internal.SuspendPost logic (or something like it, in /usr/lib/python3.8/qubes/api/internal.py) and most of internal.SuspendPre (keeping only the suspend part and removing the SuspendPre message broadcast), so that I could pause and unpause the VMs manually before and after suspension. Then I suspended and woke up the machine.

The result: waking up still takes too long, and after waking up (with all VMs paused except dom0), the hypervisor log in dom0 contained entries like:

[2022-02-17 20:11:38] (XEN) Enabling non-boot CPUs ...
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU1 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU2 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU3 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU4 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU5 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU6 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU7 up: -5

Then, when I manually unpaused one VM, the first VM I unpaused crashed in Xen. The hypervisor log looks like:

[2022-02-17 20:11:39] (XEN) d1v0 Unexpected vmexit: reason 3
[2022-02-17 20:11:39] (XEN) domain_crash called from vmx.c:4304
[2022-02-17 20:11:39] (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
[2022-02-17 20:11:39] (XEN) ----[ Xen-4.14.3 x86_64 debug=n Not tainted ]----
[2022-02-17 20:11:39] (XEN) CPU: 0
[2022-02-17 20:11:39] (XEN) RIP: 0010:[<ffffffff890023a8>]
[2022-02-17 20:11:39] (XEN) RFLAGS: 0000000000000002 CONTEXT: hvm guest (d1v0)
[2022-02-17 20:11:39] (XEN) rax: 0000000000000001 rbx: ffffa55ac015be74 rcx: 00000000ffffffff
[2022-02-17 20:11:39] (XEN) rdx: 0000000000000000 rsi: ffffa55ac008fe44 rdi: 0000000000000002
[2022-02-17 20:11:39] (XEN) rbp: ffffa55ac015bdf4 rsp: ffffa55ac008fe38 r8: 00000897ac38a9f8
[2022-02-17 20:11:39] (XEN) r9: 0000000000000002 r10: 0000000000000000 r11: 0000000000000000
[2022-02-17 20:11:39] (XEN) r12: 0000000000000000 r13: 0000000000000000 r14: 0000000000000000
[2022-02-17 20:11:39] (XEN) r15: 0000000000000003 cr0: 0000000080050033 cr4: 0000000000770ef0
[2022-02-17 20:11:39] (XEN) cr3: 0000000002748001 cr2: 00006202870c6070
[2022-02-17 20:11:39] (XEN) fsb: 0000000000000000 gsb: ffff94552f600000 gss: 0000000000000000
[2022-02-17 20:11:39] (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0018 cs: 0010

Pausing and unpausing a VM is not problematic at all when S3 sleep is not involved. Therefore I believe that Xen has some problem dealing with Linux S3 sleep. Error bringing CPU1 up: -5 and d1v0 Unexpected vmexit: reason 3 indicate two problems in Xen's sleep logic.

I skimmed the Xen 4.14.4 source code and found that:

-5 == -EIO
3 == EXIT_REASON_INIT

I still have no clue why the first VM resumed gets killed for a vmexit with EXIT_REASON_INIT. The EIO part, however, seems to originate from xen/arch/x86/smpboot.c:609:

        if ( cpu_state == CPU_STATE_CALLIN )
        {
            /* number CPUs logically, starting from 1 (BSP is 0) */
            Dprintk("OK.\n");
            print_cpu_info(cpu);
            synchronize_tsc_master(cpu);
            Dprintk("CPU has booted.\n");
        }
        else if ( cpu_state == CPU_STATE_DEAD )
        {
            smp_rmb();
            rc = cpu_error;
        }
        else
        {
            boot_error = 1;
            smp_mb();
            if ( bootsym(trampoline_cpu_started) == 0xA5 )
                /* trampoline started but...? */
                printk("Stuck ??\n"); // <= THIS LINE
            else
                /* trampoline code not run */
                printk("Not responding.\n");
        }

So the other CPUs either timed out or did not correctly set their CPU state to CPU_STATE_CALLIN.
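The two codes discussed above can be pulled out of a hypervisor log mechanically. The following is only a minimal sketch and not part of any Xen or Qubes tooling: the function name `decode_xen_log` is made up for illustration, the exit-reason table lists just the first few basic VMX exit reasons, and the errno lookup assumes that Xen's -5 corresponds to the usual `EIO` value, as the source reading above suggests.

```python
import errno
import re

# First few basic VMX exit reasons; reason 3 corresponds to
# EXIT_REASON_INIT as noted above. Deliberately incomplete.
VMX_EXIT_REASONS = {
    0: "EXCEPTION_NMI",
    1: "EXTERNAL_INTERRUPT",
    2: "TRIPLE_FAULT",
    3: "INIT",
    4: "SIPI",
}

CPU_UP_RE = re.compile(r"Error bringing CPU(\d+) up: (-?\d+)")
VMEXIT_RE = re.compile(r"(d\d+v\d+) Unexpected vmexit: reason (\d+)")


def decode_xen_log(lines):
    """Return readable findings for the two error patterns seen in this issue."""
    findings = []
    for line in lines:
        match = CPU_UP_RE.search(line)
        if match:
            cpu, rc = int(match.group(1)), int(match.group(2))
            # Assumption: the negative return code is a standard errno value.
            name = errno.errorcode.get(abs(rc), "unknown")
            findings.append(f"CPU{cpu} failed to come up: -{name}")
            continue
        match = VMEXIT_RE.search(line)
        if match:
            vcpu, reason = match.group(1), int(match.group(2))
            name = VMX_EXIT_REASONS.get(reason, "unknown")
            findings.append(f"{vcpu} crashed on vmexit reason {reason} ({name})")
    return findings
```

Feeding it the lines from hypervisor.log (for example via `open(path).read().splitlines()`) gives one short finding per matching line; anything else in the log is ignored.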

It may be better to let a Xen expert diagnose this further.

Again, my CPU is i5-1135G7 and I enabled Linux S3 in my BIOS.

andyhhp commented 2 years ago

~The line of -EIO's to begin with means the APs aren't responding to an INIT-SIPI-SIPI sequence to start them up. Whatever the firmware has done (or not done), they're not in a working state.~ Edit: Not true, it turns out.

The VM "crashing" actually has nothing to do with the VM, or Xen really. VMEXIT_REASON_INIT means "an INIT IPI has arrived on this CPU at some point since your last VMEntry".

Do you have TXT enabled in firmware? The purpose of the VMEXIT_REASON_INIT comes from TXT and means "scrub your secrets from RAM, then reset". It's not supported by Xen, and if it were, you'd have a (hopefully clean) restart, rather than a resume. Also, in TXT, the rules for booting APs change.

logoerthiner1 commented 2 years ago

> The line of -EIO's to begin with means the APs aren't responding to an INIT-SIPI-SIPI sequence to start them up. Whatever the firmware has done (or not done), they're not in a working state.
>
> The VM "crashing" actually has nothing to do with the VM, or Xen really. VMEXIT_REASON_INIT means "an INIT IPI has arrived on this CPU at some point since your last VMEntry".
>
> Do you have TXT enabled in firmware? The purpose of the VMEXIT_REASON_INIT comes from TXT and means "scrub your secrets from RAM, then reset". It's not supported by Xen, and if it were, you'd have a (hopefully clean) restart, rather than a resume. Also, in TXT, the rules for booting APs change.

Finally, a Xen expert! Although I do not completely understand the details, I tried disabling Intel PTT in the BIOS and things changed: on wake-up, the power button light blinks for a shorter time (the time spent attempting to wake the other CPU cores), and then the computer immediately hangs (power light on, screen black, HDD stopped, no response).

The stuck time is reduced.

Waking up does not write to the log file at all. I misunderstood the logs before.

(I will try once more later)

I tried a second time, and the computer bricked. It seems as if Xen had overwritten my boot procedure with no-ops. Unplugging the power cable brought my computer back to normal.

Anyway, it seems that after I disabled Intel PTT in my BIOS, suspend-to-RAM broke completely, since the computer does not even wake up into Qubes OS.

I believe that when I attempt to wake up my computer, Xen first tries to wake up CPU1, then hits something unexpected and reboots, which leaves the computer in a very dangerous state.

xenixxx commented 2 years ago

Try removing xscreensaver from dom0; if that works, the problem is the locker. On my x230 and X1 Carbon 5th gen, the problem went away once I installed light-locker, removed the xscreensaver package, and set general/LockCommand to "dm-tool switch-to-greeter".

logoerthiner1 commented 2 years ago

> Try removing xscreensaver from dom0; if that works, the problem is the locker. On my x230 and X1 Carbon 5th gen, the problem went away once I installed light-locker, removed the xscreensaver package, and set general/LockCommand to "dm-tool switch-to-greeter".

I would guess that Ctrl-Alt-F2 should always work if the problem were limited to xscreensaver. I have tried Ctrl-Alt-F2 and it does not work at all.

xenixxx commented 2 years ago

No, you can't switch to the consoles because xscreensaver blocks this. Remove xscreensaver and check. Then install another locker. I have been testing light-locker on an x230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).

logoerthiner1 commented 2 years ago

> No, you can't switch to the consoles because xscreensaver blocks this. Remove xscreensaver and check. Then install another locker. I have been testing light-locker on an x230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).

I do not think this would work, since I can hear the hardware resetting when I attempt to wake up; even a malfunctioning screen saver does not reset the machine and its hardware. By the way, are you talking about R4.1?

I may try changing the screen saver once I have run out of other options, since I think that REMOVING packages in dom0 without proper planning is dangerous.

xenixxx commented 2 years ago

> > No, you can't switch to the consoles because xscreensaver blocks this. Remove xscreensaver and check. Then install another locker. I have been testing light-locker on an x230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).
>
> I do not think this would work, since I can hear the hardware resetting when I attempt to wake up; even a malfunctioning screen saver does not reset the machine and its hardware. By the way, are you talking about R4.1?
>
> I may try changing the screen saver once I have run out of other options, since I think that REMOVING packages in dom0 without proper planning is dangerous.

This error occurs in versions 4.0 and 4.1. I only found one solution: remove xscreensaver from dom0, install light-locker as the lock program, and set the lock command in xfce4. The worst part is the randomness of this error. The laptop wakes up fine for a week, then freezes and doesn't come back. After that, there was no problem anymore. It definitely occurs on Lenovo laptops; I have not checked others. It does not matter whether the firmware is coreboot (x230) or the Lenovo BIOS (X1 Carbon). The hardware may go through a random number of sleep/wake cycles, and one could be sure it would crash at some point. After changing to light-locker, everything has been working steadily for a year.

xenixxx commented 2 years ago

> I may try changing the screen saver once I have run out of other options, since I think that REMOVING packages in dom0 without proper planning is dangerous.

I don't know what is dangerous in your opinion. I do not see a security issue in removing the xscreensaver package from qubes.

logoerthiner1 commented 2 years ago

> > I may try changing the screen saver once I have run out of other options, since I think that REMOVING packages in dom0 without proper planning is dangerous.
>
> I don't know what is dangerous in your opinion. I do not see a security issue in removing the xscreensaver package from qubes.

Removing any package in dom0 is a dangerous action, since dom0 may no longer boot (imagine you accidentally uninstalled libc; Linux distributions install packages reliably, but uninstalling is less tested and may lead to problems). I would rather set the lock screen command to empty in order to find out whether the screen locker is the culprit.

xenixxx commented 2 years ago

> > > I may try changing the screen saver once I have run out of other options, since I think that REMOVING packages in dom0 without proper planning is dangerous.
> >
> > I don't know what is dangerous in your opinion. I do not see a security issue in removing the xscreensaver package from qubes.
>
> Removing any package in dom0 is a dangerous action, since dom0 may no longer boot (imagine you accidentally uninstalled libc; Linux distributions install packages reliably, but uninstalling is less tested and may lead to problems). I would rather set the lock screen command to empty in order to find out whether the screen locker is the culprit.

It's funny. The xscreensaver package cannot and will not conflict in dom0. Sure, you can look for a solution. Run xscreensaver -v -log /home/user/diaglog.txt and you might be able to diagnose it exactly. In my case, looking for the problem over a few weekends did not solve it.

andyhhp commented 2 years ago

@logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake and Intel no longer support S3 on Tigerlake, so this "Linux S3" option is probably an OEM addition.

Also, you said that Windows is fine with this. What about plain Linux?

DemiMarie commented 2 years ago

> @logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake and Intel no longer support S3 on Tigerlake, so this "Linux S3" option is probably an OEM addition.

Does this mean this is blocked on S0ix support in Xen? Is there any chance that Xen will support S0ix in combination with PCIe pass-through with not-fully-trusted guests?

logoerthiner1 commented 2 years ago

> @logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake and Intel no longer support S3 on Tigerlake, so this "Linux S3" option is probably an OEM addition.

The device I am talking about is a ThinkPad L15 Gen 2 (this should be visible in the debug info anyway).

I have updated the BIOS, and the updated BIOS provides a "Linux S3" option. This is the BIOS update documentation: https://download.lenovo.com/pccbbs/mobiles/r1juj10w.txt , where "Linux S3" is mentioned. Once I enabled "Linux S3", both Windows and Qubes OS are able to suspend as expected (the fan stops while suspended), but only Windows wakes up correctly.

Personally I HATE S0ix, like many other people: how is S0ix different from just locking the screen, pausing every application, and turning off the monitor? (Actually, in Qubes OS I find that I can set the screen brightness to absolute zero, which may be deemed either a bug or a feature.)

> Also, you said that Windows is fine with this. What about plain Linux?

I have not installed plain Linux there... If this information is important, I may try Ubuntu boot media (for example, dd an Ubuntu 21.10 installer onto a USB drive) or something similar. Any other thoughts on testing or debugging?

It may also be useful to test whether Windows still works when I disable Intel PTT. I will test it later.

marmarek commented 2 years ago

> I have not installed plain Linux there... If this information is important, I may try Ubuntu boot media (for example, dd an Ubuntu 21.10 installer onto a USB drive) or something similar. Any other thoughts on testing or debugging?

You can try plain Linux by... booting Qubes without Xen - in grub drop multiboot2 line, then replace subsequent two module2 lines with linux and initrd. Obviously no VM would start, but it should allow you testing S3 with plain Linux.
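The edit described above can be sketched as a text transformation on the lines of a boot entry. This is only an illustration under stated assumptions: the function name and the sample paths in the usage note are hypothetical, and `--nounzip` is dropped on the assumption that it may appear on generated module2 lines and is not wanted on an initrd line. In practice the change is made by hand in the GRUB editor, not by a script.

```python
def xen_entry_to_plain_linux(entry_lines):
    """Drop the multiboot2 (Xen) line and turn the first module2 line into
    `linux` and the second into `initrd`. Other lines are kept unchanged."""
    replacements = iter(["linux", "initrd"])
    out = []
    for line in entry_lines:
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if stripped.startswith("multiboot2"):
            continue  # removes the hypervisor from the boot chain
        if stripped.startswith("module2"):
            keyword = next(replacements, None)
            if keyword is None:
                out.append(line)  # unexpected extra module2 line: leave as-is
                continue
            args = stripped[len("module2"):].split()
            # Assumption: --nounzip is a module2-only option, so drop it.
            args = [arg for arg in args if arg != "--nounzip"]
            out.append(f"{indent}{keyword} {' '.join(args)}")
            continue
        out.append(line)
    return out
```

For a hypothetical entry with `multiboot2 /xen.gz ...`, `module2 /vmlinuz-example root=/dev/example ro` and `module2 --nounzip /initramfs-example.img`, the result keeps only `linux /vmlinuz-example root=/dev/example ro` and `initrd /initramfs-example.img`. Kernel arguments are passed through untouched, so anything Qubes-specific on that line still reaches the kernel.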

marmarek commented 2 years ago

I have a different system that behaves in a very similar way.

> What about plain Linux?

It works here.

```
[ 57.364586] PM: suspend entry (deep)
[ 57.366885] Filesystems sync: 0.002 seconds
[ 57.390458] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 57.391909] OOM killer disabled.
[ 57.391910] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[ 57.392894] printk: Suspending console(s) (use no_console_suspend to debug)
[ 57.827843] PM: suspend devices took 0.435 seconds
[ 57.864540] ACPI: EC: interrupt blocked
[ 57.899168] ACPI: PM: Preparing to enter system sleep state S3
[ 57.900093] ACPI: EC: event blocked
[ 57.900094] ACPI: EC: EC stopped
[ 57.900095] ACPI: PM: Saving platform NVS memory
[ 57.900097] Disabling non-boot CPUs ...
[ 57.901071] IRQ 137: no longer affine to CPU1
[ 57.902105] smpboot: CPU 1 is now offline
[ 57.904231] IRQ 138: no longer affine to CPU2
[ 57.905260] smpboot: CPU 2 is now offline
[ 57.906806] IRQ 139: no longer affine to CPU3
[ 57.907828] smpboot: CPU 3 is now offline
[ 57.909436] IRQ 140: no longer affine to CPU4
[ 57.911189] smpboot: CPU 4 is now offline
[ 57.912457] IRQ 141: no longer affine to CPU5
[ 57.913469] smpboot: CPU 5 is now offline
[ 57.914502] IRQ 142: no longer affine to CPU6
[ 57.916231] smpboot: CPU 6 is now offline
[ 57.917358] IRQ 143: no longer affine to CPU7
[ 57.918371] smpboot: CPU 7 is now offline
[ 57.925596] ACPI: PM: Low-level resume complete
[ 57.925693] ACPI: EC: EC started
[ 57.925693] ACPI: PM: Restoring platform NVS memory
[ 57.926746] Enabling non-boot CPUs ...
[ 57.926802] x86: Booting SMP configuration:
[ 57.926803] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 57.929006] CPU1 is up
[ 57.929033] smpboot: Booting Node 0 Processor 2 APIC 0x2
[ 57.929973] CPU2 is up
[ 57.929993] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 57.930902] CPU3 is up
[ 57.930930] smpboot: Booting Node 0 Processor 4 APIC 0x4
[ 57.932052] CPU4 is up
[ 57.932071] smpboot: Booting Node 0 Processor 5 APIC 0x5
[ 57.933033] CPU5 is up
[ 57.933053] smpboot: Booting Node 0 Processor 6 APIC 0x6
[ 57.934263] CPU6 is up
[ 57.934282] smpboot: Booting Node 0 Processor 7 APIC 0x7
[ 57.935311] CPU7 is up
[ 57.939993] ACPI: PM: Waking up from system sleep state S3
[ 57.942288] ACPI: EC: interrupt unblocked
[ 57.967645] ACPI: EC: event unblocked
[ 57.968000] usb usb1: root hub lost power or was reset
[ 57.968005] usb usb2: root hub lost power or was reset
[ 58.101944] iwlwifi 0000:00:14.3: RF_KILL bit toggled to enable radio.
[ 58.114921] nvme nvme0: Shutdown timeout set to 10 seconds
[ 58.116798] nvme nvme0: 8/0/0 default/read/poll queues
[ 58.199513] usb 3-10: reset full-speed USB device number 4 using xhci_hcd
[ 58.341506] PM: resume devices took 0.374 seconds
[ 58.341801] OOM killer enabled.
[ 58.341803] Restarting tasks ... done.
```

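A quick way to compare a plain-Linux log like the one above with the failing Xen resume is to check that every CPU taken offline also comes back. This is a minimal sketch that assumes the message formats shown in this log; the function name is made up for illustration.

```python
import re

# Message formats as they appear in the dmesg excerpt above.
OFFLINE_RE = re.compile(r"smpboot: CPU (\d+) is now offline")
ONLINE_RE = re.compile(r"CPU(\d+) is up")


def cpus_not_resumed(dmesg_text):
    """Return the sorted CPU numbers that went offline but never came back up."""
    offline = {int(n) for n in OFFLINE_RE.findall(dmesg_text)}
    online = {int(n) for n in ONLINE_RE.findall(dmesg_text)}
    return sorted(offline - online)
```

An empty list means all secondary CPUs resumed, which is the case for the log above. The regexes search the whole text, so it does not matter whether the log is split into lines or pasted as one long string.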
logoerthiner1 commented 2 years ago

> It may also be useful to test whether Windows still works when I disable Intel PTT. I will test it later.

With PTT disabled, Windows also cannot wake up from suspend, and the symptoms are the same (it cannot wake up, and long-pressing the power button to power off and on again does not help; unplugging AC power and powering on sometimes recovers the system). Enabling PTT again does not help. Only after I reset my BIOS to the default configuration did suspend on Windows work again. The lesson is that Intel PTT should not be disabled when troubleshooting suspend.

However, when I reset my BIOS to the default configuration, the Qubes OS entry disappeared from the boot order and I could not boot from the HDD that Qubes OS is installed on (Windows is on the SSD, Qubes OS on the HDD; the SSD boots and the HDD does not. Earlier I always booted Qubes OS through a dedicated boot entry named "Qubes OS", but it disappeared when I reset the BIOS).

I may need to take time to find a rescue disk and fix Qubes OS booting according to https://www.qubes-os.org/doc/uefi-troubleshooting/ .

Update1: Fixed. Actually, the efibootmgr command in https://www.qubes-os.org/doc/uefi-troubleshooting/ is outdated, since Qubes OS now uses GRUB.

logoerthiner1 commented 2 years ago

> You can try plain Linux by... booting Qubes without Xen - in grub drop multiboot2 line, then replace subsequent two module2 lines with linux and initrd. Obviously no VM would start, but it should allow you testing S3 with plain Linux.

@marmarek Could you please explain the steps in detail? I am new to editing GRUB entries. I have tried:

  1. In GRUB, press E to edit the selected entry.
  2. Comment out the multiboot2 line and change the two module2 lines to linux /vmlinuz-5.10.96-1.fc32.qubes.x86_64 (with or without the trailing arguments) and initrd /initramfs-5.10.96-1.fc32.qubes.x86_64.img

I then see dracut: FATAL: Cannot unbind PCI devices and dracut: Refusing to continue, and finally reboot: System halted

logoerthiner1 commented 2 years ago

> Also, you said that Windows is fine with this. What about plain Linux?

I tried Ubuntu 21.10 from the ISO and found that Linux S3 works. @andyhhp

Another discovery is that the wireless card also works in Ubuntu 21.10, even after suspend and wake-up. In Qubes, suspend panics sys-net. Ubuntu 21.10 has kernel 5.13 while Qubes has kernel 5.15; why would a 5.15 VM crash repeatably while a 5.13 environment is stable?

logoerthiner1 commented 2 years ago

For issue (1), I have tried kernel-latest-qubes-vm from current-testing and it still crashes, so I will open another issue for it.

marmarek commented 2 years ago

> It works here.

Furthermore, on plain Linux after S3 KVM still works.

logoerthiner1 commented 2 years ago

> > It works here.
>
> Furthermore, on plain Linux after S3 KVM still works.

Have you tested whether suspend works in Qubes R4.0? I installed R4.1 directly and cannot afford to install R4.0 only for testing; it would take too much time on my computer. Also, R4.0 does not have a live USB for testing.

marmarek commented 2 years ago

I can try, but I'm pretty sure it won't work, if it manages to even boot there.

brendanhoar commented 2 years ago

> Furthermore, on plain Linux after S3 KVM still works.

Including with PCIe pass through to HVM guests?

B

andyhhp commented 2 years ago

So good news. @marmarek managed to repro and diagnose that it was a problem with CET Shadow Stacks, and it is a bug in Xen. We've got a fix, which I'm cleaning up.

@logoerthiner1 Do you want crediting on the upstream bugfix, and if so, name and email for a Reported-by: tag. (e.g. https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=35727551c0703493a2240e967cffc3063b13d49c)

marmarek commented 2 years ago

@logoerthiner1 there is a test package with the (preliminary) fix included, you can get it via qubes-dom0-update --enablerepo=qubes-dom0-unstable --action=update xen-hypervisor. You should get xen-hypervisor-4.14.4-1.57.fc32.x86_64. Can you try?

I recommend backing up /boot/xen-4.14.4.gz first, just in case.
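A backup like the one recommended above can be made with a plain copy; the following is just a small sketch of the idea. The helper name and the `.bak-` naming scheme are assumptions for illustration, and writing next to a file under /boot would need root in dom0.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, suffix=None):
    """Copy `path` to a sibling file named <name>.bak-<suffix> and return it."""
    src = Path(path)
    # Default suffix is a timestamp so repeated backups do not collide.
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.bak-{suffix}")
    shutil.copy2(src, dst)  # copy2 also preserves timestamps and mode bits
    return dst
```

For example, `backup_file("/boot/xen-4.14.4.gz")` would leave the original in place and create a timestamped copy beside it, which can be copied back if the test hypervisor does not boot.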

logoerthiner1 commented 2 years ago

> @logoerthiner1 there is a test package with the (preliminary) fix included, you can get it via qubes-dom0-update --enablerepo=qubes-dom0-unstable --action=update xen-hypervisor. You should get xen-hypervisor-4.14.4-1.57.fc32.x86_64. Can you try?
>
> I recommend backing up /boot/xen-4.14.4.gz first, just in case.

Quick response: (I have backed up everything related to Xen in /boot, as I am a newbie) the package installed correctly. When I shut down and rebooted, Qubes OS booted correctly, although it took a bit longer to get from GRUB to the HDD password screen. It suspends in around 6 seconds (a bit longer than before). When I press the power button to wake up, the computer wakes up nearly instantly (no more Stuck ??), but then it shuts down, as if suspend had paused in the middle of a shutdown process and resume continued it. Let me investigate further.

hypervisor.log says nothing after Enabling non-boot CPUs, which suggests that wake-up itself works.

Update: I tried a second time; this time, on wake-up, the xscreensaver lock screen appeared for 2 seconds and then the whole Qubes OS shut down. Each boot seemed to have disappeared from journalctl completely, so I could not see the dom0 dmesg.

It does not look like a dom0 panic: when I press F1 during shutdown, I see dracut or something similar working, block devices and LUKS being unmounted, and the usual shutdown steps.

Update2: When I close and open the laptop lid to wake up, it still shuts down on wake-up. So the cause is not Qubes OS misinterpreting the power button as a shutdown signal.

Update3: The behavior is the same as when poweroff is executed in dom0.

Update4: I misread the time field in journalctl (the log switches time zones frequently). After finding that journalctl works correctly, I found the possible culprit:

dom0 kernel: thermal thermal_zone3: critical temperature reached (128 C), shutting down

The actual sensor ID (thermal_zone1, thermal_zone2, thermal_zone3) varies; the reported temperature is always exactly 128 C.
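To collect such events from a saved journal, something like the following sketch could be used. The function names are made up, and the "bogus reading" check is only a heuristic assumption drawn from the observation above (different zones, identical value), not a kernel-defined rule.

```python
import re

# Matches lines such as:
# dom0 kernel: thermal thermal_zone3: critical temperature reached (128 C), shutting down
CRITICAL_RE = re.compile(
    r"thermal (thermal_zone\d+): critical temperature reached \((-?\d+) C\)"
)


def critical_thermal_events(journal_lines):
    """Return (zone, temperature) tuples for every critical-temperature line."""
    events = []
    for line in journal_lines:
        match = CRITICAL_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def looks_like_bogus_reading(events):
    """Heuristic (assumption): several zones reporting one identical value
    points at a bad sensor read rather than real overheating."""
    zones = {zone for zone, _ in events}
    temps = {temp for _, temp in events}
    return len(zones) > 1 and len(temps) == 1
```

The input could be the lines of `journalctl -k` output saved to a file; the sketch only reads text and does not touch the thermal subsystem.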

It looks like the bug in https://bugzilla.kernel.org/show_bug.cgi?id=201761 . Trying https://community.solid-run.com/t/thermal-thermal-zone0-critical-temperature-reached-95-c-shutting-down/92 . (First attempt: not working)

Update5: kernel-latest in dom0 solves the issue. Also, issue https://github.com/QubesOS/qubes-issues/issues/7294 no longer seems urgent: although sys-net still panics on every suspend, with the new Xen and kernel it can be restarted without other problems (earlier I could not restart sys-net at all). So hopefully I can actually start relying on my new machine now.

Thank you all @andyhhp @marmarek ! Great job!

> @logoerthiner1 Do you want crediting on the upstream bugfix, and if so, name and email for a Reported-by: tag. (e.g. https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=35727551c0703493a2240e967cffc3063b13d49c)

If acceptable, Thiner Logoer (logoerthiner1 at 163.com).

marmarek commented 2 years ago

> Qubes OS booted correctly, although it took a bit longer to get from GRUB to the HDD password screen

Check whether there are any errors (especially related to CPU bring-up) in xl dmesg on startup.

logoerthiner1 commented 2 years ago

> > Qubes OS booted correctly, although it took a bit longer to get from GRUB to the HDD password screen
>
> Check whether there are any errors (especially related to CPU bring-up) in xl dmesg on startup.

The only line that looks like an error is parameter "no-real-mode" unknown!, which I think is not the problem. The interval from GRUB to the HDD password screen does not seem that long (a few seconds), and it may simply be because I installed Qubes OS on an HDD.

I have not really observed other problems, apart from sys-net always panicking on suspend, which is tracked in another issue.

andyhhp commented 2 years ago

> The only line that looks like an error is parameter "no-real-mode" unknown!, which I think is not the problem

Yeah, that's just noise and we've hidden it in more recent versions of Xen.

andyhhp commented 2 years ago

Upstream (slightly RFC) patches at https://lore.kernel.org/xen-devel/20220224194853.17774-2-andrew.cooper3@citrix.com/ but there's another race condition I found, so I suspect the fix is going to be rather more involved. The minimal fix done on this thread is fine in the interim.

qubesos-bot commented 2 years ago

Automated announcement from builder-github

The package vmm-xen has been pushed to the r4.1 testing repository for the CentOS centos-stream8 template. To test this update, please install it with the following command:

sudo yum update --enablerepo=qubes-vm-r4.1-current-testing


qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component vmm-xen (including package python3-xen-4.14.4-2.fc32) has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing


qubesos-bot commented 2 years ago

Automated announcement from builder-github

The package xen_4.14.4-2 has been pushed to the r4.1 testing repository for the Debian template. To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade
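The "uncomment the line containing buster-testing" step can be pictured as the following text edit. This is a hedged sketch only: the function name is made up, the sample URL in the usage note is a placeholder, and the real file layout under /etc/apt/sources.list.d/ may differ. It deliberately uncomments only `deb` lines and leaves `deb-src` lines and ordinary comments alone.

```python
def enable_testing_repo(list_text, suite="buster-testing"):
    """Return `list_text` with commented-out `deb` lines for `suite` enabled."""
    out = []
    for line in list_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            candidate = stripped.lstrip("#").lstrip()
            # Only touch binary package lines that name the requested suite.
            if candidate.startswith("deb ") and suite in candidate.split():
                line = candidate
        out.append(line)
    return "\n".join(out) + "\n"
```

Given a line such as `#deb https://example.invalid/vm buster-testing main`, the function returns it without the leading `#`; for another template version, pass the matching suite name instead of `buster-testing`.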


qubesos-bot commented 2 years ago

Automated announcement from builder-github

The package vmm-xen has been pushed to the r4.1 stable repository for the CentOS centos-stream8 template. To install this update, please use the standard update command:

sudo yum update


qubesos-bot commented 2 years ago

Automated announcement from builder-github

The package xen_4.14.4-2+deb10u1 has been pushed to the r4.1 stable repository for the Debian template. To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade


qubesos-bot commented 2 years ago

Automated announcement from builder-github

The component vmm-xen (including package python3-xen-4.14.4-2.fc32) has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.
