Closed logoerthiner1 closed 2 years ago
Here is an update on what I have found.
I no-op'ed the internal.SuspendPost logic (or something like it, in /usr/lib/python3.8/qubes/api/internal.py) and most of internal.SuspendPre (keeping only the suspend part and removing the SuspendPre message broadcast), so that I can pause and unpause the VMs manually before and after suspension. Then I suspended and woke up the machine.
The result: waking up still takes too long, and after waking up (all the VMs are paused except for dom0) I looked at the dom0 hypervisor log, which said things like:
[2022-02-17 20:11:38] (XEN) Enabling non-boot CPUs ...
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU1 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU2 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU3 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU4 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU5 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU6 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU7 up: -5
Then, when I manually unpause a VM, the first VM I unpause crashes in Xen. The hypervisor log looks like this:
[2022-02-17 20:11:39] (XEN) d1v0 Unexpected vmexit: reason 3
[2022-02-17 20:11:39] (XEN) domain_crash called from vmx.c:4304
[2022-02-17 20:11:39] (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
[2022-02-17 20:11:39] (XEN) ----[ Xen-4.14.3 x86_64 debug=n Not tainted ]----
[2022-02-17 20:11:39] (XEN) CPU: 0
[2022-02-17 20:11:39] (XEN) RIP: 0010:[<ffffffff890023a8>]
[2022-02-17 20:11:39] (XEN) RFLAGS: 0000000000000002 CONTEXT: hvm guest (d1v0)
[2022-02-17 20:11:39] (XEN) rax: 0000000000000001 rbx: ffffa55ac015be74 rcx: 00000000ffffffff
[2022-02-17 20:11:39] (XEN) rdx: 0000000000000000 rsi: ffffa55ac008fe44 rdi: 0000000000000002
[2022-02-17 20:11:39] (XEN) rbp: ffffa55ac015bdf4 rsp: ffffa55ac008fe38 r8: 00000897ac38a9f8
[2022-02-17 20:11:39] (XEN) r9: 0000000000000002 r10: 0000000000000000 r11: 0000000000000000
[2022-02-17 20:11:39] (XEN) r12: 0000000000000000 r13: 0000000000000000 r14: 0000000000000000
[2022-02-17 20:11:39] (XEN) r15: 0000000000000003 cr0: 0000000080050033 cr4: 0000000000770ef0
[2022-02-17 20:11:39] (XEN) cr3: 0000000002748001 cr2: 00006202870c6070
[2022-02-17 20:11:39] (XEN) fsb: 0000000000000000 gsb: ffff94552f600000 gss: 0000000000000000
[2022-02-17 20:11:39] (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0018 cs: 0010
Pausing and unpausing VMs is not problematic at all when S3 sleep is not involved. Therefore I believe that Xen must have some problem dealing with Linux S3 sleep. The messages "Error bringing CPU1 up: -5" and "d1v0 Unexpected vmexit: reason 3" indicate two problems inside the Xen sleep logic.
I skimmed the Xen 4.14.4 source code and found that:
-5 == -EIO
3 == EXIT_REASON_INIT
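For anyone who wants to check the same thing on their own hypervisor log, here is a minimal Python sketch (not part of Xen or Qubes). The function name failed_cpus and the sample text are only illustrative; the lookup relies on Python's standard errno table, where code 5 is EIO.

```python
import errno
import re

# Matches Xen lines such as "(XEN) Error bringing CPU1 up: -5".
BRINGUP_ERROR = re.compile(r"\(XEN\) Error bringing CPU(\d+) up: (-?\d+)")


def failed_cpus(log_text):
    """Return {cpu_number: errno_name} for every failed CPU bring-up line."""
    result = {}
    for match in BRINGUP_ERROR.finditer(log_text):
        cpu = int(match.group(1))
        code = abs(int(match.group(2)))
        # errno.errorcode maps numeric codes to names; unknown codes stay numeric.
        result[cpu] = errno.errorcode.get(code, str(code))
    return result


# Sample lines copied from the log above (shortened to two CPUs).
sample = """\
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU1 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU2 up: -5
"""
print(failed_cpus(sample))
```

This only decodes the error number; it says nothing about why the CPUs failed to come up.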
I still have no clue why the first resumed VM got killed for a vmexit with EXIT_REASON_INIT. For the EIO part, however, it seems to originate from xen/arch/x86/smpboot.c:609:
if ( cpu_state == CPU_STATE_CALLIN )
{
    /* number CPUs logically, starting from 1 (BSP is 0) */
    Dprintk("OK.\n");
    print_cpu_info(cpu);
    synchronize_tsc_master(cpu);
    Dprintk("CPU has booted.\n");
}
else if ( cpu_state == CPU_STATE_DEAD )
{
    smp_rmb();
    rc = cpu_error;
}
else
{
    boot_error = 1;
    smp_mb();
    if ( bootsym(trampoline_cpu_started) == 0xA5 )
        /* trampoline started but...? */
        printk("Stuck ??\n"); // <= THIS LINE
    else
        /* trampoline code not run */
        printk("Not responding.\n");
}
So the other CPUs either timed out or did not correctly set the CPU state to CPU_STATE_CALLIN.
Maybe it is better to let a xen expert diagnose further.
Again, my CPU is i5-1135G7 and I enabled Linux S3 in my BIOS.
~The line of -EIO's to begin with means the APs aren't responding to an INIT-SIPI-SIPI sequence to start them up. Whatever the firmware has done (or not done), they're not in a working state.~ Edit: Not true, it turns out.
The VM "crashing" actually has nothing to do with the VM, or Xen really. VMEXIT_REASON_INIT means "an INIT IPI has arrived on this CPU at some point since your last VMEntry".
Do you have TXT enabled in firmware? The purpose of the VMEXIT_REASON_INIT comes from TXT and means "scrub your secrets from RAM, then reset". It's not supported by Xen, and if it were, you'd have a (hopefully clean) restart, rather than a resume. Also, in TXT, the rules for booting APs change.
Finally, a Xen expert! Although I do not completely understand the details, I tried disabling Intel PTT in the BIOS and something different happens: when waking up, the time the power button light blinks (the time spent attempting to wake up the other CPU cores) is shorter, and then the computer immediately collapses (power light is on, screen is black, HDD halts, no response).
The stuck time was reduced.
Waking up does not write to the log file at all. I misunderstood the logs before.
(I will try once more later)
I tried a second time, and the computer bricked. It seems as if Xen had overwritten my boot procedure with nops. Unplugging the power cable made my computer go back to normal.
Anyway, it seems that after I disabled Intel PTT in my BIOS, suspend-to-RAM broke completely, since the computer does not even wake up to Qubes OS.
I believe that when I attempt to wake up my computer, Xen first tries to wake up CPU1, then finds something crazy and reboots, which puts the computer into a very dangerous state.
Try removing xscreensaver from dom0; if that works, the problem is the locker. On my x230 and X1 Carbon 5th it works when I install light-locker, remove the xscreensaver package, and then set general/LockCommand to "dm-tool switch-to-greeter".
I guess that Ctrl-Alt-F2 should always work if the problem is limited to xscreensaver. I have tried Ctrl-Alt-F2 and it does not work at all.
No, you can't switch to the consoles because xscreensaver blocks this. Remove xscreensaver and check. Then install another locker. I have been testing light-locker on an x230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).
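If you would rather script the setting than type it, the command above can be assembled as an argument list. This is only a sketch: the helper name lock_command_argv is made up for illustration, and nothing is executed here; the flags are exactly the ones from the command above.

```python
import shlex


def lock_command_argv(locker_command="dm-tool switch-to-greeter"):
    """Build the xfconf-query argument list that sets the Xfce lock command."""
    return [
        "xfconf-query",
        "-c", "xfce4-session",
        "-p", "/general/LockCommand",
        "-s", locker_command,
    ]


argv = lock_command_argv()
# Print a shell-quoted form for review. Actually running it in dom0,
# e.g. with subprocess.run(argv, check=True), is left to the reader.
print(shlex.join(argv))
```

Passing the locker command as a single list element avoids quoting mistakes with the space in "dm-tool switch-to-greeter".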
I do not think this would work, since I can hear the hardware resetting when I attempt to wake up; even a malfunctioning screen saver does not reset the machine and its hardware. By the way, are you talking about R4.1?
I may try changing the screen saver once I have run out of other options, since I think that REMOVING packages in dom0 without proper planning is dangerous.
This error occurs in versions 4.0 and 4.1. I found only one solution: remove xscreensaver from dom0, install light-locker as the lock program, and set the lock command in xfce4. The worst part is the randomness of this error. The laptop wakes up fine for a week, then freezes and does not wake up. After that, there was no problem anymore. It definitely occurs on Lenovo laptops. I have not checked others. It does not matter whether the BIOS is coreboot (x230) or Lenovo's BIOS (X1 Carbon). The hardware could go through a random number of sleep/wake cycles, and it was sure to crash at some point. After changing to light-locker, everything has been working steadily for a year.
I don't know what is dangerous in your opinion. I do not see a security issue in removing the xscreensaver package from qubes.
Removing any package in dom0 is a dangerous action, since dom0 may no longer boot (imagine you accidentally uninstalled libc; Linux distributions install packages reliably, but uninstalling is less tested and may lead to problems). I would rather set the lock screen command to be empty in order to find out whether the screen locker is the culprit.
It's funny. The xscreensaver package cannot and will not conflict in dom0. Sure, you can look for a solution. Run xscreensaver -v -log /home/user/diaglog.txt and you might be able to diagnose it exactly. In my case, several weekends of looking for the problem did not solve it.
@logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake and Intel no longer support S3 on Tigerlake, so this "Linux S3" option is probably an OEM addition.
Also, you said that Windows is fine with this. What about plain Linux?
Does this mean this is blocked on S0ix support in Xen? Is there any chance that Xen will support S0ix in combination with PCIe pass-through with not-fully-trusted guests?
The device I am talking about is a Thinkpad L15 Gen 2 (This should be visible in the debug info; anyway).
I have updated the BIOS, and the updated BIOS provides a "Linux S3" option. This is the BIOS update documentation: https://download.lenovo.com/pccbbs/mobiles/r1juj10w.txt , where "Linux S3" can be found. Once I enabled "Linux S3", both Windows and Qubes OS are able to suspend as expected (the fan stops in Linux S3), but only Windows wakes up correctly.
Personally I HATE S0ix, like many other people: how is S0ix different from just locking the screen, pausing every application, and turning off the monitor? (Actually, in Qubes OS I find that I can set the screen brightness to absolute zero, which may be deemed either a bug or a feature.)
Also, you said that Windows is fine with this. What about plain Linux?
I have not installed plain Linux there... If this information is important, I may try Ubuntu boot media (dd an Ubuntu 21.10 installer onto a USB drive, for example) or something similar. Any other thoughts on testing or debugging?
It may also be useful to test whether Windows works when I disable Intel PTT. I will test it later.
You can try plain Linux by... booting Qubes without Xen: in grub, drop the multiboot2 line, then replace the subsequent two module2 lines with linux and initrd. Obviously no VM would start, but it should allow you to test S3 with plain Linux.
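To make the edit concrete, here is a rough Python sketch of that rewrite applied to the boot commands of a menu entry. It is an illustration only: the function name strip_xen_from_entry and the sample entry (in particular the root= argument) are made up, and any extra options on the module2 lines of a real entry are copied through unchanged and may need adjusting by hand.

```python
def strip_xen_from_entry(entry_lines):
    """Rewrite GRUB boot commands so the entry boots Linux without Xen.

    Drops the multiboot2 line, turns the first module2 line into `linux`
    and the second into `initrd`. All other lines are kept unchanged.
    """
    output = []
    replacements = iter(["linux", "initrd"])
    for line in entry_lines:
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        parts = stripped.split(None, 1)
        command = parts[0] if parts else ""
        if command == "multiboot2":
            continue  # the Xen hypervisor line is dropped entirely
        if command == "module2":
            new_command = next(replacements, None)
            if new_command is None:
                raise ValueError("expected exactly two module2 lines")
            rest = parts[1] if len(parts) > 1 else ""
            output.append(f"{indent}{new_command} {rest}".rstrip())
            continue
        output.append(line)
    return output


# Hypothetical entry; the kernel arguments are placeholders, not a real config.
entry = [
    "    multiboot2 /xen-4.14.4.gz",
    "    module2 /vmlinuz-5.10.96-1.fc32.qubes.x86_64 root=/dev/mapper/example ro",
    "    module2 /initramfs-5.10.96-1.fc32.qubes.x86_64.img",
]
for line in strip_xen_from_entry(entry):
    print(line)
```

In practice the same change is made interactively by pressing e on the entry in the grub menu, so nothing on disk needs to be modified.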
I have a different system that behaves in a very similar way.
What about plain Linux?
It works here.
Also it may be useful to test whether windows work when I disable Intel PTT. I will test it later.
With PTT disabled, Windows also cannot wake up from suspend, and the symptoms are the same (it cannot wake up, and a long press of the power button to power off and on again does not work; unplugging the AC and powering on sometimes recovers the system). Enabling PTT again does not help. Only when I reset my BIOS to the default config did suspend on Windows work again. The lesson is that disabling Intel PTT is not something to consider when dealing with suspend.
However, when I reset my BIOS to the default config, the Qubes OS entry disappeared from the boot order and I cannot boot from the HDD that Qubes OS is installed on (Windows is on the SSD and Qubes OS on the HDD; the SSD boots and the HDD does not; earlier I always booted Qubes OS through a dedicated boot entry named "Qubes OS", but when I reset the BIOS it disappeared and I cannot boot Qubes OS).
I may need to take time to find a rescue disk and fix Qubes OS booting according to https://www.qubes-os.org/doc/uefi-troubleshooting/ .
Update1: Fixed. Actually the efibootmgr command in https://www.qubes-os.org/doc/uefi-troubleshooting/ is outdated, since Qubes OS now uses grub.
Could you please explain the steps in detail? I am a newbie at hacking grub. @marmarek I have tried replacing the module2 lines with linux /vmlinuz-5.10.96-1.fc32.qubes.x86_64 (with or without the trailing arguments) and initrd /initramfs-5.10.96-1.fc32.qubes.x86_64.img
I then see "dracut: FATAL: Cannot unbind PCI devices" and "dracut: Refusing to continue", and finally "reboot: System halted".
Also, you said that Windows is fine with this. What about plain Linux?
I tried Ubuntu 21.10 from the ISO and found that Linux S3 works. @andyhhp
Another discovery is that the wireless card also works in Ubuntu 21.10, even after suspending and waking up (in Qubes, suspension panics sys-net; Ubuntu 21.10 has kernel 5.13 while Qubes has kernel 5.15, so why would a 5.15 VM crash repeatably while a 5.13 environment is good and stable?)
For issue (1) I have tried kernel-latest-qubes-vm from current-testing and it is still crashing, so I will open another issue for it.
It works here.
Furthermore, on plain Linux after S3 KVM still works.
Have you tested whether suspend works in Qubes R4.0? I installed R4.1 directly and I cannot afford to install R4.0 just for testing; it would take too much time on my computer. Also, R4.0 does not have a live USB for testing.
I can try, but I'm pretty sure it won't work, if it manages to even boot there.
Furthermore, on plain Linux after S3 KVM still works.
Including with PCIe pass through to HVM guests?
So good news. @marmarek managed to repro and diagnose that it was a problem with CET Shadow Stacks, and it is a bug in Xen. We've got a fix, which I'm cleaning up.
@logoerthiner1 Do you want crediting on the upstream bugfix, and if so, name and email for a Reported-by: tag. (e.g. https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=35727551c0703493a2240e967cffc3063b13d49c)
@logoerthiner1 there is a test package with the (preliminary) fix included; you can get it via qubes-dom0-update --enablerepo=qubes-dom0-unstable --action=update xen-hypervisor. You should get xen-hypervisor-4.14.4-1.57.fc32.x86_64. Can you try?
I recommend backing up /boot/xen-4.14.4.gz first, just in case.
Quick response (I have backed up everything related to Xen in /boot, as I am a newbie): the package installed correctly; when I shut down and reboot the system, Qubes OS boots correctly, although it takes a bit longer from grub to the HDD password screen; it suspends in around 6 seconds (a bit longer than before); when I press the power button to wake up, the computer wakes up nearly instantly (obviously no "Stuck ??") but then it shuts down, as if the suspend had paused in the middle of a shutdown and the resume continued it. Let me investigate further.
hypervisor.log says nothing after "Enabling non-boot CPUs", which means that wake-up should work.
Update: I tried a second time; this time on waking up, the screen locker appears for 2 seconds and then the whole of Qubes OS shuts down.
Every boot has disappeared from journalctl completely, so I cannot see what the dmesg in dom0 looks like.
It does not look like dom0 panics, since when I press F1 during shutdown I see that dracut or something is working, block devices and LUKS are being unmounted, and the steps look normal.
Update2: When I close and open the laptop lid to wake up, it still shuts down on waking. So the reason is not that Qubes OS misinterprets the power button as a shutdown signal.
Update3: The behavior is the same as when poweroff is executed in dom0.
Update4: I misinterpreted the time field in journalctl (the log switches timezone frequently); after finding that journalctl is working correctly, I found the possible culprit:
dom0 kernel: thermal thermal_zone3: critical temperature reached (128 C), shutting down
The actual sensor id (thermal_zone1, thermal_zone2, thermal_zone3) varies; 128 C is exact.
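To check which zones report this and whether the reading is always the same, the journal text can be scanned with a small script. This is a sketch only: the function name critical_thermal_events is made up, and the pattern is written against the single log line quoted above, so other kernel versions may word the message differently.

```python
import re

# Matches kernel lines such as
# "thermal thermal_zone3: critical temperature reached (128 C), shutting down".
CRITICAL_TEMP = re.compile(
    r"thermal (thermal_zone\d+): critical temperature reached \((-?\d+) C\)"
)


def critical_thermal_events(journal_text):
    """Return (zone, temperature_celsius) pairs for each critical-temperature line."""
    return [
        (match.group(1), int(match.group(2)))
        for match in CRITICAL_TEMP.finditer(journal_text)
    ]


# Sample line copied from the journal output above.
sample = (
    "dom0 kernel: thermal thermal_zone3: critical temperature reached "
    "(128 C), shutting down\n"
)
print(critical_thermal_events(sample))
```

The input can be the saved output of journalctl for the boot in question.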
It looks like the bug at https://bugzilla.kernel.org/show_bug.cgi?id=201761 . Trying https://community.solid-run.com/t/thermal-thermal-zone0-critical-temperature-reached-95-c-shutting-down/92 . (First attempt: not working)
Update5: kernel-latest in dom0 solves the issue. Also, it seems that issue https://github.com/QubesOS/qubes-issues/issues/7294 is no longer urgent: although sys-net still panics on every suspend, with the new Xen and kernel it can be restarted without other issues (earlier I could not restart sys-net at all). So hopefully I can actually start relying on my new machine now.
Thank you all @andyhhp @marmarek ! Great job!
@logoerthiner1 Do you want crediting on the upstream bugfix, and if so, name and email for a Reported-by: tag. (e.g. https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=35727551c0703493a2240e967cffc3063b13d49c)
If acceptable, Thiner Logoer (logoerthiner1 at 163.com).
Qubes OS boots correctly despite it takes a bit longer time from grub to the HDD password screen
Check whether you have any errors (especially related to CPU bring-up) in xl dmesg on startup.
The only line that looks like an error is: parameter "no-real-mode" unknown!
I do not think that is the problem; the interval from grub to the HDD password screen does not seem to be that long (a few seconds). The longer interval might also be because I installed Qubes OS on an HDD.
I have not really observed other problems, other than that sys-net always panics on suspend, which is tracked in another issue.
The only line looks like an error is parameter "no-real-mode" unknown! which I think is not the problem
Yeah, that's just noise and we've hidden it in more recent versions of Xen.
Upstream (slightly RFC) patches at https://lore.kernel.org/xen-devel/20220224194853.17774-2-andrew.cooper3@citrix.com/ but there's another race condition I found, so I suspect the fix is going to be rather more involved. The minimal fix done on this thread is fine in the interim.
Automated announcement from builder-github
The package vmm-xen has been pushed to the r4.1 testing repository for the CentOS centos-stream8 template.
To test this update, please install it with the following command:
sudo yum update --enablerepo=qubes-vm-r4.1-current-testing
Automated announcement from builder-github
The component vmm-xen (including package python3-xen-4.14.4-2.fc32) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The package xen_4.14.4-2 has been pushed to the r4.1 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or appropriate equivalent for your template version), then use the standard update command:
sudo apt-get update && sudo apt-get dist-upgrade
Automated announcement from builder-github
The package vmm-xen has been pushed to the r4.1 stable repository for the CentOS centos-stream8 template.
To install this update, please use the standard update command:
sudo yum update
Automated announcement from builder-github
The package xen_4.14.4-2+deb10u1 has been pushed to the r4.1 stable repository for the Debian template.
To install this update, please use the standard update command:
sudo apt-get update && sudo apt-get dist-upgrade
Automated announcement from builder-github
The component vmm-xen (including package python3-xen-4.14.4-2.fc32) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
I installed R4.1 on my new ThinkPad L15 Gen2 and figured out how to enable Linux S3 sleep. Linux S3 sleep works on Windows, but on Qubes OS (actually I was surprised to see that suspend and wake-up partially work) it runs into many problems.
Basically, suspend (1) immediately panics my sys-net because of an MT7921 driver NULL pointer dereference inside kernel-latest (5.15.14-1.fc32), (2) randomly kills some of my running appVMs, and (3) causes a long lag when waking up (~15s from pressing the power button to the xscreensaver screen, with the power button light blinking as if it were still suspending).
I would like to try handling (1) myself, and (3) does not seem easy to figure out, so this issue focuses on (2).
It might be related to #6411 since my CPU is i5-1135G7.
Qubes OS release
R4.1
Brief summary
Suspend randomly shuts down running VMs, by Xen:
Steps to reproduce
Suspend, and wake up
Expected behavior
After waking up, all VMs that were running before suspension are still running.
Actual behavior
Some random VMs are shut down. It seems that the VMs are shut down when waking up, but I cannot rule out that the crash happens when suspending.
Log files for reference
https://pastebin.mozilla.org/84hYCS5v
Edit
I tried disabling HyperThreading, and it only cuts the lag time of (3) in half; VMs still randomly exit.