Closed logoerthiner1 closed 1 year ago
The original fix says that there are rare cases that it still does not work; however does that case match the problem I am facing?
I meant rare at datacenter scale, not "can be reliably repro'd on a single system". Also, the way it would go wrong would be a crash some point after resume, so there would be logging from Xen
Update: each problem occur when waking up over 12 hours later from suspend; not sure whether very frequent suspend & wake up will cause problem.
If waiting 12h really does make the difference between it working and not, then this is almost certainly a Lenovo firmware bug.
I have recently reproduced the issue after 4~5 suspensions within one hour (while I was testing something else), so this bug does not seem to depend on a long suspend time. When I reproduced it, I later found that the CPU/fan was unreasonably hot; I am not sure whether that has anything to do with the issue.
After inspecting the log files more carefully, I found something different: dom0 journalctl -r shows that the suspend is being carried out, but Xen logs nothing about that suspend.
(1) [2022-03-14 14:46:18] (XEN) Disabling non-boot CPUs ...
(2) [2022-03-14 14:46:18] (XEN) Broke affinity for IRQ137, new: ffff
(3) [2022-03-14 14:46:18] (XEN) Broke affinity for IRQ9, new: ffff
(4) [2022-03-14 14:46:18] (XEN) Broke affinity for IRQ16, new: ffff
(5) [2022-03-14 14:46:18] (XEN) Entering ACPI S3 state.
(6) [2022-03-14 14:46:18] (XEN) CPU0 CMCI LVT vector (0xf1) already installed
(7) [2022-03-14 14:46:18] (XEN) Finishing wakeup from ACPI S3 state.
(8) [2022-03-14 14:46:18] (XEN) Enabling non-boot CPUs ...
For example, each successful suspend-resume loop shows up in the Xen log as lines (1)-(8) above.
However, when the fatal suspension (the one that never wakes up) happens, the Xen log does not even contain line (1), let alone the later lines. The dom0 log does have entries related to that sleep:
Mar 14 14:47:08 dom0 systemd-sleep[22555]: Suspending system...
Mar 14 14:47:08 dom0 systemd[1]: Starting Suspend...
Mar 14 14:47:08 dom0 kernel: audit: type=1130 audit(1647240428.975:337): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=qubes-suspen
Mar 14 14:47:08 dom0 systemd[1]: Reached target Sleep.
I have observed similar logging patterns in the other fatal suspension cases.
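The comparison described above (dom0 logs a suspend that Xen never records) can be checked mechanically. The following is a minimal sketch, assuming both logs have been saved as lists of text lines and that the two marker strings quoted in this thread appear exactly once per suspend; the function names are hypothetical and not part of any Qubes or Xen tool.

```python
# Marker strings copied from the log excerpts quoted in this thread.
DOM0_MARKER = "Suspending system..."    # written by systemd-sleep in dom0
XEN_MARKER = "Entering ACPI S3 state"   # written by Xen when it enters S3


def count_marker(lines, marker):
    """Count how many log lines contain the given marker string."""
    return sum(1 for line in lines if marker in line)


def suspends_unseen_by_xen(dom0_lines, xen_lines):
    """Return how many dom0 suspend attempts have no matching Xen S3 entry.

    Assumption: one marker line per suspend in each log, so a plain count
    difference is enough and no timestamp matching is attempted.
    """
    attempts = count_marker(dom0_lines, DOM0_MARKER)
    entered = count_marker(xen_lines, XEN_MARKER)
    return max(attempts - entered, 0)
```

A non-zero result would correspond to the "fatal" case reported here, where dom0 starts suspending but Xen never logs lines (1)-(8).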
Any instructions on further investigations? Or any clue on why xen does not log about the fatal suspension?
I can reproduce it on my TGL system, with totally different firmware. I have a serial console, but there are not many more details there. Xen stops responding to triple ctrl-a when it hangs.
May I ask whether you have any progress on investigation of the issue? @marmarek
Update: I have updated with qubes-dom0-update, and after the update (I believe mostly thanks to the firmware update) I tried dom0 kernel 5.16.13, but it did not seem stable (or maybe that was my mistake). I then tried 5.15.14 as the dom0 kernel and repeatedly suspended and resumed. I tried ~15 times and it seems to work.
Since the bug appears only occasionally, I will keep checking whether the new update really fixes the issue. If the issue does not show up within a week or two, I will consider it solved.
Would you be willing to try kernel and/or firmware packages from qubes-dom0-current-testing
? They are fully security supported, so you should not be making your system less secure by doing so.
[quote]
Would you be willing to try kernel and/or firmware packages from qubes-dom0-current-testing
? They are fully security supported, so you should not be making your system less secure by doing so.
[/quote]
Is this true?
I don't know what "security supported" means, but surely the
point in having current-testing is to identify any issues (including
security issues) before packages hit current.
It's at least possible that packages in current-testing have introduced
security issues, no?
According to a past email from Marek a security vulnerability in current-testing would result in a QSB being issued, even if the vulnerable package has not hit stable yet.
According to a past email from Marek a security vulnerability in current-testing would result in a QSB being issued, even if the vulnerable package has not hit stable yet.
Yes.
Quick update: this issue now occurs only rarely, and its symptoms have changed. Recently there was one occasion when, after I clicked "Suspend", the power light kept blinking forever.
Latest experiment:
I have tried S3 sleep on my machine under both Windows 10 and an Ubuntu 22.04 live ISO; both work without any problem and are fairly stable. (Note: their suspend does not involve a blinking power light.)
Qubes OS (where S3 sleep is supported in Xen) still occasionally fails (once in 10 or 20 times), either to suspend or to resume. Therefore it should be a Xen problem rather than a firmware problem.
Running xenpm set-scaling-governor powersave, or even appending cpufreq=xen:powersave to the Xen command line, helps a bit but cannot completely eliminate the failure.
I have even tried suspending right before logging in, at tty2. I ran systemctl suspend over and over; Qubes OS seems to survive ~10 suspend loops and then dies.
Death can occur when suspending or when resuming.
(1) When the machine dies on suspending, the power light keeps blinking forever, numlock does not work, and the machine does not respond to anything. (2) When the machine dies on resuming, there is no sign of trouble at suspend time, but when I press the power button to resume, the power light comes on and stays on, numlock does not respond, and the machine does not respond to any input.
This instability of S3 sleep in Qubes OS R4.1 has caused great pain through frequent LVM corruption, and some data loss from unsaved work, over the past half year. I sincerely hope that the developers can investigate further and solve the problem.
@marmarek may I ask about the progress of the investigation? Can you reproduce this bug on other machines, with or without an Intel Tiger Lake CPU, by repeatedly suspending and resuming? Is it related to the new Xen version? Is there anything I can do to help?
frequent LVM corruption
Please file a separate bug for this. LVM volumes should not be corrupted even if the power is yanked in the middle of an operation.
I do not have a reproducible way of triggering the exact same behavior, nor do I want to try reproducing it on my working machine, considering I do not have a zoo of machines. However, similar bugs might need to be considered when thinking about the stability of Qubes OS under various hardware failures. I will try filing such "bugs".
Update: the separate bug (if it is) is #7800
@marmarek
This bug has bothered me for half a year, and I can reliably trigger it every day, but I do not see any later discussion of the bug itself. May I ask about the current status of the bug?
I don't have this issue on any other system; the TGL system I have (Framework) doesn't have any issues with S3 anymore. You may want to try Xen 4.14.5-11 (https://github.com/QubesOS/updates-status/issues/3142); it contains a few suspend-related fixes (although not exactly matching your issue description).
I have tried Xen 4.14.5-11 by manually downloading the xen-hypervisor rpm, putting its xen-4.14.5.gz at /boot/xen-4.14.5-11.gz, and editing the boot command. It does not work either.
No additional log is generated when the suspend problem happens, so I believe it is likely that the computer is trapped in a ring -1 self-loop or something like that.
I wonder whether there is a suggested way of finding out what happened when the computer freezes, for example a watchdog residing in Xen that can generate a coredump for further analysis.
Do you have any suggestions? @marmarek @andyhhp @DemiMarie
I have tried the Xen 4.14.5-11 by downloading manually
Please use the whole packages (it's in the current-testing repository) - the changes are not only about the hypervisor binary.
You can try to get console output to get more details. If your system doesn't have real serial console, but has USB3 controller, see https://github.com/QubesOS/qubes-issues/issues/6834#issuecomment-1296221396. You probably want to add also loglvl=all
to get more details.
@marmarek
I have tried adding dbgp=xhci@pci00:14.0,share=yes console=vga,xhci loglvl=all
to Xen's cmdline (I removed the previous console=none
part of original command line; xen 4.14.5-10) and then the computer freezes, not responding to any keyboard presses, exactly like what I have encountered when suspension fails.
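The command-line change described above (drop console=none, add the xhci debug console and full logging) can be written as a small string transformation. This is only a sketch: the helper name is hypothetical, the PCI address 00:14.0 is the one from this report and will differ on other machines, and dropping every existing console= token is a simplifying assumption rather than something Xen requires.

```python
# Options quoted in this thread; the xhci PCI address is machine-specific.
DEBUG_OPTS = [
    "dbgp=xhci@pci00:14.0,share=yes",
    "console=vga,xhci",
    "loglvl=all",
]


def add_debug_console(cmdline):
    """Return a Xen command line with the debug console options applied.

    Any existing console= token (for example console=none) is removed so
    that it cannot conflict with console=vga,xhci, then the debug options
    are appended. Applying the function twice gives the same result.
    """
    kept = [
        tok for tok in cmdline.split()
        if not tok.startswith("console=") and tok not in DEBUG_OPTS
    ]
    return " ".join(kept + DEBUG_OPTS)


if __name__ == "__main__":
    # Hypothetical starting command line, for illustration only.
    print(add_debug_console("console=none dom0_mem=min:1024M"))
```

The resulting string still has to be placed into the bootloader entry for Xen by hand; the sketch does not touch any files.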
Such freezes appear in multiple other cases (even once when I was shutting down the computer), so I now suspect that whenever Xen panics, the computer freezes that way.
I have been playing with the params with no luck.
Do you have any ideas?
I have finally reached the point where I can debug Xen from another machine. There are many details that I may write up in the forum later as a guide. Here is one suspend problem: when the machine suspends, the power light blinks forever.
TEST CONDITION: I test right when tty1 reaches the login prompt; I do not even log in at tty1, but rather log in at tty2 and repeat sudo systemctl suspend.
Also I have been running sudo dmesg -w at the xhci debug console.
Xen version is 4.14.5-10, kernel version is latest 6.0.2-2, both are the stable versions in qubes repo.
First, a normal resume-suspend loop. Such loops occur most of the time.
Second, a resume-suspend loop BEFORE the culprit suspend (it has problems but the suspend succeeds; the next suspend fails).
Third, the suspend failure.
The Xen suspend prompt does not appear.
Fourth, the Xen debug info. I found that when the computer hangs, Xen is still responsive. I can use triple ctrl-a to access the Xen debug utilities:
And then Xen hangs here. It seems that CPU4 (dom0 vcpu2) is in a bad state, and when Xen tries to dump dom0 vcpu2, Xen hangs: the console does not respond to any further commands, but the debug device is still on.
It would be of great help if you can help analyze the issue. @marmarek @DemiMarie
This is only one of the failures; I will try to grab more. Let me know if more info is needed.
This is either a Xen bug or a Linux kernel bug, but I am not familiar enough with low-level matters to go further.
First, you probably want a version of Xen with the S3/timer issue fixed. @marmarek Do you have a build to hand?
I'm not sure that will make a difference in this case, but let's rule things out one at a time. I expect this is a Linux bug.
One thing does look curious. In the case that everything is wedged, Xen does (initially) respond to debugkeys. When using the 'd' key, we get
(XEN) *** Dumping CPU4 guest state (d0v2): ***
<snip>
(XEN) Fault while accessing guest memory.
which in principle is fine - that looks like it hit a page boundary, and the adjacent page can indeed be unmapped. But moments later when using '0', we get
(XEN) *** Dumping Dom0 vcpu#2 state: ***
and at this point, you say Xen ceases responding to anything? It's certainly suspicious that it's the same vCPU that hit the page boundary.
I know it's tangential to the issue you're trying to debug, but can you set up a watchdog (simply watchdog
on Xen's command line) and repro this hang? I bet something in the '0' debugkey has gotten into a livelock, and if this is the case, it should be broken 5s later by the NMI watchdog, with a backtrace.
A second trace. Nearly the same setup; after 40~50 successful suspend & resume cycles, the computer seemed to have a problem resuming, but it finally recovered.
First, you probably want a version of Xen with the S3/timer issue fixed. @marmarek Do you have a build to hand?
I'm not sure that will make a difference in this case, but let's rule things out one at a time. I expect this is a Linux bug.
One thing does look curious. In the case that everything is wedged, Xen does (initially) respond to debugkeys. When using the 'd' key, we get
(XEN) *** Dumping CPU4 guest state (d0v2): *** <snip> (XEN) Fault while accessing guest memory.
which in principle is fine - that looks like it hit a page boundary, and the adjacent page can indeed be unmapped. But moments later when using '0', we get
(XEN) *** Dumping Dom0 vcpu#2 state: ***
and at this point, you say Xen ceases responding to anything? It's certainly suspicious that it's the same vCPU that hit the page boundary.
Yes the xhci console does not respond to any keypress but the console itself is on (not disconnected from debugee side).
I know it's tangential to the issue you're trying to debug, but can you set up a watchdog (simply
watchdog
on Xen's command line) and repro this hang? I bet something in the '0' debugkey has gotten into a livelock, and if this is the case, it should be broken 5s later by the NMI watchdog, with a backtrace.
Glad to learn about the "watchdog" parameter. I will give it a try. As the suspend issue can be very complex and composed of several separate issues, I will not be surprised if one of them is related to a locked-up Xen (actually this is what I have been suspecting).
@marmarek @andyhhp
One more crash.
TEST CONDITION: same (xen 4.14.5-10, kernel 6.0.2-2), except that "watchdog watchdog_timeout=20" is added and I am running python3 -c "while 1:pass" on tty3 (see #7795; it makes dom0 unstable and more prone to the bug. I have successfully suspended and resumed ~60 times with dom0 idle. The loop uses only 100% of CPU out of the 800% available); smp=on (I can disable it if really necessary; however, I have tried turning it off with no luck).
OBSERVATION: the machine fails to resume; the serial output is scrambled at first; Xen responds at first but locks up when it attempts to dump the dom0 core. The watchdog is triggered.
@marmarek @DemiMarie @andyhhp
TEST CONDITION: same, with python3 -c "while 1:pass" running on one core.
OBSERVATION: after I entered sudo systemctl suspend, the computer was unresponsive for a while (it seems to be a kernel lag), and it reported that an ext4 write failed.
Then I suspended here, the computer lagged for a while, and then:
After a while I tried triple ctrl-a, and it worked. Surprisingly, triple ctrl-a followed by '*' (print all diagnostics) resolved the kernel freeze and let the computer sleep (Xen's "print all diagnostics" has since resolved dom0 hangs many times; I suspect this could hint at a workaround or at the cause of the soft lockup):
Followups:
When the computer resumes, the screen is on, but the computer does not respond to the keyboard. The xhci side can be switched between Xen and the kernel and 'h' works, but on '*' Xen gets stuck and the watchdog does not help either (the watchdog is always on, btw).
The HDD does not seem to be spinning: when I hard reset the machine, there is no sudden HDD retract sound (when Xen resets the machine, the HDD retracts suddenly).
Followup2
A similar issue happens on suspend/resume in the next boot. A dom0 kernel xen_safe_halt CPU soft lockup stack trace appears in dmesg (UPDATE: this trace appeared again, many times and with nearly the same stack trace, in a later experiment):
Followup3
After a resume-suspend, the machine locks up and is unable to perform the next resume; Xen fails to dump the culprit dom0 core as well.
By the way, here is a kernel trace that always appears on the first suspend of the machine. It may be worth fixing in order to eliminate the noise (this error always taints the kernel, but it does not seem to be a big issue).
When I am using picocom, sometimes picocom initially sends something automatically through the console (usually the first ~120 characters that picocom receives in that session) and the terminal becomes a mess. It is a disaster when the initial receiver on the console is Xen rather than dom0, since the data sent back is parsed as commands: Xen goes crazy and prints a lot of stuff, and finally, when it happens to parse an 'R', the computer halts. @marmarek do you have any idea how to avoid this? This behavior of picocom has been annoying me.
TEST CONDITION: same, with python3 -c "while 1:pass" on one core. OBSERVATION: after I entered sudo systemctl suspend, the computer was unresponsive for a while, and when I tried dumping dom0 registers from Xen, Xen (the console part) locked up while the Linux kernel resumed:
After a while (I executed sudo dmesg in tty2), Xen resumed on the console and the computer quickly went to sleep:
I wonder how the Xen command '0' works: it seems that sometimes when dom0 hangs, '0' can resolve the hang.
Followup:
I played with the command '0' and it seemed to resolve the hang;
I played with the command '0' again when dom0 was responsive, and then Xen hung with a real backtrace.
And the computer did not reboot despite Xen saying it would reboot in 5 seconds.
I tried the whole day on kernel 6.0.2-2 but still failed to trigger the resume failure case (I only triggered the suspend failure). I switched to 5.15.74 and after a few (around ten?) tries I finally triggered the exact bug (I believe). Fortunately (is it?) the bug behavior is similar to the older cases: after a triple ctrl-a, Xen responds to key presses, but when I press '0', Xen gets stuck on the dead dom0 core.
On resume, Xen writes to the serial port, but dom0 is not woken up and produces no log. Detail:
and then the Xen serial port is stuck forever (it is not disconnected though; Xen only disconnects the xhci serial port when the machine suspends; when it is stuck, the port stays connected)
The watchdog DOES NOT fire this time, even though I have checked in the log that I added it to the Xen command line parameters.
Anyway, I am 90% confident that this problem is exactly what I have been facing and struggling with for the past half year. I will try once more on 5.15.74 to confirm this.
Followup: I tried a second time and confirmed that the bug is reproducible and can be characterized. Later I will describe the bug concisely.
Here is a better log - this time xen watchdog is working and the computer reboots.
@andyhhp @marmarek @DemiMarie
Summary about the bug.
The bug is complex and can be triggered in various ways.
TRIGGER: on my Thinkpad L15 Gen2, the bug can be triggered on 6.0.2 and 5.15.74, in various ways.
On 6.0.2, when I repeatedly suspend and resume, there will usually be one time when suspend fails: one Xen CPU (CPU i) runs as some dom0 vcpu (d0vj) forever, and when I type '*' in the Xen console, Xen tries to read the registers of the j-th dom0 vcpu and hangs there; this hang can usually be caught by the watchdog. Dom0 usually does not respond to anything. (When I do not run anything, this bug is hard to trigger, though not impossible. However, in real life dom0 usually runs various things, including xorg, guid, the audio daemon and storage drivers, and any of them could make the suspend problem more likely to appear.)
On 5.15.74, when I repeatedly suspend and resume, there will usually be one time when resume fails, with the same behavior (one Xen CPU runs as one dom0 vcpu forever). Dom0 usually does not respond to anything. This is the original issue I reported, and it should apply from a very early 5.15 kernel version up to this version.
On 6.0.2, when I run python3 -c "while 1:pass" on another terminal and then suspend and resume, the Xen CPU hang (on suspend, but before dom0 performs the final step of sleeping) appears more frequently; however, such a hang can be resolved by the Xen command '0' (dump dom0 registers).
There are a great number of log files today along with tons of @-mentions; I am sorry if this bothers you. Let me know what more I can do.
Later I will go for the stuck RIP of d0vj and try to find out the call stack. Update: I give up on this. The Qubes OS kernel does not come with a vmlinux, and even though kaslr is disabled globally, it is hard to get a readable kernel backtrace from a list of integers. I can barely make out that in both versions, the function that hogs the CPU is smp_call_function_many_cond
Update2: I happened upon another way to determine the backtrace. The lockup is easier to trigger on 5.15.74, so I turned to this version and suspend-resumed until a lockup happened; then I used triple ctrl-a to switch to Xen and tried various keys (virtually all keys except '*', 'R', and '0'). One interesting key is 'N': it triggers an NMI. After pressing all those keys, I pressed '0' and was surprised to find that Xen did not lock up; instead the whole kernel log appeared.
Basically, the xen cpu
the kernel dump
This may indicate that whenever there is a lockup, I can first 'N' (trigger an NMI) then '0' (dump dom0 registers) to recover from it. (Why is '0' effective - it should be a vanilla function to print information, why can it unlock the cpus?)
Update3: I have tried once more and it shows that I can use '%' 'd' 'N' 'N' '0' in xen console to recover this lockup (lockup after resume; this does not apply for the other lockup happening before sleeping).
Update4: I tried once more and I found out that when the machine hangs on resuming, '0' itself is enough.
One more trace: 5.15.74-2, no tty3 python3 busy loop; lockup on suspending; Xen locks up forever when dumping dom0 registers; watchdog does not work.
hmm
(XEN) Xen call trace:
(XEN) [<ffff82d040224922>] R _spin_lock_irq+0x22/0x50
(XEN) [<ffff82d0402443a3>] S core.c#sched_wait_rendezvous_in+0xe3/0x2b0
(XEN) [<ffff82d04024467f>] S core.c#sched_slave+0x10f/0x270
(XEN) [<ffff82d040227b08>] S timer.c#timer_softirq_action+0x1d8/0x310
(XEN) [<ffff82d040223fca>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN) [<ffff82d0402e43dd>] S domain.c#idle_loop+0x7d/0xf0
(XEN) [<ffff82d0402e4360>] S domain.c#idle_loop+0/0xf0
(XEN)
(XEN) CPU6 @ e033:ffffffff811c9375 (0000000000000000)
(XEN) CPU7 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU0 @ e008:ffff82d040241deb (vcpu_sleep_sync+0xeb/0x140)
(XEN) CPU3 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU2 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU4 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU5 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) FATAL TRAP: vec 2, NMI[0000] IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
says that the CPU is waiting for the sibling to call in. In which case this is perhaps more likely related to the Xen S3/timer issue recently debugged and fixed upstream.
I would be glad to test the updated Xen build if it comes in time. However, it seems that the Xen watchdog does not only catch the backtrace you mentioned. For example (taken from the logs I posted earlier today):
(XEN) Xen call trace:
(XEN) [<ffff82d040224926>] R _spin_lock_irq+0x26/0x50
(XEN) [<ffff82d0402443a3>] S core.c#sched_wait_rendezvous_in+0xe3/0x2b0
(XEN) [<ffff82d040244915>] S core.c#schedule+0x135/0x250
(XEN) [<ffff82d040223fca>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN) [<ffff82d0402dd856>] S x86_64/entry.S#process_softirqs+0x6/0x20
(XEN)
(XEN) CPU7 @ e008:ffff82d04022493a (_spin_lock_irq+0x3a/0x50)
(XEN) CPU6 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU0 @ e008:ffff82d040241deb (vcpu_sleep_sync+0xeb/0x140)
(XEN) CPU2 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU4 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU3 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU5 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
or
(XEN) Xen call trace:
(XEN) [<ffff82d04024438d>] R core.c#sched_wait_rendezvous_in+0xcd/0x2b0
(XEN) [<ffff82d04024467f>] S core.c#sched_slave+0x10f/0x270
(XEN) [<ffff82d040223fca>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN) [<ffff82d0402e43dd>] S domain.c#idle_loop+0x7d/0xf0
(XEN) [<ffff82d0402e4360>] S domain.c#idle_loop+0/0xf0
(XEN)
(XEN) CPU2 @ 0033:7c0a2d30c3cd (0000000000000000)
(XEN) CPU0 @ e008:ffff82d040241ded (vcpu_sleep_sync+0xed/0x140)
(XEN) CPU3 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU5 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU4 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU7 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
(XEN) CPU6 @ e008:ffff82d04026c31b (mwait_idle_with_hints+0xfb/0x140)
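When comparing several watchdog dumps like the ones above, it can help to reduce each dump to "which CPU was in which function". The sketch below is one way to do that, assuming the per-CPU lines keep the exact `(XEN) CPUn @ seg:addr (symbol+off/size)` layout seen in these logs; the function names are hypothetical, and treating mwait_idle_with_hints as "idle" is an interpretation based on the traces quoted here.

```python
import re

# Matches lines such as:
# (XEN) CPU0 @ e008:ffff82d040241deb (vcpu_sleep_sync+0xeb/0x140)
CPU_LINE = re.compile(
    r"\(XEN\) CPU(\d+) @ ([0-9a-f]{4}):([0-9a-f]+) \(([^)]*)\)"
)

IDLE_SYMBOL = "mwait_idle_with_hints"


def cpu_locations(trace_lines):
    """Map CPU number -> symbol name (offset stripped) for each CPU line.

    CPUs running guest code show a raw value instead of a symbol; that
    value is kept unchanged.
    """
    where = {}
    for line in trace_lines:
        match = CPU_LINE.search(line)
        if match:
            cpu, _segment, _address, symbol = match.groups()
            where[int(cpu)] = symbol.split("+")[0]
    return where


def non_idle_cpus(trace_lines):
    """Return the sorted CPU numbers that are not sitting in the idle loop."""
    return sorted(
        cpu for cpu, symbol in cpu_locations(trace_lines).items()
        if symbol != IDLE_SYMBOL
    )
```

Applied to the dumps above, this would single out the CPUs in vcpu_sleep_sync, in _spin_lock_irq, or in guest context, which are the ones of interest for the rendezvous hang.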
I am experimenting with Xen 4.14.5-12.
Kernel 5.15.74: the same problem happens:
and then the laptop's built-in keyboard does not respond.
I tried running poweroff over the serial port, and then another function locked up in smp_call_function_many_cond. When I switched to Xen and tried '0', Xen itself got stuck on one vcpu.
Kernel 6.0.2: this time a problem similar to the one on kernel 5.15.74 happens. Log:
@andyhhp @marmarek
With the commit you mentioned (https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=625efe28ab5309ab83f7826ed1de4966ede2f191), I recompiled xen-4.14.5 using qubes-builder (tweaking the private.h part by hand to make the patch apply), then installed the resulting xen-4.14.5.gz and booted using this Xen build.
I have tested it on 5.15.74 and 6.0.2, with and without python3 -c "while 1:pass" running. The lockup problems do not seem to occur any more. I have tested 50 times on 5.15.74 without the python3 forever loop, 50 times with it, 5 times on 6.0.2 without the forever loop, 15 times with it, and around 10 times with many VMs open (sys-net, sys-firewall, and a simple appvm with firefox playing videos), and no lockup happened. I have 95% confidence that the various lockups have been solved by this patch.
This patch seems very promising. I will test more on later days.
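As a rough sanity check on the confidence figure above, one can ask how likely this many clean cycles would be if the bug were still present. The sketch below assumes every suspend/resume cycle fails independently at the rate reported earlier in this thread (once in 10 to 20 cycles) and takes the "around 10" runs as exactly 10; both are simplifying assumptions, and the function name is hypothetical.

```python
def prob_all_succeed(per_cycle_failure_rate, cycles):
    """Probability that `cycles` independent cycles all succeed."""
    return (1.0 - per_cycle_failure_rate) ** cycles


# Test counts reported above: 50 + 50 on 5.15.74, 5 + 15 on 6.0.2,
# and about 10 with several VMs running.
TOTAL_CYCLES = 50 + 50 + 5 + 15 + 10

if __name__ == "__main__":
    for rate in (1 / 10, 1 / 20):
        chance = prob_all_succeed(rate, TOTAL_CYCLES)
        print(f"failure rate {rate:.2f}: chance of {TOTAL_CYCLES} clean cycles = {chance:.2e}")
```

Under these assumptions, a run of 130 clean cycles would be very unlikely with the old failure rate, which is consistent with the patch having removed at least the dominant failure mode. It does not rule out rarer failures that need longer testing to show up.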
Additional bug that triggers once about every week. The original fix says that there are rare cases that it still does not work; however does that case match the problem I am facing?
Qubes OS release
R4.1
Brief summary
Suspension works most of the time on a Thinkpad L15 Gen 2 using the temporary patch in #7283, but it breaks occasionally (once in about 10 times; previously estimated at once a week) by being unable to wake up. The power light may blink or stay lit, the screen is off, the fan is working but the HDD is NOT spinning.
Later testing shows that it is unrelated to any appVM, but only to dom0 and Xen.
The same bug does not exist on LiveCD Ubuntu 22.04 or Windows.
old summary
Forcefully powering off the computer is possible, but the computer then does not boot when powered on again for several minutes; after that it recovers and boots up. As far as I can tell, ABSOLUTELY no logs are available in Xen or any VM. The problem is similar to when I attempted to disable Intel PTT (that time both Windows and Qubes OS did not wake up and were unable to boot for several minutes), only that it appears occasionally and I cannot trigger it in Windows. I have contacted Lenovo customer service and they say that it is related to either the graphics card driver or the power supply driver. I have installed the latest drivers under Windows, which may be the reason that Windows does not crash. Update: each problem occurs when waking up over 12 hours after suspend; not sure whether very frequent suspend & wake up will cause the problem.
Steps to reproduce
old steps to reproduce
1. On Thinkpad L15 Gen 2, enable "Linux S3". In R4.1, apply the patch in #7283.
2. Suspend (S3) and wake up several times.
sudo systemctl suspend
Expected behavior
Each time computer wakes up.
Actual behavior
old actual behavior
The computer wakes up many times, and then it does not wake up and needs to be powered off and on a number of times.
After an average of 10 successful resumes, the computer finally gets either:
The freeze does not generate any meaningful log info. The content of the various log files is exactly the state before the machine suspends, without any suspicious failure events.
This behavior is reproduced on xen-hypervisor from 4.14.4-2 to 4.14.5-7.
@andyhhp @marmarek Can you reproduce the issue?