Closed isodude closed 1 year ago
Problems with GPU reset because Xen and PSP do not work well together. I confirmed this was fixed in Xen 4.15+.
@marmarek can we backport the fix?
Xorg (Mesa, I guess) does not handle BOs during reset, or a reset during resume wreaks havoc with BOs(?). Either this is kernel related as well or a bug inside Mesa. When trying Arch as a base I could not get past these graphical artifacts. Oddly enough, the exact same errors happen on Vega when people play games, so it is something related to shaders. Maybe turning off 3D and using the latest of everything would work and help pin down the issue.
I don’t think this is Qubes OS related. Please report it to Mesa upstream.
@marmarek can we backport the fix?
I can look into it, but first I'd need to identify specific patch.
I compiled the latest of everything yesterday to try to get the same, but obviously did not get the same results as last time around. I will try a bit more to get a good working sample.
What is odd is that it matters a whole lot in which order initramfs loads the firmware and kernel modules. In a certain order suspend will not work at all.
Well it seems like I am dealing with this issue too :/
@isodude, do you have any known working qubes setup? From what I gathered in this thread xen 4.15 solves the PSP issue. Would there be any downside to simply running that? Are there any qubes specific patches needed, or can I just throw upstream xen 4.15 at qubes builder?
Xorg (Mesa, I guess) does not handle BOs during reset, or a reset during resume wreaks havoc with BOs(?)
I don’t think this is Qubes OS related. Please report it to Mesa upstream.
Any progress on this? Is there an upstream bug report? I haven't run into the issue yet, but if I ever manage to get past the PSP issue, I assume I will.
Problems with GPU reset because Xen and PSP do not work well together. I confirmed this was fixed in Xen 4.15+.
Can you do a bisection @isodude?
Where did we get with this? Is there anything to be done on our side?
This is the only issue keeping me from switching to Qubes full-time, would really love to help get this resolved.
Hi, looking at the logs, this device is likely a Thinkpad T14 AMD Gen 1 or equivalent, which received a new firmware recently. I am wondering if it improves the situation or not. The new version is 0.1.41, which resolves some issues with sleep. Link: https://support.lenovo.com/tr/en/downloads/ds544977-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-1-types-20ud-20ue
My current device is a 20ue which cannot resume from sleep, and I need a solution as well.
I tried with the newest firmware and kernel-latest (5.18) but I'm getting the infamous 'waiting for fences time out'.
It should be noted that the suspend/resume worked: I'm able to enter my password in the screensaver. It's after that moment that Xorg has a hard time rendering and after a while gives up. It seems that the kernel manages to actually reset the GPU though.
Regarding the Mesa error, it seems that the error 'Failed to initialize parser -125!' is related to the fact that the GPU got a reset and lost its VRAM, which the DE does not recover from at all (https://bugzilla.kernel.org/show_bug.cgi?id=205089). So the real issue with GPU reset may still exist even in Xen 4.15.
An interesting take from that thread is that GPU resets were solved by fixing memcpy for one specific person: https://gist.github.com/jnettlet/f6f8b49bb7c731255c46f541f875f436 Could there be something like that happening in Xen + PSP? It would make sense that the data is somehow scrambled after resume and that this triggers all sorts of madness afterwards.
I tried to get suspend working in my arch install again, but it flat out doesn't work anymore. Also, bisecting Xen between 4.14 and 4.15 was horrendous since it didn't compile in the version in between for some reason. I tried to run Xen 4.15 on Qubes R4.1 before but it was a mess, it may be easier now?
I tested a bit more, and suspend seems to work fine in console, which is very good. There are still artifacts after suspend. Starting X after suspend makes X crash. If I roll back linux-firmware I avoid a newly introduced bug.
With linux-firmware-20211216-127 I get this error:
Aug 03 03:05:25 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6012, emitted seq=6014
Aug 03 03:05:25 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3624 thread X:cs0 pid 3628
Aug 03 03:05:25 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Aug 03 03:05:25 dom0 kernel: [drm] free PSP TMR buffer
Aug 03 03:05:25 dom0 kernel: CPU: 0 PID: 11 Comm: kworker/u16:1 Tainted: G W 5.18.9-1.fc32.qubes.x86_64 #1
Aug 03 03:05:25 dom0 kernel: Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET72W(1.41 ) 06/27/2022
Aug 03 03:05:25 dom0 kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Aug 03 03:05:25 dom0 kernel: Call Trace:
Aug 03 03:05:25 dom0 kernel: <TASK>
Aug 03 03:05:25 dom0 kernel: dump_stack_lvl+0x45/0x5a
Aug 03 03:05:25 dom0 kernel: amdgpu_reset_reg_dumps.isra.0+0x13/0x93 [amdgpu]
Aug 03 03:05:25 dom0 kernel: amdgpu_do_asic_reset+0x27/0x3a6 [amdgpu]
Aug 03 03:05:25 dom0 kernel: amdgpu_device_gpu_recover_imp.cold+0x524/0x65b [amdgpu]
Aug 03 03:05:25 dom0 kernel: amdgpu_job_timedout+0x17a/0x1b0 [amdgpu]
Aug 03 03:05:25 dom0 kernel: drm_sched_job_timedout+0x76/0x110 [gpu_sched]
Aug 03 03:05:25 dom0 kernel: process_one_work+0x1e5/0x3b0
Aug 03 03:05:25 dom0 kernel: worker_thread+0x49/0x2e0
Aug 03 03:05:25 dom0 kernel: ? rescuer_thread+0x3a0/0x3a0
Aug 03 03:05:25 dom0 kernel: kthread+0xe7/0x110
Aug 03 03:05:25 dom0 kernel: ? kthread_complete_and_exit+0x20/0x20
Aug 03 03:05:25 dom0 kernel: ret_from_fork+0x22/0x30
Aug 03 03:05:25 dom0 kernel: </TASK>
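To compare runs across kernels and firmware versions, it can help to pull the numbers out of the timeout line in the log above. The following is a minimal sketch, not an existing tool: the function name `parse_ring_timeout` and the regular expression are illustrative assumptions. It reads the difference between the emitted and the signaled sequence number as the count of fences still outstanding when the timeout fired, which is an interpretation of the log rather than something stated in the thread.

```python
import re

# Matches the amdgpu scheduler timeout message as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding), or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # Assumption: emitted - signaled is the number of fences not yet signaled.
    return m.group("ring"), signaled, emitted, emitted - signaled


sample = (
    "Aug 03 03:05:25 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx timeout, signaled seq=6012, emitted seq=6014"
)
print(parse_ring_timeout(sample))
```

Feeding it each line of `journalctl -k` output and keeping the non-`None` results gives a quick per-boot summary of which ring hung.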
There are some reports around this, specifically about cleaning up old jobs. Maybe that is worth investigating. 5.19-rc7 should work properly according to https://gitlab.freedesktop.org/drm/amd/-/issues/2050. I'll stop investigating for today though.
Tried with kernel 5.19, no change.
@marmarek Is Xen supposed to handle ccp/psp? If not, ccp/psp both fail to initialize on my system. The module gets 0xffffffff.. back when reading the registers. I assume this is because Xen does not allow access to those memory regions. So I tried dom0-iommu=passthrough=1, and it failed with a blank screen followed by a reboot. I did not manage to catch the output even though I ran with console=vga vga=keep noreboot. Running with dom0-iommu=map-inclusive=1 did not help either. According to the documentation, the ACPI tables should tell which registers devices talk on, but I guess this is another bug? psp obviously works via the amdgpu driver. Not sure if it's worth bothering to debug the ccp/psp combo.
Suspend works in console btw, with the same text-jitter as I've experienced before. I would say this is on par with where my Arch Linux install was. I'm unable to start X properly after suspend, same effect as suspending while inside X.
@isodude would you be able to do a git bisect between Xen versions to figure out what exact Xen patch fixed the problem? @marmarek any suggestions for doing so without breaking dom0 in the process?
Oh right, actually finding the commit between xen 4.14.5 and 4.14.4 you mean, that would be easier! That I've done a couple of times so no problems.
@marmarek Is Xen supposed to handle ccp/psp?
I don't think so, it should be all up to dom0.
FWIW, I think the same issue applies to one system in our CI, in the suspend test: https://openqa.qubes-os.org/tests/44851#downloads (see suspend-journalctl.log). I can review historical logs, but AFAIR it never worked there.
This one has AMD Ryzen 5 4500U.
I did not manage to bisect xen 4.15 - 4.14,
Bisecting Xen between major versions is complicated because of the unstable ABI. Without rebuilding several more parts (toolstack, libvirt etc.) no VM will start.
Does the issue happen if no VM is running at all too? If so, testing different Xen versions will be much easier. As for Xen patches used in Qubes, most of them are for libxl; the hypervisor itself should work without any patches.
You could run Xen, suspend, start X, profit. I guess if I don't start lightdm I can run it without entering X and run the programs from console. Earlier I could trigger the problem by just starting Xen and then suspending, without entering X. That now works.
FWIW, I think the same issue applies to one system in our CI, in the suspend test: https://openqa.qubes-os.org/tests/44851#downloads (see suspend-journalctl.log). I can review historical logs, but AFAIR it never worked there. This one has AMD Ryzen 5 4500U.
This should be correct. Well that is awesome!
Ok, so. Do we have any prior art on how to use rmrr in Xen to allow Dom0 to write/read to ccp? I tried to fiddle around with it yesterday but I really have zero clue :)
IIUC dom0 should be able to write to any device by default (unless assigned to another domain). You can check the assignments with xl debug-keys Q
ccp is 0000:07:00.2.
[2022-08-04 13:21:39] (XEN) 0000:07:00.6 - d0 - node -1 - MSIs < 103 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.5 - d0 - node -1 - MSIs < 104 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.4 - d1 - node -1 - MSIs < 78 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.3 - d1 - node -1 - MSIs < 76 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.2 - d0 - node -1 - MSIs < 65 66 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.1 - d0 - node -1 - MSIs < 102 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.0 - d0 - node -1 - MSIs < 101 >
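To check ownership for many devices at once, the Q dump above can be turned into a lookup table. This is a rough sketch that assumes the line layout shown in this thread (`<BDF> - d<domid> - node ... - MSIs < ... >`); the helper name `parse_assignments` and the regular expression are made up for illustration, and other Xen versions may print the dump differently.

```python
import re

# One line per PCI device, as printed in the dump above:
#   "<seg:bus:dev.fn> - d<domid> - node <n> - MSIs < <irq> ... >"
DEVICE_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) - d(?P<domid>\d+) - "
    r".*MSIs < (?P<msis>[\d ]*)>"
)


def parse_assignments(text):
    """Map each PCI address to (owning domain id, list of MSI IRQ numbers)."""
    owners = {}
    for line in text.splitlines():
        m = DEVICE_RE.search(line)
        if m:
            irqs = [int(x) for x in m.group("msis").split()]
            owners[m.group("bdf")] = (int(m.group("domid")), irqs)
    return owners


dump = """\
[2022-08-04 13:21:39] (XEN) 0000:07:00.4 - d1 - node -1 - MSIs < 78 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.3 - d1 - node -1 - MSIs < 76 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.2 - d0 - node -1 - MSIs < 65 66 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.0 - d0 - node -1 - MSIs < 101 >
"""
owners = parse_assignments(dump)
print(owners["0000:07:00.2"])
```

For the ccp device this confirms what the dump shows: it is owned by d0 and uses IRQs 65 and 66, while 07:00.3 and 07:00.4 belong to d1.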
Sending xl debug-keys h revealed more info, and I ran the MSI dump (M):
[2022-08-04 13:35:27] (XEN) MSI 62 vec=41 fixed edge assert phys cpu dest=00000000 mask=0/ /?
[2022-08-04 13:35:27] (XEN) MSI 63 vec=49 fixed edge assert phys cpu dest=00000000 mask=0/ /?
[2022-08-04 13:35:27] (XEN) MSI-X 64 vec=2a fixed edge assert phys cpu dest=0000000c mask=1/ /0
[2022-08-04 13:35:27] (XEN) MSI-X 65 vec=91 fixed edge assert phys cpu dest=00000000 mask=1/HG/1
[2022-08-04 13:35:27] (XEN) MSI-X 66 vec=99 fixed edge assert phys cpu dest=00000000 mask=1/HG/1
[2022-08-04 13:35:27] (XEN) MSI-X 67 vec=d9 fixed edge assert phys cpu dest=0000000e mask=1/ /0
[2022-08-04 13:35:27] (XEN) MSI-X 68 vec=47 fixed edge assert phys cpu dest=00000000 mask=1/ /0
[2022-08-04 13:35:27] (XEN) MSI-X 69 vec=4e fixed edge assert phys cpu dest=0000000e mask=1/ /0
Looking through the code, H means host masked and G means guest masked.
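The H/G reading described above can be applied mechanically to the whole dump to list which interrupts are masked by whom. A small sketch under the assumption that the `mask=<bit>/<flags>/<state>` field always has the shape seen in this thread; `parse_msi_line` is a hypothetical helper, and the meaning of the first and last part of the field is deliberately left uninterpreted.

```python
import re

# Matches lines such as:
#   "(XEN) MSI-X 65 vec=91 fixed edge assert phys cpu dest=00000000 mask=1/HG/1"
MSI_RE = re.compile(
    r"(?P<kind>MSI(?:-X)?)\s+(?P<irq>\d+) vec=(?P<vec>[0-9a-f]{2})"
    r".*mask=(?P<maskbit>\d)/(?P<flags>[HG ]*)/(?P<state>\S)"
)


def parse_msi_line(line):
    """Return a dict describing one MSI dump line, or None if it does not match."""
    m = MSI_RE.search(line)
    if m is None:
        return None
    flags = m.group("flags")
    return {
        "kind": m.group("kind"),
        "irq": int(m.group("irq")),
        "host_masked": "H" in flags,   # H: masked by the host (Xen)
        "guest_masked": "G" in flags,  # G: masked by the guest
    }


lines = [
    "[2022-08-04 13:35:27] (XEN) MSI 62 vec=41 fixed edge assert phys cpu dest=00000000 mask=0/ /?",
    "[2022-08-04 13:35:27] (XEN) MSI-X 65 vec=91 fixed edge assert phys cpu dest=00000000 mask=1/HG/1",
]
for entry in map(parse_msi_line, lines):
    print(entry)
```

Filtering for entries where both flags are set reproduces the observation in this thread that only IRQs 65 and 66 carry both H and G.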
lspci -vvvv tells us:
07:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
Subsystem: Lenovo Device 5081
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin C routed to IRQ 36
Region 2: Memory at fd200000 (32-bit, non-prefetchable) [size=1M]
Region 5: Memory at fd3cc000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable- Count=1/2 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable+ Count=2 Masked-
Vector table: BAR=5 offset=00000000
PBA: BAR=5 offset=00001000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Kernel driver in use: ccp
Kernel modules: ccp
It's MSI-X enabled, but not maskable by us, which I assume is correct since Xen handles the masking. I don't understand why only those MSIs have both H and G while none of the others have it.
Can I jump in and provide any help? I have a Ryzen 5xxx series that also exhibits the same behaviour.
I'm trying out the xen cmdline rmrr=0xfd200000-0xfd2fffff=0:7:0.2 now, but it isn't recognized anywhere.
What I'm about to do is to bisect Xen to see where between 4.14.X and 4.14.5 the suspend in console started working properly. What versions are you running now and how does suspend work if you stop lightdm (or w/e login manager you have) and run systemctl suspend in console?
Running Xen 4.14.5, after qvm-shutdown --all and systemctl stop lightdm, resuming from a suspended state freezes the TTY. Can't switch to another, won't accept any input.
Actually, on my second attempt it resumed properly the way it was intended. I can send the relevant parts of the two boot logs if needed, for diffing.
Great, so we would need to check if the behavior is the same in Xen 4.14.1 through 4.14.4. If you would like to, you could check that. qubes-dom0-update --action downgrade can be used to downgrade. I think that we also need to downgrade all related xen packages (rpm -qa | grep xen | grep 4.14.5). I don't remember that well, but I think so anyhow :)
I just did a batch job and tested all the way down to Xen 4.14.1-1 (couldn't install 4.14.0-X). Suspend in console works, so I guess it's a kernel issue.
Just tested latest firmware + 5.19.0 (with buddy patches applied), and X does not lock up the whole desktop in a hard state. It still doesn't work, but I would say that's progress. I may need to do a run later to see if it's the same if I change Xen version. Maybe there's something.
With that said, here are the errors: dmesg.txt
Issue opened at freedesktop: https://gitlab.freedesktop.org/drm/amd/-/issues/2114
I tried out linux-firmware-20220804 (with amd firmware 22.20) and it still doesn't work, but the errors are different.
I also discovered that since I chose to compile with PMC=y my dracut failed. I fixed that, which might have some effect as well. journal-20220804.txt
Does the issue happen if no VM is running at all too? If so, testing different Xen versions will be much easier.
I did some tests. Here are results:
I have prepared a reliable test case (2 suspend attempts + dmesg | grep "PSP resume failed") so I can launch automated bisection between arbitrary Xen or Linux versions (it will take time, but it's fully automated so it isn't my time :) ). But for that I need at least one "good" version, which apparently we don't have...
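The check at the end of the test case described above could look roughly like this when wired into `git bisect run`, which treats exit code 0 as good, 125 as "skip this revision", and other codes from 1 to 127 as bad. This is only a sketch of the idea, not the script actually used here: the function name `classify` and the `testable` flag are assumptions, and the suspend attempts themselves are left out.

```python
# Message that marks a failed resume in the kernel log, as quoted in the thread.
MARKER = "PSP resume failed"

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(dmesg_text, testable=True):
    """Map one test run to an exit code for `git bisect run`.

    `testable` is a hypothetical flag for revisions that could not be
    built or booted at all; those are skipped rather than marked bad.
    """
    if not testable:
        return SKIP
    return BAD if MARKER in dmesg_text else GOOD


failing_log = "amdgpu 0000:07:00.0: amdgpu: PSP resume failed"
clean_log = "amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0"
print(classify(failing_log), classify(clean_log), classify("", testable=False))
```

A wrapper script would run the two suspend attempts, capture `dmesg`, and exit with the returned code.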
I'm trying out the xen cmdline rmrr=0xfd200000-0xfd2fffff=0:7:0.2 now, but it isn't recognized anywhere.
This is AMD, so it's called ivmd=...
Right, so that was introduced in xen 4.16. I'll see what I can do here.
@marmarek I tried to backport the ivmd patches, but there are a lot of patches, so I'm giving up for now.
xen_rm_opts="edd=off ivmd=fd200-fd2ff=0:07:0.2;fd3cc-fd3cd=0:07:0.2 loglvl=debug"
Maybe try out the test bench and xen master with the above?
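The ranges in the ivmd option above line up with the two memory regions of 07:00.2 from the lspci output (fd200000, 1M and fd3cc000, 8K) once the addresses are converted to 4 KiB page frame numbers. The sketch below shows that arithmetic. It assumes, based on the values used in this thread, that ivmd takes inclusive page frame numbers in hex; the helper names are made up for illustration.

```python
PAGE_SHIFT = 12  # 4 KiB pages


def bar_to_pfn_range(base, size):
    """Return the inclusive (first, last) page frame numbers covering a BAR."""
    first = base >> PAGE_SHIFT
    last = (base + size - 1) >> PAGE_SHIFT
    return first, last


def ivmd_option(regions, bdf):
    """Build an ivmd= string for (base, size) regions of one device."""
    parts = []
    for base, size in regions:
        first, last = bar_to_pfn_range(base, size)
        parts.append(f"{first:x}-{last:x}={bdf}")
    return "ivmd=" + ";".join(parts)


# Region 2 and Region 5 of 07:00.2, taken from the lspci output in this thread.
regions = [(0xFD200000, 1 << 20), (0xFD3CC000, 8 << 10)]
print(ivmd_option(regions, "0:07:0.2"))
```

Because the string contains a semicolon, it has to be quoted carefully in the grub configuration, or only the first range reaches Xen.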
I tried, but the issue is still there. I also got this warning:
(XEN) AMD-Vi: Warning: IVMD: [fd200000,fd300000) is not (entirely) in reserved memory
EDIT: after fixing quoting in grub, I get it for both ranges
Do you see any change in journalctl -b | grep ccp?
Interestingly, adding drm.debug=1 to Linux dom0 makes S3 succeed more times (3-4 successful), but it still eventually fails. Some race condition then? Lack of synchronization between gpu and ccp resume?
Would not even surprise me. It's a job that is getting stuck in the amdgpu driver somewhere. That thing is a big mess and their fixes are tryhard at best.
What kernel are you using now?
5.10.61 right now, but AFAIR same happened on newer too. openQA tested at least 5.15.57. I'll test newer one again to be sure.
Interesting things happen in 5.17, so kernel-latest-5.18 is quite interesting.
I'm trying to build linux kernel with gcc 12 to get KCSAN.
I've tried 5.19, the issue is still there, looks exactly the same
Booting kernel compiled with gcc 12.1 and KCSAN support made Xen crash with page fault in x86_64/entry.S#create_bounce_frame+0x135/0x157. Saw that there's one issue in 2017 in grub with almost the same problem but other than that seems like I'm doing something wrong :)
This might be very related, regarding "tee init command failed": https://lists.xenproject.org/archives/html/xen-devel/2022-06/msg01383.html
I tried this quick workaround but it still failed to init:
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: enabling device (0000 -> 0002)
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: ccp: unable to access the device: you might be running a broken BIOS.
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: tee: ring init command failed (0x00000005)
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: tee: failed to init ring buffer
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: tee initialization failed
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: psp initialization failed
I also added dom0-iommu=strict iommu=verbose to Xen, in the hope of getting some IOMMU page fault if this is really about the address given to the device, but no failure got logged.
As for the "broken BIOS" message, it can be relevant. I'll try newer firmware, but I don't have high hopes. I haven't found any related option in firmware settings.
I see psp driver also has similar check that IIUC happens before ccp one, and it doesn't fire on this machine, so at least one thing appears to work...
When I run the same Linux without Xen, I still get the "broken BIOS" message but then tee enabled and psp enabled. (still old firmware)
I'm trying to patch tee-dev.c such that ring init does not fail, will probably spend some time with it tonight. Join me if you would like :)
So, I found one more place that needs similar patch, https://gist.github.com/marmarek/0d14a340d7045f21d8c1c35ccad4c6c4. Now I get:
[ 12.775538] ccp 0000:04:00.2: enabling device (0000 -> 0002)
[ 12.775681] xen: registering gsi 36 triggering 0 polarity 1
[ 12.775692] Already setup the GSI :36
[ 12.776025] ccp 0000:04:00.2: ccp: unable to access the device: you might be running a broken BIOS.
[ 12.786222] ccp 0000:04:00.2: tee enabled
[ 12.786243] ccp 0000:04:00.2: psp enabled
But the resume issue is still there
@marmarek looks like __psp_pa() is broken on Xen. Better to change it once than everywhere.
Oh, better to override the #define you mean. Well, it worked on my system rocking Xen 4.14.5, so it will make it better for folks anyhow. No resume still though. I'm trying to get my kernel to work properly now with some more kernel options. I'm running a much later linux-firmware so I hope that will yield some different results.
(Thanks a lot for your efforts on this guys.)
Compiled the patch I commented on in the amd issue and managed to boot it such that psp is enabled. I also compiled with CONFIG_AMDTEE. Same problem.
I will try with drm.debug to see if I can reproduce what marmarek noticed. I figure that we may have some problem in dma_fence in amdgpu.
Could this be related?
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -34,6 +34,7 @@
#include <linux/mman.h>
#include <linux/file.h>
#include <linux/pm_runtime.h>
#include "amdgpu_amdkfd.h"
#include "amdgpu.h"
@@ -1969,7 +1970,7 @@ int kfd_reserved_mem_mmap(struct kfd_dev *dev, struct kfd_process *process,
| VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP;
/* Mapping pages to user process */
return remap_pfn_range(vma, vma->vm_start,
PFN_DOWN(__pa(qpd->cwsr_kaddr)),
Solved as of linux-firmware-20230123-135.fc32.noarch, xen-4.14.5-20.fc32.x86_64, and kernel-latest-6.2.10-1.qubes.fc32.x86_64
Qubes OS release
R4.1, kernel 5.14.7-1 (fedora 5.14) (same behavior in lower kernels.) XEN 4.14.3 (build from @marmarek branch)
Brief summary
Laptop does not resume after the third sleep/resume cycle. The problem seems to be with
It feels like there's a hung process in the amdgpu drivers for some reason.
Not sure how to debug this properly, XEN is not giving me much info at all. The problem is visible with X started as well obviously but I try to make the bug surface smaller.
Steps to reproduce
Boot the laptop with X disabled and no VMs started. Run systemctl suspend three times (resuming each time). Run reboot to restore the system.
Expected behavior
It should be possible to suspend an unlimited number of times.
Actual behavior
Screen does not wake up on third resume. It's possible to type reboot and restart.
Notes
Works well with kernel booted without XEN. crash.filtered.log crash.filtered.xen.log
Workarounds
A bit more testing is needed but I do have sort of stable suspend/resume now. It even survives when everything goes south. There's a bit of tearing, but I'd rather have suspend than tearing.
Compile xorg-x11-drv-amdgpu from https://github.com/freedesktop/xorg-xf86-video-amdgpu
Run make install and install amdgpu_drv.so in /usr/lib64/xorg/modules/drivers on dom0.
For more stability run with kernel cmdline preempt=none
Do note that e.g. 4k external screen will be royally sluggish.
Sometimes the screen turns up black, type in the password anyhow and switch to tty2 and back again / suspend-resume again and it will most likely come to life again. Suspend/resume too fast could lead to instant reboot.