QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Screen does not wake up after resume (AMD Ryzen 7 Pro 4750U) #6923

Closed isodude closed 1 year ago

isodude commented 3 years ago

Solved as of

linux-firmware-20230123-135.fc32.noarch xen-4.14.5-20.fc32.x86_64 kernel-latest-6.2.10-1.qubes.fc32.x86_64

Qubes OS release

R4.1, kernel 5.14.7-1 (fedora 5.14) (same behavior in lower kernels.) XEN 4.14.3 (build from @marmarek branch)

Brief summary

The laptop does not resume after the third sleep/resume cycle. The problem seems to be with

[drm] psp command (0x7) failed and response status is (0xFFFF0007)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!

It feels like there's a hung process in the amdgpu drivers for some reason.

I'm not sure how to debug this properly; Xen is not giving me much information at all. The problem is also visible with X started, obviously, but I'm trying to keep the bug surface small.

Steps to reproduce

1. Boot the laptop with X disabled and no VMs started.
2. Run systemctl suspend and resume, three times.
3. Run reboot to restore the system.
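
The reproduction steps above could be scripted roughly as follows. This is a minimal sketch, not something from the original report: the function names psp_failed and suspend_cycles are made up for illustration, the 20-second pause is an arbitrary guess at how long a manual resume takes, and it assumes journalctl -k -b shows the dom0 kernel log for the current boot.

```shell
# Hypothetical helper: succeed if the kernel log on stdin contains the
# PSP load failure quoted in the summary above.
psp_failed() {
    grep -q 'psp_hw_start.*PSP load .* failed'
}

# Hypothetical helper: suspend N times (default 3) and stop at the first
# resume that logged a PSP failure. The resume itself is triggered by hand.
suspend_cycles() {
    n="${1:-3}"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "cycle $i: suspending"
        systemctl suspend
        sleep 20    # arbitrary: time to suspend and be resumed manually
        if journalctl -k -b | psp_failed; then
            echo "cycle $i: PSP failure logged after resume"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $n cycles resumed without a PSP failure"
}
```

Nothing runs until suspend_cycles is called, e.g. suspend_cycles 3 from a root console in dom0.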

Expected behavior

It should be possible to suspend and resume an unlimited number of times.

Actual behavior

The screen does not wake up on the third resume. It is still possible to type reboot and restart the machine.

Notes

Works well with the kernel booted without Xen.

Attached logs: crash.filtered.log crash.filtered.xen.log

Workarounds

A bit more testing is needed but I do have sort of stable suspend/resume now. It even survives when everything goes south. There's a bit of tearing, but I'd rather have suspend than tearing.

cat << EOF > /etc/X11/xorg.conf.d/50-video.conf
Section "Device"
    Identifier "card0"
    Driver "amdgpu"
    Option "AccelMethod" "none"
EndSection
EOF

Compile xorg-x11-drv-amdgpu from https://github.com/freedesktop/xorg-xf86-video-amdgpu. Run make install and install amdgpu_drv.so in /usr/lib64/xorg/modules/drivers on dom0.

For more stability run with kernel cmdline preempt=none
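
A minimal sketch of adding such a parameter in dom0, assuming the kernel command line lives in the GRUB_CMDLINE_LINUX line of /etc/default/grub (the usual place on a GRUB-based install; check your own system first). The function name add_kernel_param is hypothetical, and it only prints the modified file so the result can be reviewed before anything is overwritten.

```shell
# Hypothetical helper: append a parameter to GRUB_CMDLINE_LINUX in the given
# file unless it is already there. Prints the result on stdout and leaves the
# original file untouched.
add_kernel_param() {
    file="$1"
    param="$2"
    if grep -q "^GRUB_CMDLINE_LINUX=.*[\" ]${param}[\" ]" "$file"; then
        cat "$file"     # already present, pass the file through unchanged
    else
        sed "s|^\(GRUB_CMDLINE_LINUX=\".*\)\"|\1 ${param}\"|" "$file"
    fi
}

# Example (review the output before replacing the real file):
#   add_kernel_param /etc/default/grub preempt=none > /tmp/grub.new
```

After copying the reviewed file back, the GRUB configuration still has to be regenerated with grub2-mkconfig; the output path differs between legacy BIOS and UEFI installs, so it is left out here.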

Do note that e.g. 4k external screen will be royally sluggish.

Sometimes the screen comes up black. Type in the password anyway, then switch to tty2 and back again, or suspend/resume again, and it will most likely come back to life. Suspending and resuming too quickly can lead to an instant reboot.

isodude commented 3 years ago

The xen_acpi_processor hypervisor errors (-19) for the ACPI CPUs go away if I boot the kernel with nosmt, obviously.

In the console with lightdm never started it can survive at least 5-6 suspend-resume-cycles now.

Now compiling the kernel with CONFIG_DRM_AMD_DC_HDCP=n CONFIG_HSA_AMD_SVM=n CONFIG_AMD_MEM_ENCRYPT=n

isodude commented 3 years ago

There is a problem with installing xorg-x11-drv-amdgpu: X won't start, with errors about unwind information not existing. I tried installing kernel-devel to make the amdgpu driver happy, but it did not work out.

isodude commented 3 years ago

After compiling the kernel without the flags mentioned above, I could keep suspending and resuming a lot longer.

When X is running it still dies on 'failed to terminate hdcp ta', though.

I still can't get the xorg amdgpu driver to work, even when I boot older kernels.

isodude commented 3 years ago

For those wondering how to build xen, here is my builder.conf.

# Since it's a very upstream branch
INSECURE_SKIP_CHECKING = vmm-xen
GIT_URL_vmm_xen = https://github.com/marmarek/qubes-vmm-xen
BRANCH_vmm_xen = update-4.14.3
COMPONENTS = \
builder \
builder-rpm \
vmm-xen

BUILDER_PLUGINS += builder-rpm

isodude commented 3 years ago

The amdgpu xorg driver now works with xorg-x11-drv-amdgpu-21.0.0-1 (https://fedora.pkgs.org/33/fedora-updates-x86_64/xorg-x11-drv-amdgpu-21.0.0-1.fc33.x86_64.rpm.html), but it is not stable during suspend/resume and does not remove the jitter after resume.

johnnyboy-3 commented 3 years ago

Thanks for the help isodude. Tried xen 4.14.3 and kernel 5.13.13 and resuming from suspend is still broken (Ryzen 2400G). smt is off.

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:462 switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Modules linked in: loop nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 snd_hda_codec_realtek cfg80211 snd_hda_codec_gener>
dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.13-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: RIP: e030:switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Code: 00 00 65 48 89 05 e7 8f fa 7e e9 77 fd ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 0f 30 e9 57 fd ff ff 41 89 f6 e9 9d fe ff ff <0f> 0b e8 >
dom0 kernel: RSP: e02b:ffffc900400afeb8 EFLAGS: 00010006
dom0 kernel: RAX: 000000000ea3c000 RBX: ffff8881002c4f00 RCX: 0000000000000040
dom0 kernel: RDX: ffff8881002c4f00 RSI: 0000000000000000 RDI: ffff88808ea3c000
dom0 kernel: RBP: ffffffff829d84e0 R08: 0000000000000000 R09: 0000000000000000
dom0 kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff888100236a40
dom0 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
dom0 kernel: FS: 0000000000000000(0000) GS:ffff888127240000(0000) knlGS:0000000000000000
dom0 kernel: CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033
dom0 kernel: CR2: 00005bde388bd0e8 CR3: 0000000002810000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: switch_mm+0x1c/0x30
dom0 kernel: play_dead_common+0xa/0x20
dom0 kernel: xen_pv_play_dead+0xa/0x60
dom0 kernel: do_idle+0xd1/0xe0
dom0 kernel: cpu_startup_entry+0x19/0x20
dom0 kernel: asm_cpu_bringup_and_idle+0x5/0x1000
dom0 kernel: ---[ end trace 75177836fdaa3aca ]---
...
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: [drm] psp command (0x5) failed and response status is (0x0)
dom0 kernel: [drm:psp_hw_start [amdgpu]] ERROR PSP load tmr failed!
dom0 kernel: [drm:psp_resume [amdgpu]] ERROR PSP resume failed
dom0 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] ERROR resume of IP block failed -22
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
dom0 kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
dom0 kernel: amdgpu 0000:06:00.0: PM: failed to resume async: error -22

johnnyboy-3 commented 3 years ago

with smt on:

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:462 switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hdacodec>
dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.13-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: RIP: e030:switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Code: 00 00 65 48 89 05 e7 8f fa 7e e9 77 fd ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 0f 30 e9 57 fd ff ff 41 89 f6 e9 9d fe ff ff <0f> 0b e8 >
dom0 kernel: RSP: e02b:ffffc900400afeb8 EFLAGS: 00010006
dom0 kernel: RAX: 00000001023e0000 RBX: ffff8881002c8000 RCX: 0000000000000040
dom0 kernel: RDX: ffff8881002c8000 RSI: 0000000000000000 RDI: ffff8881823e0000
dom0 kernel: RBP: ffffffff829d84e0 R08: 0000000000000000 R09: 0000000000000000
dom0 kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffff88810a523300
dom0 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
dom0 kernel: FS: 0000000000000000(0000) GS:ffff888127240000(0000) knlGS:0000000000000000
dom0 kernel: CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033
dom0 kernel: CR2: 00007202ec011726 CR3: 0000000002810000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: switch_mm+0x1c/0x30
dom0 kernel: play_dead_common+0xa/0x20
dom0 kernel: xen_pv_play_dead+0xa/0x60
dom0 kernel: do_idle+0xd1/0xe0
dom0 kernel: cpu_startup_entry+0x19/0x20
dom0 kernel: asm_cpu_bringup_and_idle+0x5/0x1000
dom0 kernel: ---[ end trace 38fb75148761bdb4 ]---
...
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 4 spinlock event irq 85
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 5 spinlock event irq 91
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 6 spinlock event irq 97
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 7 spinlock event irq 103
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=8448, emitted seq=8450
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 3765 thread X:cs0 pid 3839
...
dom0 kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring gfx test failed (-110)
dom0 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] ERROR resume of IP block failed -110
...
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
...
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered

isodude commented 3 years ago

@johnnyboy-3 do you have xorg-x11-drv-amdgpu installed?

johnnyboy-3 commented 3 years ago

xorg-x11-drv-amdgpu v19.1.0-3 installed.

Also tried Linux Kernel 5.14.9-1 with the same bug. This time with new errors in journalctl on resume:

dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=9917, emitted seq=9919
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 3038 thread X:cs0 pid 3819
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
dom0 kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_2.1.0 test failed (-110)
dom0 kernel: [drm] free PSP TMR buffer
dom0 kernel: [drm] psp command (0x7) failed and response status is (0x0)
dom0 kernel: [drm:psp_suspend [amdgpu]] ERROR Failed to terminate tmr
dom0 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] ERROR suspend of IP block failed -22
dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 3 PID: 4326 at include/drm/ttm/ttm_bo_api.h:580 amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 rfkill libarc4 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec intel_rapl_msr intel_rapl_common snd_hda_core snd_hwdep joydev snd_seq snd_seq_device snd_pcm snd_timer snd soundcore wmi_bmof r8169 pcspkr sp5100_tco i2c_piix4 k10temp gpio_amdpt gpio_generic wmi video xenfs fuse ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt trusted asn1_encoder amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 ccp gpu_sched i2c_algo_bit drm_kms_helper cec drm xhci_pci xhci_pci_renesas xhci_hcd xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
dom0 kernel: CPU: 3 PID: 4326 Comm: kworker/3:4 Tainted: G W 5.14.9-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
dom0 kernel: RIP: e030:amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Code: 75 25 48 8b bd 48 01 00 00 48 85 ff 74 05 e8 3d e2 5e c1 48 8b 85 c0 01 00 00 8b 40 10 83 f8 02 74 24 83 f8 01 74 0d 5b 5d c3 <0f> 0b 8b 85 04 02 00 00 eb ca 48 8b 85 30 01 00 00 f0 48 29 83 50
dom0 kernel: RSP: e02b:ffffc9004242fcb0 EFLAGS: 00010246
dom0 kernel: RAX: 0000000000000000 RBX: ffff88810d385288 RCX: 0000000000000000
dom0 kernel: RDX: ffff888013cc8000 RSI: 0000000000000000 RDI: ffff88810a717800
dom0 kernel: RBP: ffff88810a717800 R08: 0000000000000003 R09: 000000000036d488
dom0 kernel: R10: ffffc9004242fad8 R11: ffffffff82947168 R12: ffff88810d385288
dom0 kernel: R13: ffff88810a717800 R14: ffff888107321c00 R15: 0000000000000000
dom0 kernel: FS: 0000000000000000(0000) GS:ffff8881272c0000(0000) knlGS:0000000000000000
dom0 kernel: CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
dom0 kernel: CR2: 000074eab8891fb8 CR3: 0000000107554000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: amdgpu_gart_table_vram_unpin+0x54/0xc0 [amdgpu]
dom0 kernel: gmc_v9_0_hw_fini+0x5f/0x80 [amdgpu]
dom0 kernel: amdgpu_device_ip_suspend_phase2+0xc5/0x150 [amdgpu]
dom0 kernel: amdgpu_device_ip_suspend+0x32/0x60 [amdgpu]
dom0 kernel: amdgpu_device_pre_asic_reset+0xa8/0x250 [amdgpu]
dom0 kernel: amdgpu_device_gpu_recover.cold+0x53d/0x78e [amdgpu]
dom0 kernel: amdgpu_job_timedout+0x17a/0x1a0 [amdgpu]
dom0 kernel: drm_sched_job_timedout+0x74/0x110 [gpu_sched]
dom0 kernel: process_one_work+0x1ec/0x390
dom0 kernel: worker_thread+0x4a/0x320
dom0 kernel: ? process_one_work+0x390/0x390
dom0 kernel: kthread+0x10f/0x130
dom0 kernel: ? set_kthread_struct+0x40/0x40
dom0 kernel: ret_from_fork+0x22/0x30
dom0 kernel: ---[ end trace d480e2c68621aa89 ]---
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
dom0 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15dd
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
dom0 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -6
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered
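
Journals like the ones above are easier to compare between kernels when they are boiled down to the distinct error messages. The following is only a sketch under assumptions: the function name summarize_gpu_errors is made up, and the match pattern is based purely on the messages quoted in this thread, so it will miss errors that are worded differently.

```shell
# Hypothetical helper: read journal lines on stdin and print each distinct
# ERROR / failed-PSP-command message with the number of times it occurred,
# most frequent first.
summarize_gpu_errors() {
    grep -E 'ERROR|psp command .* failed' \
        | sed -E 's/^.*kernel: //' \
        | sort | uniq -c | sort -rn
}

# Example: journalctl -k -b | summarize_gpu_errors
```

Messages that embed sequence numbers (such as the ring gfx timeout lines) are counted separately, since they differ textually.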

isodude commented 3 years ago

@johnnyboy-3 the correct kernel parameter should be nosmt btw. It's odd that xen_acpi_processor tries to send updates to XEN on thread number 2 on each processor, even though the kernel is booted with nosmt. It even says SMT: Disabled in boot.

marmarek commented 3 years ago

I think kernel doesn't have full knowledge which thread is running where, only Xen has direct access to that info. And in fact vcpu 2 of dom0 doesn't necessarily run on physical core/thread 2. This also means "nosmt" kernel option is not an effective mitigation against speculative execution bugs, when running under Xen.
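
Since only Xen has the full picture, one way to look at the topology from the hypervisor's side is xl info in dom0, whose output includes a threads_per_core field. A minimal sketch that extracts it; the function name xen_threads_per_core is made up for illustration, and the sketch makes no assumption about how the value changes when SMT is turned off.

```shell
# Hypothetical helper: read `xl info` output on stdin and print the value of
# the threads_per_core field.
xen_threads_per_core() {
    awk -F: '$1 ~ /^threads_per_core[[:space:]]*$/ {
        gsub(/[[:space:]]/, "", $2)
        print $2
    }'
}

# Example (dom0, as root): xl info | xen_threads_per_core
```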

isodude commented 3 years ago

Cool, so like a normal VM then. So xen_acpi_processor trying to send up information about 16 cores can just be ignored.

Trying to understand and pin down exactly what makes the amdgpu drivers flip the switch and die on me when resuming, not sure which avenues are best to visit any longer in the debugging hunt.

isodude commented 3 years ago

I just tried out kernel 5.15-rc5 and it's still the same behavior, however I had the laptop in sleep for the whole night and it woke up fine. Still this thing with artifacts around text sometimes when text is written to the screen.

I did make one change though: I moved ati_drv.so away from /usr/lib64/xorg/modules/drivers, and I feel that Xorg behaves much better now, even though I can't see any direct differences in Xorg.0.log. I managed a solid three suspend/resume cycles before the amdgpu driver gave up on the SETUP_TMR command (which is now written out in the log thanks to the newer kernel).

Just a note: one thing I'm concerned about is that I need to revert (PCI/MSI: Use new mask/unmask functions). Somewhere between 5.15-rc1 and 5.15-rc2 it was fixed, but between rc2 and rc4 it broke again. I do have to bisect this. Since amdgpu dies hard on this, maybe it's a bug in their driver that just surfaces with the new mask/unmask functions.

The error that the kernel dies on this time is

[drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed

I'm not sure whether this is the culprit, or whether amdgpu just fails to load firmware on resume sometimes; I've seen HDCP fail as well. I've tried to disable TMR (Trusted Memory Region) by setting CONFIG_AMD_MEM_ENCRYPT=n. TBH I don't know what Xen's standpoint is on those features, maybe @marmarek knows? But in general the kernel dies on HDCP and TMR.

isodude commented 3 years ago

Yay, latest kernel-ark with

CONFIG_SND_SOC_AMD_RENOIR=n
CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n
CONFIG_HSA_AMD_SVM=n
CONFIG_AMD_MEM_ENCRYPT=n

booting with kernel options pci=nomsi

Now it actually suspends/resumes correctly.

Attached is lspci -vv with Enable+ selected. lscpi-msi.log

johnnyboy-3 commented 3 years ago

Tried dom0 linux kernel 5.10.61 recompilation with mentioned kernel & boot options on R4.1 - no luck.

isodude commented 3 years ago

@johnnyboy-3 I guess you need to be past the new MSI mask/unmask patches (somewhere between 5.14 and 5.15). I tried 5.12.14 and it was no go there. I can update my linux-kernel-tree if you'd like.

I did manage to get a crash, around the 10th to 15th resume, pretty much when the USB ports were reset. It feels like the problem may be in how USB is handled. I try to ignore 02:00.4 (the USB ports in the expansion port), but I lack the expertise to tell Xen to just ignore them. Soon I'll rip out ehci from the kernel :)

Looking at my lspci log it seems that xhci and ehci got MSI disabled, but not the other AMD PCI devices.
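
To pull that information out of an lspci -vv dump without reading it by eye, something like the following could work. It is a sketch under assumptions: the function name msi_enabled_devices is made up, and it relies on device header lines starting in the first column while capability lines are indented, which is how lspci -vv normally prints them.

```shell
# Hypothetical helper: read `lspci -vv` output on stdin and print the header
# line of every device that has MSI or MSI-X enabled (Enable+).
msi_enabled_devices() {
    awk '
        /^[0-9a-f]/ { dev = $0; printed = 0 }
        /MSI(-X)?: Enable[+]/ && !printed { print dev; printed = 1 }
    '
}

# Example (dom0, as root): lspci -vv | msi_enabled_devices
```

Running it once with and once without pci=nomsi should show which devices the option actually affects.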

isodude commented 3 years ago

15h sleep with 0.277Wh, that's pretty solid for S3! 5.15 Worked with these kernel configs and pci=nomsi.

CONFIG_AMD_PMC=y
CONFIG_HSA_AMD=n

These were set but I don't think they make any difference.

CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n

Text-jitter is almost gone completely compared to before.

I am going to compile 5.14.9 and see how well that fares with CONFIG_AMD_PMC=y CONFIG_HSA_AMD=n, because there's no need to disable MSI there. Then I'm going to bisect the problems with MSI in 5.15.

johnnyboy-3 commented 3 years ago

That's some good news!

I can update my linux-kernel-tree if you'd like.

Thanks for your offer but I don't think that's necessary for now. I wonder if this problem can be fixed on older kernels in Qubes R4.0 too.

isodude commented 3 years ago

5.14.9 doesn't work that well out of the box. With pci=nomsi it's quirky (the external screen dies sometimes, the internal screen dies sometimes), but I've suspended/resumed at least 10 times now without a reboot. Not as well as it works in 5.15 with pci=nomsi, though.

This is 5.14.9 (latest qubes-linux-kernel) with

CONFIG_AMD_PMC=y
CONFIG_HSA_AMD=n

Will try to get tip booted without pci=nomsi now, that should be fun!

isodude commented 3 years ago

Thanks for your offer but I don't think that's necessary for now. I wonder if this problem can be fixed on older kernels in Qubes R4.0 too.

I'm pessimistic! There are a lot of changes between those kernels and the new ones.

isodude commented 3 years ago

With some patches in msi drivers I got kernel 5.15 working.

X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers.

bigdx commented 3 years ago

Progress, yeah! ^^

With some patches in msi drivers I got kernel 5.15 working.

X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers.

Are you running a clean R4.1 RC1, or did you add/change anything besides the modified kernel 5.15 and the MSI drivers? The kernel is self-compiled with CONFIG_AMD_PMC=y and CONFIG_HSA_AMD=n, right? What MSI patches? Anything else?

I tried RC1 out of the box and with kernel-latest 5.14.10 (testing) but same issue as before, just to be sure ^^

isodude commented 3 years ago

builder.conf:

GIT_URL_linux_kernel = https://github.com/isodude/qubes-linux-kernel
BRANCH_linux_kernel = devel-5.15

I don't get how to make make get-sources work properly, so I download the source manually instead.

wget https://gitlab.com/cki-project/kernel-ark/-/archive/v5.15-rc5/kernel-ark-v5.15-rc5.tar.bz2

Unpack it, rename the folder to linux-5.15-rc5, and pack it again as .tar.
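
The unpack, rename and repack steps could be wrapped up like this. It is a sketch under assumptions: the function name repack_as_tar is made up, the top-level directory inside the GitLab archive is assumed to be kernel-ark-v5.15-rc5, and the output name linux-5.15-rc5.tar is a guess at what the builder expects.

```shell
# Hypothetical helper: unpack an archive, rename its top-level directory and
# repack it as an uncompressed .tar in the current directory.
repack_as_tar() {
    archive="$1"    # e.g. kernel-ark-v5.15-rc5.tar.bz2
    old_dir="$2"    # assumed top-level directory inside the archive
    new_dir="$3"    # e.g. linux-5.15-rc5
    work=$(mktemp -d) || return 1
    tar -xf "$archive" -C "$work" &&
        mv "$work/$old_dir" "$work/$new_dir" &&
        tar -cf "$new_dir.tar" -C "$work" "$new_dir"
    status=$?
    rm -rf "$work"  # clean up the scratch directory either way
    return "$status"
}

# Example:
#   repack_as_tar kernel-ark-v5.15-rc5.tar.bz2 kernel-ark-v5.15-rc5 linux-5.15-rc5
```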

I'm compiling the kernel now to see if it really works with what I committed. It's quirky right now, but I haven't had to reboot the system yet.

isodude commented 3 years ago

I updated the MSI patch a bit, to reflect what was actually missing between the two commits (adding msi functions vs removing old ones).

I still see these flip done timeouts though. I thought it was going quite well with the new MSI patches, but I get flip done timeouts anyway, though not nearly as bad as without them.

Something is stuck somewhere and I have no clue how to even see what is wrong. All I know is that 5.15 with pci=nomsi is a good combo at least. Even 5.15 without X started doesn't fare well with suspend, but 5.15 with pci=nomsi just keeps going even if there are errors.

If anyone has any idea about what to do or what to analyze, please do tell.

na-- commented 3 years ago

@isodude, I don't understand a lot of the things you're trying and discussing here, but do you think this amdgpu issue I'm experiencing with Ryzen 7 4800H might be related to this one? FWIW, I also have the same "screen does not wake up after resume" issue as well, though I haven't actually tried to diagnose that at all... :sweat_smile:

isodude commented 3 years ago

@na-- Lately I've been quite pessimistic that there's one fix to rule them all; rather, it's a whole slew of small patches that make up the forest. There's a bunch of upstreamed commits for the amdgpu driver that are worth testing out.

I just sent in the PCI/MSI-patch this morning and I hope that it gets accepted: https://lore.kernel.org/linux-pci/859dbb71-098f-07f2-f063-4874ccc8523b@oderland.se/T/#u that will make our life a bit easier when testing out 5.15 kernel, it's already included in my qubes-linux-kernel branch.

isodude commented 3 years ago

I've got some updates, hopefully better updates after a bit more trial.

Anyhow, I tried compiling without CCP (the AMD crypto co-processor) and the system survived all the resumes I threw at it. Sometimes sdma just drops dead and I have to restart lightdm (doing it with a blank screen), but after that suspend/resume works anyhow. I'm doing some patching on the amdgpu driver now around the sdma resume area. Hopefully that will yield good results.

It would be nice to see if disabling CCP has good effect on versions below 5.15 as well.

tzelch commented 3 years ago

thanks for all the work you're putting into this

isodude commented 3 years ago

Hopefully we will reach some sort of end to it all, thanks for following along :)

I realized that I had a config file installed that may not be common knowledge.

in /etc/X11/xorg.conf.d/50-video.conf

Section "Device"
  Identifier "card0"
  Driver "amdgpu"
  Option "AccelMethod" "none"
EndSection

Do you guys have that? There's no acceleration to talk of in Xen anyhow, and without that snippet my suspend/resume has really odd effects, with X suddenly hard crashing.

DemiMarie commented 3 years ago

There's no acceleration to talk of in Xen anyhow

This is actually not true. Hardware acceleration in the GUI qube should work just fine. Please report a bug if it does not.

johnnyboy-3 commented 3 years ago

@isodude I don't have such a file on my 4.1 testing setup nor on my 4.0 machine. When i add this file to 4.1 and reboot, Xorg/lightdm won't boot up:


[    46.708] (II) AMDGPU(0): Front buffer pitch: 7680 bytes
[    46.708] (==) AMDGPU(0): DRI3 disabled
[    46.708] (==) AMDGPU(0): Backing store enabled
[    46.708] (WW) AMDGPU(0): Direct rendering disabled
[    46.708] (II) AMDGPU(0): 2D and 3D acceleration disabled
[    46.708] (==) AMDGPU(0): DPMS enabled
[    46.708] (==) AMDGPU(0): Silken mouse enabled
[    46.722] (II) Initializing extension Generic Event Extension
[    46.723] (II) Initializing extension SHAPE
[    46.723] (II) Initializing extension MIT-SHM
[    46.723] (II) Initializing extension XInputExtension
[    46.723] (II) Initializing extension XTEST
[    46.723] (II) Initializing extension BIG-REQUESTS
[    46.723] (II) Initializing extension SYNC
[    46.723] (II) Initializing extension XKEYBOARD
[    46.723] (II) Initializing extension XC-MISC
[    46.724] (II) Initializing extension XFIXES
[    46.724] (II) Initializing extension RENDER
[    46.724] (II) Initializing extension RANDR
[    46.724] (II) Initializing extension COMPOSITE
[    46.724] (II) Initializing extension DAMAGE
[    46.724] (II) Initializing extension MIT-SCREEN-SAVER
[    46.724] (II) Initializing extension DOUBLE-BUFFER
[    46.725] (II) Initializing extension RECORD
[    46.725] (II) Initializing extension DPMS
[    46.725] (II) Initializing extension Present
[    46.725] (II) Initializing extension DRI3
[    46.725] (II) Initializing extension X-Resource
[    46.725] (II) Initializing extension XVideo
[    46.725] (II) Initializing extension XVideo-MotionCompensation
[    46.725] (II) Initializing extension SELinux
[    46.725] (II) SELinux: Disabled on system
[    46.725] (II) Initializing extension GLX
[    46.725] (II) AIGLX: Screen 0 is not DRI2 capable
[    46.730] (II) IGLX: Loaded and initialized swrast
[    46.730] (II) GLX: Initialized DRISWRAST GL provider for screen 0
[    46.730] (II) Initializing extension XFree86-VidModeExtension
[    46.730] (II) Initializing extension XFree86-DGA
[    46.731] (II) Initializing extension XFree86-DRI
[    46.731] (II) Initializing extension DRI2
[    46.731] (II) AMDGPU(0): Setting screen physical size to 508 x 285
[    46.738] (EE) 
[    46.738] (EE) Backtrace:
[    46.739] (EE) 0: /usr/bin/X (OsLookupColor+0x139) [0x5d3a20d6e3e9]
[    46.739] (EE) 1: /lib64/libpthread.so.0 (funlockfile+0x60) [0x774f7d311a90]
[    46.740] (EE) 2: /lib64/libc.so.6 (gsignal+0x145) [0x774f7d16d7d5]
[    46.741] (EE) 3: /lib64/libc.so.6 (abort+0x127) [0x774f7d156895]
[    46.741] (EE) 4: /lib64/libc.so.6 (__assert_fail_base.cold+0xf) [0x774f7d156769]
[    46.742] (EE) 5: /lib64/libc.so.6 (__assert_fail+0x46) [0x774f7d165e86]
[    46.742] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.742] (EE) 6: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8de335]
[    46.742] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.742] (EE) 7: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8de6b2]
[    46.743] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.743] (EE) 8: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8ea81d]
[    46.743] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.743] (EE) 9: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8ec97a]
[    46.743] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.743] (EE) 10: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8ee657]
[    46.744] (EE) 11: /usr/bin/X (MapWindow+0x24c) [0x5d3a20c35b5c]
[    46.744] (EE) 12: /usr/bin/X (InitFonts+0x355) [0x5d3a20c0caa5]
[    46.744] (EE) 13: /lib64/libc.so.6 (__libc_start_main+0xf2) [0x774f7d158082]
[    46.745] (EE) 14: /usr/bin/X (_start+0x2e) [0x5d3a20bf5e6e]
[    46.745] (EE) 
[    46.745] (EE) 

isodude commented 3 years ago

There's no acceleration to talk of in Xen anyhow

This is actually not true. Hardware acceleration in the GUI qube should work just fine. Please report a bug if it does not.

I will, I guess my other issue would suffice: #7002

isodude commented 3 years ago

@isodude I don't have such a file on my 4.1 testing setup nor on my 4.0 machine. When i add this file to 4.1 and reboot, Xorg/lightdm won't boot up:


[    46.708] (II) AMDGPU(0): Front buffer pitch: 7680 bytes
[    46.708] (==) AMDGPU(0): DRI3 disabled
[    46.708] (==) AMDGPU(0): Backing store enabled
[    46.708] (WW) AMDGPU(0): Direct rendering disabled
[    46.708] (II) AMDGPU(0): 2D and 3D acceleration disabled
[    46.708] (==) AMDGPU(0): DPMS enabled
[    46.708] (==) AMDGPU(0): Silken mouse enabled
[    46.722] (II) Initializing extension Generic Event Extension
[    46.723] (II) Initializing extension SHAPE
[    46.723] (II) Initializing extension MIT-SHM
[    46.723] (II) Initializing extension XInputExtension
[    46.723] (II) Initializing extension XTEST
[    46.723] (II) Initializing extension BIG-REQUESTS
[    46.723] (II) Initializing extension SYNC
[    46.723] (II) Initializing extension XKEYBOARD
[    46.723] (II) Initializing extension XC-MISC
[    46.724] (II) Initializing extension XFIXES
[    46.724] (II) Initializing extension RENDER
[    46.724] (II) Initializing extension RANDR
[    46.724] (II) Initializing extension COMPOSITE
[    46.724] (II) Initializing extension DAMAGE
[    46.724] (II) Initializing extension MIT-SCREEN-SAVER
[    46.724] (II) Initializing extension DOUBLE-BUFFER
[    46.725] (II) Initializing extension RECORD
[    46.725] (II) Initializing extension DPMS
[    46.725] (II) Initializing extension Present
[    46.725] (II) Initializing extension DRI3
[    46.725] (II) Initializing extension X-Resource
[    46.725] (II) Initializing extension XVideo
[    46.725] (II) Initializing extension XVideo-MotionCompensation
[    46.725] (II) Initializing extension SELinux
[    46.725] (II) SELinux: Disabled on system
[    46.725] (II) Initializing extension GLX
[    46.725] (II) AIGLX: Screen 0 is not DRI2 capable
[    46.730] (II) IGLX: Loaded and initialized swrast
[    46.730] (II) GLX: Initialized DRISWRAST GL provider for screen 0
[    46.730] (II) Initializing extension XFree86-VidModeExtension
[    46.730] (II) Initializing extension XFree86-DGA
[    46.731] (II) Initializing extension XFree86-DRI
[    46.731] (II) Initializing extension DRI2
[    46.731] (II) AMDGPU(0): Setting screen physical size to 508 x 285
[    46.738] (EE) 
[    46.738] (EE) Backtrace:
[    46.739] (EE) 0: /usr/bin/X (OsLookupColor+0x139) [0x5d3a20d6e3e9]
[    46.739] (EE) 1: /lib64/libpthread.so.0 (funlockfile+0x60) [0x774f7d311a90]
[    46.740] (EE) 2: /lib64/libc.so.6 (gsignal+0x145) [0x774f7d16d7d5]
[    46.741] (EE) 3: /lib64/libc.so.6 (abort+0x127) [0x774f7d156895]
[    46.741] (EE) 4: /lib64/libc.so.6 (__assert_fail_base.cold+0xf) [0x774f7d156769]
[    46.742] (EE) 5: /lib64/libc.so.6 (__assert_fail+0x46) [0x774f7d165e86]
[    46.742] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.742] (EE) 6: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8de335]
[    46.742] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.742] (EE) 7: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8de6b2]
[    46.743] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.743] (EE) 8: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8ea81d]
[    46.743] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.743] (EE) 9: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8ec97a]
[    46.743] (EE) unw_get_proc_name failed: no unwind info found [-10]
[    46.743] (EE) 10: /usr/lib64/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x774f7c8ee657]
[    46.744] (EE) 11: /usr/bin/X (MapWindow+0x24c) [0x5d3a20c35b5c]
[    46.744] (EE) 12: /usr/bin/X (InitFonts+0x355) [0x5d3a20c0caa5]
[    46.744] (EE) 13: /lib64/libc.so.6 (__libc_start_main+0xf2) [0x774f7d158082]
[    46.745] (EE) 14: /usr/bin/X (_start+0x2e) [0x5d3a20bf5e6e]
[    46.745] (EE) 
[    46.745] (EE) 

Sweet, that is the same error I had before I switched over to xorg-x11-drv-amdgpu-21.0.0-1.fc33.x86_64. As I said in the above-mentioned issue though, it did not resolve any problems other than the error you mention. That means it's not just my setup that's causing this. I will downgrade and try without AccelMethod none.

Odd that you get that error just by adding that extension though.

isodude commented 3 years ago

Removing AccelMethod made the stock amdgpu drivers work again. Trying this out with the latest patches, I got a system where I could suspend/resume. You may need to suspend via the lid once or twice because of amdgpu problems, but the system gets back on its feet at least.

isodude commented 3 years ago

Does someone want to try to compile 5.14 with this at the end of config-qubes?

CONFIG_AMD_PMC=y
# CONFIG_HSA_AMD is not set
# CONFIG_CRYPTO_DEV_CCP is not set

Since AMD SEV is not in Xen right now, my guess is that it's best to leave CONFIG_CRYPTO_DEV_CCP unset, @marmarek: am I correct here?

And maybe # CONFIG_DRM_AMD_SECURE_DISPLAY is not set as well. I'm a bit unsure if this actually does anything good.

I think this could have a great effect on the resumableness of 5.14.

johnnyboy-3 commented 3 years ago

Hi, if I didn't make any mistake during configuration/compiling, resume is still broken. My guess is the settings aren't the ones you suggested (not set = n?)

config-5.14.9-1.fc32.qubes.x86_64:

CONFIG_AMD_PMC=m
CONFIG_HSA_AMD=y
CONFIG_CRYPTO_DEV_CCP=y
CONFIG_DRM_AMD_SECURE_DISPLAY=y

journal:

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 3 PID: 4295 at include/drm/ttm/ttm_bo_api.h:580 amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib snd_hda_codec_realtek mac80211 snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec cfg80211 snd_hda_core rfkill snd_hwdep libarc4 snd_seq snd_seq_device snd_pcm snd_timer snd joydev soundcore intel_rapl_msr intel_rapl_common wmi_bmof pcspkr r8169 sp5100_tco video i2c_piix4 k10temp wmi gpio_amdpt gpio_generic xenfs fuse ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt trusted asn1_encoder amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper cec drm ccp xhci_pci xhci_pci_renesas xhci_hcd xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
dom0 kernel: CPU: 3 PID: 4295 Comm: kworker/3:4 Tainted: G        W         5.14.9-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
dom0 kernel: RIP: e030:amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Code: 75 25 48 8b bd 48 01 00 00 48 85 ff 74 05 e8 3d 52 5f c1 48 8b 85 c0 01 00 00 8b 40 10 83 f8 02 74 24 83 f8 01 74 0d 5b 5d c3 <0f> 0b 8b 85 04 02 00 00 eb ca 48 8b 85 30 01 00 00 f0 48 29 83 50
dom0 kernel: RSP: e02b:ffffc90042173cb0 EFLAGS: 00010246
dom0 kernel: RAX: 0000000000000000 RBX: ffff88810bfe5288 RCX: 0000000000000000
dom0 kernel: RDX: ffff8881047b27c0 RSI: 0000000000000000 RDI: ffff88810d807800
dom0 kernel: RBP: ffff88810d807800 R08: 0000000000000003 R09: 000000000036d468
dom0 kernel: R10: ffffc90042173ad8 R11: ffffffff82947168 R12: ffff88810bfe5288
dom0 kernel: R13: ffff88810d807800 R14: ffff888016b7f400 R15: 0000000000000000
dom0 kernel: FS:  0000000000000000(0000) GS:ffff8881272c0000(0000) knlGS:0000000000000000
dom0 kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
dom0 kernel: CR2: 000055aece260078 CR3: 0000000101acc000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel:  amdgpu_gart_table_vram_unpin+0x54/0xc0 [amdgpu]
dom0 kernel:  gmc_v9_0_hw_fini+0x5f/0x80 [amdgpu]
dom0 kernel:  amdgpu_device_ip_suspend_phase2+0xc5/0x150 [amdgpu]
dom0 kernel:  amdgpu_device_ip_suspend+0x32/0x60 [amdgpu]
dom0 kernel:  amdgpu_device_pre_asic_reset+0xa8/0x250 [amdgpu]
dom0 kernel:  amdgpu_device_gpu_recover.cold+0x53d/0x78e [amdgpu]
dom0 kernel:  amdgpu_job_timedout+0x17a/0x1a0 [amdgpu]
dom0 kernel:  drm_sched_job_timedout+0x74/0x110 [gpu_sched]
dom0 kernel:  process_one_work+0x1ec/0x390
dom0 kernel:  worker_thread+0x4a/0x320
dom0 kernel:  ? process_one_work+0x390/0x390
dom0 kernel:  kthread+0x10f/0x130
dom0 kernel:  ? set_kthread_struct+0x40/0x40
dom0 kernel:  ret_from_fork+0x22/0x30
dom0 kernel: ---[ end trace 98343ebdb58f3e7d ]---
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
dom0 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15dd
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
dom0 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -6
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I can try a recompilation soon.
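When comparing runs like the one above, it can help to pull only the amdgpu failure lines out of a long journal. The following is a minimal sketch, not something used in this thread: the signature list is assembled from messages quoted in this issue and is surely incomplete, and the names `SIGNATURES` and `scan_journal` are made up for illustration.

```python
import re

# Signatures taken from log excerpts quoted in this issue.
# Illustrative only; extend the table for other failure modes.
SIGNATURES = {
    "psp_load_failed": re.compile(r"PSP load tmp failed"),
    "psp_command_failed": re.compile(r"psp (?:gfx )?command .*failed"),
    "gpu_reset_failed": re.compile(r"GPU reset\(\d+\) failed"),
    "ring_timeout": re.compile(r"ring \w+ timeout"),
    "kfd_iommu": re.compile(r"Failed to resume IOMMU"),
}


def scan_journal(lines):
    """Return {signature_name: [matching lines]} for the given log lines."""
    hits = {}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.rstrip("\n"))
    return hits
```

Feeding it the output of `journalctl -k -b` (for example through `sys.stdin`) gives a short per-signature summary that is easier to compare between kernels than the full log.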

isodude commented 3 years ago

Hi, if I didn't make any mistake during configuration/compiling, resume is still broken. My guess is the settings aren't the ones you suggested (not set = n?)

You should actually enter it exactly like I wrote, a bit confusing I know :) They should be set to =n eventually. To set an option to no, the line actually has to be # CONFIG_OPTION is not set (watch out for that space after the initial #).

johnnyboy-3 commented 3 years ago

Hi, I think you misunderstood. I copied & pasted your snippet to the end of config-qubes. My excerpt is from /boot/config-5.14.9-1.fc32.qubes.x86_64, after I installed the linux-kernel rpm file. The qubes-builder or Makefile must have set CONFIG_AMD_PMC to m instead of y and ignored the rest (or took other defaults). My question was: did you mean n for the commented lines?

edit: I'm pretty sure the error is on my side here.

isodude commented 3 years ago

So, you're only running make linux-kernel after those changes right? It sounds like you're running make get-sources and that it gets overridden.

Well, there is no such thing as =n, it's 'is not set'.

During the start of the build you can actually see when it is setting the override configs. It's also possible to enter the chroot and look up the .config file inside the unpacked kernel.
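To check whether the overrides really ended up in the built kernel, the requested fragment can be compared against the installed config file. This is a rough sketch under a few assumptions: the helper names `parse_kconfig` and `mismatches` are made up for illustration, "is not set" is represented as `"n"`, and an option missing from the installed config is treated as not set.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
# Note the mandatory space after '#': "#CONFIG_FOO is not set" is just a comment.
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map each option to its value; 'is not set' options map to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def mismatches(wanted, actual):
    """Return {option: (wanted_value, actual_value)} for every difference.

    Assumption: an option absent from `actual` counts as 'n'.
    """
    return {
        opt: (val, actual.get(opt, "n"))
        for opt, val in wanted.items()
        if actual.get(opt, "n") != val
    }
```

Run against the fragment suggested earlier and the excerpt from /boot/config-5.14.9-1.fc32.qubes.x86_64, this would flag all three options (AMD_PMC as m instead of y, the other two as y instead of not set), which points at the fragment not being applied rather than at a syntax problem.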

Anyhow, I booted the kernel with CONFIG_HSA_AMD and CONFIG_CRYPTO_DEV_CCP off and CONFIG_AMD_PMC on. Not much difference, sadly. I'm messing about with the dpm performance level, and that actually makes the resume behave differently depending on which values you set.

I found out that there's the amdgpu.msi=0 parameter: setting it on the kernel command line sidelines only amdgpu's MSI, not the rest of the system's. This may be how I solve things right now instead of carrying fat patches. At least in 5.15+.

I started to patch amdgpu_irq.c, and got some results, it seems that they carry their own version of some fiddling with MSI, but Xen wants to do that. So I don't know.

Maybe set amdgpu.fw_load_type=1 (SMU instead of PSP) on the kernel cmdline. Much of the trouble comes from PSP trying to load firmware, and PSP is pretty tied up with CCP/TEE.
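Since several of these experiments depend on what actually reached the kernel command line, here is a small sketch for listing the module parameters that were passed. The helper name `module_params` is made up for illustration, and the parsing is deliberately simple: it does not handle quoted values containing spaces.

```python
def module_params(cmdline, module):
    """Collect 'module.param=value' tokens from a kernel command line string."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params
```

On a running system, `module_params(open("/proc/cmdline").read(), "amdgpu")` shows whether options such as msi or fw_load_type were really applied at boot.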

Thanks for giving it a try, I'm about to give up any moment now :sweat_smile:

johnnyboy-3 commented 3 years ago

Hi, I followed your conversations with the devs a bit and was wondering how the patches work out for you.

Greetings

isodude commented 3 years ago

I got some attention from Thomas here: https://lore.kernel.org/linux-pci/87ee7w6bxi.ffs@tglx/

Since I fixed the xorg amdgpu drivers, there's no glitch while resuming. When resume works (v5.15 kernel), it seems to do pretty decently. It crashes completely after about 10 seconds though. So I hope that the above thread will give some insight into that. Otherwise I will open a bug with drm/amd I guess.

So during the wait, I updated the kernel again, hit a new xen-bug, got it fixed:

On 11/5/21 16:17, Josef Per Johansson wrote:
>
>
> ------------------------------------------------------------------------
> *From:* Peter Zijlstra <peterz@infradead.org>
> *Sent:* Friday, 5 November 2021 14:00
> *To:* Josef Johansson
> *Cc:* Barry Song; Tim Chen; Thomas Gleixner; x86@kernel.org
> *Subject:* Re: [REGRESSION][BISECTED] sched: Add cluster scheduler
> level in core and related Kconfig for ARM64
>
> On Fri, Nov 05, 2021 at 01:50:26PM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 05, 2021 at 09:04:51AM +0100, Josef Johansson wrote:
>>> On 11/5/21 08:41, Peter Zijlstra wrote:
>>>> On Fri, Nov 05, 2021 at 07:28:33AM +0100, Josef Johansson wrote:
>>
>>>>> If I can provide any useful details please let me know.
>>>>> [193.805738] Call Trace:
>>>>> [193.805738] xen_pv_smp_prepare_cpus+0x137/0x2d7
>>
>>>> Wait,... are you running Xen ?
>>
>>> Oh, yes. I'm running Xen, forgot to mention that. 
>>> I'm running Qubes OS R4.1.
>>
>> OK, lemme go try and figure out what xen does weird.
>
> Does this help?
>
> ---
> diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c
> index 7ed56c6075b0..72742914dd5a 100644
> --- a/arch/x86/xen/smp_pv.c
> +++ b/arch/x86/xen/smp_pv.c
> @@ -246,6 +246,7 @@ static void __init
> xen_pv_smp_prepare_cpus(unsigned int max_cpus)
> zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
> zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
> zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
> + zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
> }
> set_cpu_sibling_map(0);
>
> Thanks! It seems to have done the trick.
>
> I am currently compiling without preemptions since that triggered
> another xen related bug.
> is_xen_pmu uses this_cpu_ptr with preemption enabled.
>
> I'll let you know how that went.
>
> Regards
> Josef
I tested out the patch properly now. You can add my Tested-By:
josef@oderland.se.

Regards
Josef

---
diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c
index 7ed56c6075b0..72742914dd5a 100644
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -246,6 +246,7 @@ static void __init xen_pv_smp_prepare_cpus(unsigned int max_cpus)
        zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
        zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
        zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+       zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
    }
    set_cpu_sibling_map(0);

And now I'm in a new bisect because suspend is completely broken sometime after this.

So right now I'm waiting a bit for Thomas and the gang to give some hint. I think that it may very well be related.

On an unrelated note: I happened to try out preempt, which totally broke everything during suspend. There's a bug as well!

smpboot: CPU 4 is now offline
smpboot: CPU 5 is now offline
smpboot: CPU 6 is now offline
smpboot: CPU 7 is now offline
ACPI: PM: Low-level resume complete
ACPI: EC: EC started
ACPI: PM: Restoring platform NVS memory
Enabling non-boot CPUs ...
installing Xen timer for CPU 1
BUG: using smp_processor_id() in preemptible [00000000] code: systemd-sleep/26742
caller is is_xen_pmu+0x12/0x30
CPU: 0 PID: 26742 Comm: systemd-sleep Tainted: G        W         5.16.0-0.rc0.0.fc32.qubes.x86_64 #1
Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET65W(1.34 ) 06/17/2021
Call Trace:
 dump_stack_lvl+0x46/0x5a
 check_preemption_disabled+0xde/0xe0
 is_xen_pmu+0x12/0x30
 xen_smp_intr_init_pv+0x75/0x100
 ? pfn_pte+0x90/0x90
 xen_cpu_up_prepare_pv+0x3e/0x90
 cpuhp_invoke_callback+0x2ba/0x460
 ? _raw_spin_unlock_irq+0x1d/0x30
 cpuhp_up_callbacks+0x4b/0x170
 _cpu_up+0xba/0x140
 thaw_secondary_cpus.cold+0x50/0xaa
 suspend_enter+0x115/0x3a0
 suspend_devices_and_enter+0x133/0x280
 enter_state+0x125/0x176
 pm_suspend.cold+0x20/0x6b
 state_store+0x27/0x50
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0x15c/0x1f0
 vfs_write+0x20d/0x2a0
 ksys_write+0x67/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x753dfe41a2f7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffe091aa078 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000753dfe41a2f7
RDX: 0000000000000004 RSI: 00007ffe091aa160 RDI: 0000000000000004
RBP: 00007ffe091aa160 R08: 000063792f42fca0 R09: 000000000000000d
R10: 000063792f42beb0 R11: 0000000000000246 R12: 0000000000000004
R13: 000063792f42b2d0 R14: 0000000000000004 R15: 0000753dfe4ec700
xen_acpi_processor: Uploading Xen processor PM info
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU9
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU11
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU13
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU15
cpu 1 spinlock event irq 67
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
ACPI: \_SB_.PLTF.C001: Found 3 idle states
ACPI: FW issue: working around C-state latencies out of order
CPU1 is up
installing Xen timer for CPU 2

isodude commented 2 years ago

Ok, so a bit of a mixed bag right now.

I created an issue over at amdgpu, Failed to terminate hdcp ta during suspend (s3) on Xen (https://gitlab.freedesktop.org/drm/amd/-/issues/1827), which will hopefully gain some traction.

Kernel 5.16-rc6 has no issues booting :) I updated my linux-kernel-tree with the devel-5.16 branch.

I am building a replica of Qubes on Arch instead, installing Xen etc., but using the latest git. Right now I'm testing out 5.16-rc6 and Xen 4.15.1 with Mesa 22. Suspend/resume is still a no-go though. It seems as though the problems outlined here are solved, but others have surfaced within Mesa.

[drm] psp command (0x7) failed and response status is (0xFFFF0007)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmp failed!

This error is not present with Xen 4.15.1, so something got solved. Maybe I can bisect this. Xen is easier to recompile than the kernel.

During my Arch build I observed that if I didn't properly load the wireless drivers, resume would not go through at all. The problem with them not loading properly is my own doing, but still.

So right now I'm trying to build llvm-git to see if anything has been solved upstream. I will send an issue to mesa regarding amdgpu_cs_ioctl: Failed to initialize parser -125, which I now get during resume in Arch. There are some patchworks that I can try. If I get a working setup there, maybe we can backport that into the current Qubes packages.

@ydirson sent an interesting thread (https://www.spinics.net/lists/amd-gfx/msg71034.html) regarding VGA passthrough on Qubes. What I got from it is that I can't disable the CCP; I need it alive so that PSP can install firmware. The SMU is not used for that at all with Renoir. So I changed config-qubes to set only CONFIG_AMD_PMC=y, also leaving CONFIG_HSA_AMD compiled in.

Holler if you find anything interesting.

k4z4n0v4 commented 2 years ago

Is there any progress on this? I think I have a similar problem with a Ryzen 7 Pro 5850U, even under kernel-latest (Linux dom0 5.15.14-1.fc32.qubes.x86_64). Attaching logs for resume failing on Qubes R4.1 and succeeding on Kali (Linux kali 5.14.0-kali4-amd64) booted from USB on the same machine.
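For anyone who wants to compare the two journals below mechanically, a rough approach is to strip the timestamp and hostname prefix and then list the kernel messages that appear in only one of them. This is a sketch, not a tool used in this thread: the function names are hypothetical, and messages that embed timings (such as "elapsed 0.031 seconds") will show up as differences even when nothing is wrong.

```python
import re

# "Feb 15 10:40:54 kali kernel: message" -> "message"
_PREFIX = re.compile(r"^\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \S+ kernel: ")


def kernel_messages(lines):
    """Strip the syslog-style prefix and keep only kernel messages, in order."""
    out = []
    for line in lines:
        m = _PREFIX.match(line)
        if m:
            out.append(line[m.end():].rstrip())
    return out


def only_in(first, second):
    """Messages that appear in `first` but not in `second`, order preserved."""
    seen = set(second)
    return [msg for msg in first if msg not in seen]
```

Calling `only_in(kernel_messages(qubes_lines), kernel_messages(kali_lines))` narrows a long journal down to the messages that only occur on the failing system.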

Kali journal on successful sleep and wake up

Feb 15 10:40:54 kali unknown: Starting suspend at 10:40:54
Feb 15 10:40:54 kali kernel: PM: suspend entry (deep)
Feb 15 10:40:54 kali kernel: Filesystems sync: 0.000 seconds
Feb 15 10:40:54 kali kernel: (NULL device *): firmware: direct-loading firmware regulatory.db
Feb 15 10:40:54 kali kernel: (NULL device *): firmware: direct-loading firmware regulatory.db.p7s
Feb 15 10:41:04 kali kernel: (NULL device *): firmware: direct-loading firmware iwlwifi-cc-a0-63.ucode
Feb 15 10:41:04 kali kernel: Freezing user space processes ... (elapsed 0.031 seconds) done.
Feb 15 10:41:04 kali kernel: OOM killer disabled.
Feb 15 10:41:04 kali kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Feb 15 10:41:04 kali kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Feb 15 10:41:04 kali kernel: [drm] free PSP TMR buffer
Feb 15 10:41:04 kali kernel: ACPI: EC: interrupt blocked
Feb 15 10:41:04 kali kernel: ACPI: PM: Preparing to enter system sleep state S3
Feb 15 10:41:04 kali kernel: ACPI: EC: event blocked
Feb 15 10:41:04 kali kernel: ACPI: EC: EC stopped
Feb 15 10:41:04 kali kernel: ACPI: PM: Saving platform NVS memory
Feb 15 10:41:04 kali kernel: Disabling non-boot CPUs ...
Feb 15 10:41:04 kali kernel: smpboot: CPU 1 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 2 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 3 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 4 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 5 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 6 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 7 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 8 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 9 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 10 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 11 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 12 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 13 is now offline
Feb 15 10:41:04 kali kernel: smpboot: CPU 14 is now offline
Feb 15 10:41:04 kali kernel: Spectre V2 : Update user space SMT mitigation: STIBP off
Feb 15 10:41:04 kali kernel: smpboot: CPU 15 is now offline
Feb 15 10:41:04 kali kernel: ACPI: PM: Low-level resume complete
Feb 15 10:41:04 kali kernel: ACPI: EC: EC started
Feb 15 10:41:04 kali kernel: ACPI: PM: Restoring platform NVS memory
Feb 15 10:41:04 kali kernel: LVT offset 0 assigned for vector 0x400
Feb 15 10:41:04 kali kernel: Enabling non-boot CPUs ...
Feb 15 10:41:04 kali kernel: x86: Booting SMP configuration:
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
Feb 15 10:41:04 kali kernel: microcode: CPU1: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C001: Found 3 idle states
Feb 15 10:41:04 kali kernel: Spectre V2 : Update user space SMT mitigation: STIBP always-on
Feb 15 10:41:04 kali kernel: CPU1 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 2 APIC 0x2
Feb 15 10:41:04 kali kernel: microcode: CPU2: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C002: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU2 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 3 APIC 0x3
Feb 15 10:41:04 kali kernel: microcode: CPU3: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C003: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU3 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 4 APIC 0x4
Feb 15 10:41:04 kali kernel: microcode: CPU4: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C004: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU4 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 5 APIC 0x5
Feb 15 10:41:04 kali kernel: microcode: CPU5: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C005: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU5 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 6 APIC 0x6
Feb 15 10:41:04 kali kernel: microcode: CPU6: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C006: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU6 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 7 APIC 0x7
Feb 15 10:41:04 kali kernel: microcode: CPU7: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C007: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU7 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 8 APIC 0x8
Feb 15 10:41:04 kali kernel: microcode: CPU8: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C008: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU8 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 9 APIC 0x9
Feb 15 10:41:04 kali kernel: microcode: CPU9: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C009: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU9 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 10 APIC 0xa
Feb 15 10:41:04 kali kernel: microcode: CPU10: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C00A: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU10 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 11 APIC 0xb
Feb 15 10:41:04 kali kernel: microcode: CPU11: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C00B: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU11 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 12 APIC 0xc
Feb 15 10:41:04 kali kernel: microcode: CPU12: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C00C: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU12 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 13 APIC 0xd
Feb 15 10:41:04 kali kernel: microcode: CPU13: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C00D: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU13 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 14 APIC 0xe
Feb 15 10:41:04 kali kernel: microcode: CPU14: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C00E: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU14 is up
Feb 15 10:41:04 kali kernel: smpboot: Booting Node 0 Processor 15 APIC 0xf
Feb 15 10:41:04 kali kernel: microcode: CPU15: patch_level=0x0a50000c
Feb 15 10:41:04 kali kernel: ACPI: \_SB_.PLTF.C00F: Found 3 idle states
Feb 15 10:41:04 kali kernel: CPU15 is up
Feb 15 10:41:04 kali kernel: ACPI: PM: Waking up from system sleep state S3
Feb 15 10:41:04 kali kernel: ACPI: EC: interrupt unblocked
Feb 15 10:41:04 kali kernel: ACPI: EC: event unblocked
Feb 15 10:41:04 kali kernel: pci 0000:00:00.2: can't derive routing for PCI INT A
Feb 15 10:41:04 kali kernel: pci 0000:00:00.2: PCI INT A: no GSI
Feb 15 10:41:04 kali kernel: [drm] PCIE GART of 1024M enabled.
Feb 15 10:41:04 kali kernel: [drm] PTB located at 0x000000F400900000
Feb 15 10:41:04 kali kernel: [drm] PSP is resuming...
Feb 15 10:41:04 kali kernel: usb usb1: root hub lost power or was reset
Feb 15 10:41:04 kali kernel: usb usb2: root hub lost power or was reset
Feb 15 10:41:04 kali kernel: xhci_hcd 0000:06:00.0: Zeroing 64bit base registers, expecting fault
Feb 15 10:41:04 kali kernel: usb usb7: root hub lost power or was reset
Feb 15 10:41:04 kali kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: dpm has been disabled
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Feb 15 10:41:04 kali kernel: [drm] kiq ring mec 2 pipe 1 q 0
Feb 15 10:41:04 kali kernel: [drm] DMUB hardware initialized: version=0x01010019
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control line:363
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control line:371
Feb 15 10:41:04 kali kernel: nvme nvme0: 16/0/0 default/read/poll queues
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control line:379
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:434
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:508
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:442
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:516
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:450
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:524
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:458
Feb 15 10:41:04 kali kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:532
Feb 15 10:41:04 kali kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Feb 15 10:41:04 kali kernel: [drm] JPEG decode initialized successfully.
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 15 10:41:04 kali kernel: amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Feb 15 10:41:04 kali kernel: usb 1-2: reset high-speed USB device number 2 using xhci_hcd
Feb 15 10:41:04 kali kernel: psmouse serio1: synaptics: queried max coordinates: x [..5678], y [..4694]
Feb 15 10:41:04 kali kernel: psmouse serio1: synaptics: queried min coordinates: x [1266..], y [1162..]
Feb 15 10:41:04 kali kernel: OOM killer enabled.
Feb 15 10:41:04 kali kernel: Restarting tasks ... 
Feb 15 10:41:04 kali kernel: pci_bus 0000:01: Allocating resources
Feb 15 10:41:04 kali kernel: pci_bus 0000:02: Allocating resources
Feb 15 10:41:04 kali kernel: pci_bus 0000:03: Allocating resources
Feb 15 10:41:04 kali kernel: pci_bus 0000:04: Allocating resources
Feb 15 10:41:04 kali kernel: pci_bus 0000:05: Allocating resources
Feb 15 10:41:04 kali kernel: pci_bus 0000:06: Allocating resources
Feb 15 10:41:04 kali kernel: pci_bus 0000:07: Allocating resources
Feb 15 10:41:04 kali kernel: done.
Feb 15 10:41:04 kali kernel: PM: suspend exit
Feb 15 10:41:04 kali kernel: Generic FE-GE Realtek PHY r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
Feb 15 10:41:04 kali kernel: r8169 0000:02:00.0 eth0: Link is Down
Feb 15 10:41:04 kali kernel: Generic FE-GE Realtek PHY r8169-0-500:00: attached PHY driver (mii_bus:phy_addr=r8169-0-500:00, irq=MAC)
Feb 15 10:41:04 kali kernel: r8169 0000:05:00.0 eth1: Link is Down

Qubes journal of unsuccessful sleep and wake up with kernel-latest.

Feb 15 14:08:25 dom0 unknown: Starting suspend at 14:08:25
Feb 15 14:08:27 dom0 kernel: audit: type=1130 audit(1644919707.518:279): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=qubes-suspend comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 15 14:08:27 dom0 kernel: PM: suspend entry (deep)
Feb 15 14:08:27 dom0 kernel: Filesystems sync: 0.001 seconds
Feb 15 14:09:05 dom0 kernel: Freezing user space processes ... (elapsed 0.001 seconds) done.
Feb 15 14:09:05 dom0 kernel: OOM killer disabled.
Feb 15 14:09:05 dom0 kernel: Freezing remaining freezable tasks ... (elapsed 0.098 seconds) done.
Feb 15 14:09:05 dom0 kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Feb 15 14:09:05 dom0 kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Feb 15 14:09:05 dom0 kernel: [drm] free PSP TMR buffer
Feb 15 14:09:05 dom0 kernel: [drm] psp gfx command DESTROY_TMR(0x7) failed and response status is (0x80000306)
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
Feb 15 14:09:05 dom0 kernel: PM: suspend devices took 1.588 seconds
Feb 15 14:09:05 dom0 kernel: ACPI: EC: interrupt blocked
Feb 15 14:09:05 dom0 kernel: ACPI: PM: Preparing to enter system sleep state S3
Feb 15 14:09:05 dom0 kernel: ACPI: EC: event blocked
Feb 15 14:09:05 dom0 kernel: ACPI: EC: EC stopped
Feb 15 14:09:05 dom0 kernel: ACPI: PM: Saving platform NVS memory
Feb 15 14:09:05 dom0 kernel: Disabling non-boot CPUs ...
Feb 15 14:09:05 dom0 kernel: ------------[ cut here ]------------
Feb 15 14:09:05 dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:522 switch_mm_irqs_off+0x3c5/0x400
Feb 15 14:09:05 dom0 kernel: Modules linked in: loop vfat fat snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn snd_soc_core intel_rapl_msr snd_compress ac97_bus snd_pcm_dmaengine think_lmi wmi_bmof firmware_attributes_class intel_rapl_common joydev snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device iwlwifi snd_pcm snd_pci_acp5x sp5100_tco snd_rn_pci_acp3x k10temp cfg80211 snd_timer snd_pci_acp3x i2c_piix4 r8169 thinkpad_acpi ucsi_acpi platform_profile ledtrig_audio typec_ucsi rfkill typec wmi snd soundcore video i2c_scmi fuse xenfs ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt trusted asn1_encoder amdgpu drm_ttm_helper rtsx_pci_sdmmc ttm mmc_core iommu_v2 crct10dif_pclmul gpu_sched crc32_pclmul i2c_algo_bit crc32c_intel drm_kms_helper cec ccp ghash_clmulni_intel xhci_pci serio_raw xhci_pci_renesas drm xhci_hcd nvme rtsx_pci nvme_core xen_acpi_processor
Feb 15 14:09:05 dom0 kernel:  xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
Feb 15 14:09:05 dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         5.15.14-1.fc32.qubes.x86_64 #1
Feb 15 14:09:05 dom0 kernel: Hardware name: LENOVO 20XK0019US/20XK0019US, BIOS R1MET43W (1.13 ) 11/05/2021
Feb 15 14:09:05 dom0 kernel: RIP: e030:switch_mm_irqs_off+0x3c5/0x400
Feb 15 14:09:05 dom0 kernel: Code: f0 41 80 65 01 fb ba 01 00 00 00 49 8d b5 68 23 00 00 4c 89 ef 49 c7 85 70 23 00 00 e0 1d 08 81 e8 50 f7 08 00 e9 15 fd ff ff <0f> 0b e8 34 fa ff ff e9 ad fc ff ff 0f 0b e9 31 fe ff ff 0f 0b e9
Feb 15 14:09:05 dom0 kernel: RSP: e02b:ffffc900400f3eb8 EFLAGS: 00010006
Feb 15 14:09:05 dom0 kernel: RAX: 0000000107db0000 RBX: ffff888139070000 RCX: 0000000000000040
Feb 15 14:09:05 dom0 kernel: RDX: ffff8881003027c0 RSI: 0000000000000000 RDI: ffff888187db0000
Feb 15 14:09:05 dom0 kernel: RBP: ffffffff829d9240 R08: 0000000000000000 R09: 0000000000000000
Feb 15 14:09:05 dom0 kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffff88810b4b61c0
Feb 15 14:09:05 dom0 kernel: R13: ffff8881003027c0 R14: 0000000000000000 R15: 0000000000000001
Feb 15 14:09:05 dom0 kernel: FS:  0000000000000000(0000) GS:ffff888139040000(0000) knlGS:0000000000000000
Feb 15 14:09:05 dom0 kernel: CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
Feb 15 14:09:05 dom0 kernel: CR2: 0000748bd6338000 CR3: 0000000002810000 CR4: 0000000000050660
Feb 15 14:09:05 dom0 kernel: Call Trace:
Feb 15 14:09:05 dom0 kernel:  <TASK>
Feb 15 14:09:05 dom0 kernel:  switch_mm+0x1c/0x30
Feb 15 14:09:05 dom0 kernel:  play_dead_common+0xa/0x20
Feb 15 14:09:05 dom0 kernel:  xen_pv_play_dead+0xa/0x60
Feb 15 14:09:05 dom0 kernel:  do_idle+0xd1/0xe0
Feb 15 14:09:05 dom0 kernel:  cpu_startup_entry+0x19/0x20
Feb 15 14:09:05 dom0 kernel:  asm_cpu_bringup_and_idle+0x5/0x1000
Feb 15 14:09:05 dom0 kernel:  </TASK>
Feb 15 14:09:05 dom0 kernel: ---[ end trace 60a8d743d9766257 ]---
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 1 is now offline
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 2 is now offline
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 3 is now offline
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 4 is now offline
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 5 is now offline
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 6 is now offline
Feb 15 14:09:05 dom0 kernel: smpboot: CPU 7 is now offline
Feb 15 14:09:05 dom0 kernel: ACPI: PM: Low-level resume complete
Feb 15 14:09:05 dom0 kernel: ACPI: EC: EC started
Feb 15 14:09:05 dom0 kernel: ACPI: PM: Restoring platform NVS memory
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: Uploading Xen processor PM info
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU9
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU11
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU13
Feb 15 14:09:05 dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU15
Feb 15 14:09:05 dom0 kernel: Enabling non-boot CPUs ...
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 1
Feb 15 14:09:05 dom0 kernel: cpu 1 spinlock event irq 67
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C001: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU1 is up
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 2
Feb 15 14:09:05 dom0 kernel: cpu 2 spinlock event irq 73
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C002: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU2 is up
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 3
Feb 15 14:09:05 dom0 kernel: cpu 3 spinlock event irq 79
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C003: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU3 is up
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 4
Feb 15 14:09:05 dom0 kernel: cpu 4 spinlock event irq 85
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C004: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU4 is up
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 5
Feb 15 14:09:05 dom0 kernel: cpu 5 spinlock event irq 91
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C005: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU5 is up
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 6
Feb 15 14:09:05 dom0 kernel: cpu 6 spinlock event irq 97
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C006: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU6 is up
Feb 15 14:09:05 dom0 kernel: installing Xen timer for CPU 7
Feb 15 14:09:05 dom0 kernel: cpu 7 spinlock event irq 103
Feb 15 14:09:05 dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Feb 15 14:09:05 dom0 kernel: ACPI: \_SB_.PLTF.C007: Found 3 idle states
Feb 15 14:09:05 dom0 kernel: CPU7 is up
Feb 15 14:09:05 dom0 kernel: ACPI: PM: Waking up from system sleep state S3
Feb 15 14:09:05 dom0 kernel: ACPI: EC: interrupt unblocked
Feb 15 14:09:05 dom0 kernel: ACPI: EC: event unblocked
Feb 15 14:09:05 dom0 kernel: [drm] PCIE GART of 1024M enabled.
Feb 15 14:09:05 dom0 kernel: [drm] PTB located at 0x000000F400900000
Feb 15 14:09:05 dom0 kernel: [drm] PSP is resuming...
Feb 15 14:09:05 dom0 kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: dpm has been disabled
Feb 15 14:09:05 dom0 kernel: nvme nvme0: 8/0/0 default/read/poll queues
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Feb 15 14:09:05 dom0 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Feb 15 14:09:05 dom0 kernel: [drm] DMUB hardware initialized: version=0x0101001C
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control line:363
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control line:371
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dsc_pg_control line:379
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:434
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:508
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:442
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:516
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:450
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:524
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:458
Feb 15 14:09:05 dom0 kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:532
Feb 15 14:09:05 dom0 kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Feb 15 14:09:05 dom0 kernel: [drm] JPEG decode initialized successfully.
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 15 14:09:05 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Feb 15 14:09:05 dom0 kernel: PM: resume devices took 0.332 seconds
Feb 15 14:09:05 dom0 kernel: OOM killer enabled.
Feb 15 14:09:05 dom0 kernel: Restarting tasks ... done.
Feb 15 14:09:05 dom0 kernel: PM: suspend exit
Feb 15 14:09:05 dom0 kernel: audit: type=1130 audit(1644919745.520:280): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-suspend comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 15 14:09:05 dom0 kernel: audit: type=1131 audit(1644919745.520:281): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-suspend comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 15 14:09:05 dom0 kernel: psmouse serio1: synaptics: queried max coordinates: x [..5678], y [..4694]
Feb 15 14:09:06 dom0 kernel: psmouse serio1: synaptics: queried min coordinates: x [1266..], y [1162..]
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:8 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00043811
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: VCN (0x1c)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x1
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x1
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:173 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000001000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: MP1 (0x0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:173 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000003000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: MP1 (0x0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000004000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0004395B
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: VCN (0x1c)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x5
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x1
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x1
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:173 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000005000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: MP1 (0x0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:173 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000007000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: MP1 (0x0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000008000 from IH client 0x12 (VMC)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: MP1 (0x0)
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
Feb 15 14:09:06 dom0 kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Feb 15 14:09:06 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=12198, emitted seq=12200
Feb 15 14:09:06 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 5665 thread X:cs0 pid 5686
Feb 15 14:09:06 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Feb 15 14:09:06 dom0 kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Feb 15 14:09:06 dom0 kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Feb 15 14:09:08 dom0 kernel: [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x0)
Feb 15 14:09:09 dom0 kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Feb 15 14:09:09 dom0 kernel: [drm] free PSP TMR buffer
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Feb 15 14:09:09 dom0 kernel: [drm] PCIE GART of 1024M enabled.
Feb 15 14:09:09 dom0 kernel: [drm] PTB located at 0x000000F400900000
Feb 15 14:09:09 dom0 kernel: [drm] PSP is resuming...
Feb 15 14:09:09 dom0 kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Feb 15 14:09:09 dom0 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Feb 15 14:09:09 dom0 kernel: [drm] DMUB hardware initialized: version=0x0101001C
Feb 15 14:09:09 dom0 kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Feb 15 14:09:09 dom0 kernel: [drm] JPEG decode initialized successfully.
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow start
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow done
Feb 15 14:09:09 dom0 kernel: [drm] Skip scheduling IBs!
Feb 15 14:09:09 dom0 kernel: [drm] Skip scheduling IBs!
Feb 15 14:09:09 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(2) succeeded!
Feb 15 14:18:53 dom0 kernel: kauditd_printk_skb: 193 callbacks suppressed
Feb 15 14:18:53 dom0 kernel: audit: type=1130 audit(1644920333.770:283): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 15 14:18:53 dom0 kernel: audit: type=1131 audit(1644920333.770:284): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
isodude commented 2 years ago

I did not manage to bisect between Xen 4.14 and 4.15. I tried applying various patches that looked interesting, but none of them made a difference.

I have not touched the Arch installation, where I'm running Xen 4.15+, for a while now though. Maybe things have been solved in mesa-git!

I'm currently running kernel 5.17 with the same issue as you have posted.

k4z4n0v4 commented 2 years ago

I'm currently running kernel 5.17 with the same issue as you have posted.

Kernel 5.17 on Qubes OS, you mean? Because sleep works normally on a non-Xen Linux kernel, even on versions lower than that, no?

isodude commented 2 years ago

I'm currently running kernel 5.17 with the same issue as you have posted.

Kernel 5.17 on Qubes OS, you mean? Because sleep works normally on a non-Xen Linux kernel, even on versions lower than that, no?

Exactly.

DemiMarie commented 2 years ago

CONFIG_SND_SOC_AMD_RENOIR=n
CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n
CONFIG_HSA_AMD_SVM=n
CONFIG_AMD_MEM_ENCRYPT=n

@marmarek Can we turn off the HDCP and SVM stuff?
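
To see how a given kernel build actually sets these options, something like the following could be used. This is only a minimal sketch: the function name `option_states` and the idea of passing the config file path on the command line are assumptions made for illustration, and where the dom0 kernel config lives is not stated in this thread. Note that a real config file records a disabled option as `# CONFIG_X is not set` rather than `=n`.

```python
import re
import sys

# Options taken from the list above.
OPTIONS = [
    "CONFIG_SND_SOC_AMD_RENOIR",
    "CONFIG_DRM_AMD_DC_HDCP",
    "CONFIG_DRM_AMD_SECURE_DISPLAY",
    "CONFIG_HSA_AMD_SVM",
    "CONFIG_AMD_MEM_ENCRYPT",
]


def option_states(config_text, names):
    """Map each option name to its value, "n" if disabled, "absent" if unknown."""
    states = {name: "absent" for name in names}
    for raw in config_text.splitlines():
        line = raw.strip()
        # Enabled or valued option, e.g. CONFIG_FOO=y or CONFIG_FOO=m
        match = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if match and match.group(1) in states:
            states[match.group(1)] = match.group(2)
            continue
        # Disabled option, written by kconfig as a comment line
        match = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if match and match.group(1) in states:
            states[match.group(1)] = "n"
    return states


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # The path is supplied by the user; no default location is assumed.
        with open(sys.argv[1]) as handle:
            for name, state in option_states(handle.read(), OPTIONS).items():
                print(f"{name}: {state}")
    else:
        print("usage: pass the path of a kernel config file")
```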

marmarek commented 2 years ago

@marmarek Can we turn off the HDCP and SVM stuff?

I highly doubt it would help with S3 issue...

DemiMarie commented 2 years ago

@marmarek Can we turn off the HDCP and SVM stuff?

I highly doubt it would help with S3 issue...

Would turning off AMD_MEM_ENCRYPT help? Not that we should do it, obviously.

isodude commented 2 years ago

AMD_MEM_ENCRYPT is not actually in use anyhow. You have to enable it with a kernel command-line option.

https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

mem_encrypt=    [X86-64] AMD Secure Memory Encryption (SME) control
                Valid arguments: on, off
                Default (depends on kernel configuration option):
                  on  (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y)
                  off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n)
                mem_encrypt=on:  Activate SME
                mem_encrypt=off: Do not activate SME
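
The rule in that excerpt can be written down as a small check. This is a rough sketch, not how the kernel implements it: the function name `mem_encrypt_requested` is made up for illustration, the "last occurrence wins" handling of repeated options is an assumption, and the result only says what the command line asks for, not whether SME is really active under Xen.

```python
def mem_encrypt_requested(cmdline, active_by_default):
    """Return True if the command line (or the config default) asks for SME."""
    setting = None
    for token in cmdline.split():
        if token.startswith("mem_encrypt="):
            value = token.split("=", 1)[1]
            if value in ("on", "off"):
                # Assumption: a later occurrence overrides an earlier one.
                setting = value
    if setting is None:
        # No explicit option: fall back to
        # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT.
        return active_by_default
    return setting == "on"


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError:
        current = ""
    # Assumption for illustration: the kernel was built with
    # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n.
    print("SME requested:", mem_encrypt_requested(current, active_by_default=False))
```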

The problem with Xen, AMD, and S3 seems to come down to AMD not doing QA for Xen on their APUs. Also, Vega and Mesa do not seem to work that well together right now either.

From my understanding, you need all the PSP features for the APU to work properly, since that is what changed in the 4000 series: the PSP loads the firmware.

Also, this is not a single problem. Most of the problems seen in kernel 5.16 and lower are solved (desktop not entering sleep, desktop not coming up from sleep).

From what I can tell, there are two problems left.

Problems with GPU reset, because Xen and the PSP do not work well together. I confirmed this was fixed in Xen 4.15+.

Xorg (Mesa, I guess) does not handle BOs during reset, or a reset during resume wreaks havoc with the BOs(?). Either this is kernel-related as well, or it is a bug inside Mesa. When trying Arch as a base, I could not get past these graphical artifacts. Oddly enough, the exact same errors happen on Vega when people play games, so it seems to be something related to shaders. Maybe turning off 3D and running the latest of everything would work and help pin down the issue.
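
To compare runs, it may help to count which amdgpu failure messages show up in each suspend cycle. The following is a rough sketch under stated assumptions: the signature patterns are copied from messages seen in this issue, `PM: suspend entry` is assumed to mark the start of each cycle, and the names `SIGNATURES` and `summarize` are made up for illustration.

```python
import re
import sys
from collections import Counter

# Message fragments taken from the logs in this issue.
SIGNATURES = {
    "page_fault": re.compile(r"no-retry page fault"),
    "ring_timeout": re.compile(r"ring \w+ timeout"),
    "gpu_reset": re.compile(r"GPU reset begin!"),
    "psp_failure": re.compile(r"psp .*command.* failed|PSP load tmp failed"),
    "reg_wait_timeout": re.compile(r"REG_WAIT timeout"),
}

# Assumption: the kernel logs this line when a suspend cycle starts.
CYCLE_START = "PM: suspend entry"


def summarize(lines):
    """Return one Counter per cycle; index 0 covers lines before the first suspend."""
    cycles = [Counter()]
    for line in lines:
        if CYCLE_START in line:
            cycles.append(Counter())
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                cycles[-1][name] += 1
    return cycles


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Expects a text file holding a saved kernel log.
        with open(sys.argv[1], errors="replace") as handle:
            for index, counts in enumerate(summarize(handle)):
                print(f"cycle {index}: {dict(counts)}")
    else:
        print("usage: pass the path of a saved kernel log")
```

A saved log can be produced with, for example, `journalctl -k > kernel.log`. A cycle with a non-zero `gpu_reset` or `psp_failure` count is then a candidate for closer inspection.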