Screen does not wake up after resume (AMD Ryzen 7 Pro 4750U)

isodude commented 3 years ago

Solved as of

linux-firmware-20230123-135.fc32.noarch xen-4.14.5-20.fc32.x86_64 kernel-latest-6.2.10-1.qubes.fc32.x86_64

Qubes OS release

R4.1, kernel 5.14.7-1 (Fedora 5.14; same behavior with older kernels), Xen 4.14.3 (built from @marmarek's branch)

Brief summary

The laptop does not resume after the third suspend/resume cycle. The problem seems to be related to:

[drm] psp command (0x7) failed and response status is (0xFFFF0007)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmp failed!

It feels like there's a hung process in the amdgpu drivers for some reason.

I'm not sure how to debug this properly; Xen is not giving me much info at all. The problem is obviously visible with X started as well, but I'm trying to keep the bug surface small.

Steps to reproduce

Boot the laptop with X disabled and no VMs started.
Run systemctl suspend three times (resuming after each).
Run reboot to restore the system.

Expected behavior

It should be possible to suspend and resume indefinitely.

Actual behavior

The screen does not wake up on the third resume. It is still possible to type reboot and restart the machine.

Notes

Works well with the kernel booted without Xen.

Attachments: crash.filtered.log, crash.filtered.xen.log

Workarounds

A bit more testing is needed but I do have sort of stable suspend/resume now. It even survives when everything goes south. There's a bit of tearing, but I'd rather have suspend than tearing.

cat << EOF > /etc/X11/xorg.conf.d/50-video.conf
Section "Device"
    Identifier "card0"
    Driver "amdgpu"
    Option "AccelMethod" "none"
EndSection
EOF

Compile xorg-x11-drv-amdgpu from https://github.com/freedesktop/xorg-xf86-video-amdgpu. Run make install and install amdgpu_drv.so in /usr/lib64/xorg/modules/drivers on dom0.

For more stability run with kernel cmdline preempt=none

Do note that e.g. 4k external screen will be royally sluggish.

Sometimes the screen comes up black. Type in the password anyhow, then switch to tty2 and back again (or suspend/resume again), and it will most likely come back to life. Suspending and resuming too quickly can lead to an instant reboot.

DemiMarie commented 2 years ago

Problems with GPU reset because Xen and PSP do not work well together; I confirmed this was fixed in Xen 4.15+.

@marmarek can we backport the fix?

Xorg (Mesa, I guess) does not handle BO during reset, or reset during unsuspend wreaks havoc with BO(?). Either this is kernel related as well or a bug inside Mesa. When trying Arch as a base I could not get past these graphic artifacts. Oddly enough, the exact same errors happen on Vega when people play games. Something related to shaders. Maybe turning off 3D plus the latest of everything would work and help pin down the issue.

I don’t think this is Qubes OS related. Please report it to Mesa upstream.

marmarek commented 2 years ago

@marmarek can we backport the fix?

I can look into it, but first I'd need to identify specific patch.

isodude commented 2 years ago

I can look into it, but first I'd need to identify specific patch.

I compiled the latest of everything yesterday to try to get the same, but obviously did not get the same results as last time around. I will try a bit more to get a good working sample.

What is odd is that it matters a whole lot in which order the initramfs loads the firmware and kernel modules. In a certain order suspend will not work at all.

kuruczgy commented 2 years ago

Well it seems like I am dealing with this issue too :/

@isodude, do you have any known working qubes setup? From what I gathered in this thread xen 4.15 solves the PSP issue. Would there be any downside to simply running that? Are there any qubes specific patches needed, or can I just throw upstream xen 4.15 at qubes builder?

Xorg (Mesa, I guess) does not handle BO during reset, or reset during unsuspend wreaks havoc with BO(?)

I don’t think this is Qubes OS related. Please report it to Mesa upstream.

Any progress on this? Is there an upstream bug report? I haven't run into the issue yet, but if I ever manage to get past the PSP issue, I assume I will.

DemiMarie commented 2 years ago

Problems with GPU reset because Xen and PSP do not work well together; I confirmed this was fixed in Xen 4.15+.

Can you do a bisection @isodude?

k4z4n0v4 commented 2 years ago

Where did we get with this? Is there anything to be done on our side?

This is the only issue keeping me from switching to Qubes full-time, would really love to help get this resolved.

mcku commented 2 years ago

Hi, looking at the logs, this device is likely a ThinkPad T14 AMD Gen 1 or equivalent, which recently received new firmware: 0.1.41, which resolves some issues with sleep. I am wondering whether it improves the situation. Link: https://support.lenovo.com/tr/en/downloads/ds544977-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-1-types-20ud-20ue

My current device is a 20UE which cannot resume from sleep, and I need a solution as well.

isodude commented 2 years ago

I tried with the newest firmware and kernel-latest (5.18) but I'm getting the infamous 'waiting for fences time out'.

It should be noted that the suspend/resume itself worked; I'm able to enter my password in the screensaver. It's after that moment that Xorg has a hard time rendering and after a while gives up. It seems that the kernel actually manages to reset the GPU though.

Regarding the Mesa error, it seems that the error 'Failed to initialize parser -125!' is related to the fact that the GPU got a reset and lost its VRAM, which the DE does not recover from at all (https://bugzilla.kernel.org/show_bug.cgi?id=205089). So the real issue with GPU reset may still exist even in Xen 4.15.

An interesting take from that thread is that GPU resets were solved by fixing memcpy for one specific person: https://gist.github.com/jnettlet/f6f8b49bb7c731255c46f541f875f436 Could there be something like that happening in Xen + PSP? It would make sense that the data is somehow scrambled after resume and that this triggers all sorts of madness afterwards.

I tried to get suspend working in my Arch install again, but it flat out doesn't work anymore. Also, bisecting Xen between 4.14 and 4.15 was horrendous since the versions in between didn't compile for some reason. I tried to run Xen 4.15 on Qubes R4.1 before but it was a mess; it may be easier now?

isodude commented 2 years ago

I tested a bit more, and suspend seems to work fine in console, which is very good. There are still artifacts after suspend. Starting X after suspend makes X crash; if I roll back linux-firmware I avoid a newly introduced bug.

With linux-firmware-20211216-127 I get this error

Aug 03 03:05:25 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6012, emitted seq=6014
Aug 03 03:05:25 dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3624 thread X:cs0 pid 3628
Aug 03 03:05:25 dom0 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Aug 03 03:05:25 dom0 kernel: [drm] free PSP TMR buffer
Aug 03 03:05:25 dom0 kernel: CPU: 0 PID: 11 Comm: kworker/u16:1 Tainted: G        W         5.18.9-1.fc32.qubes.x86_64 #1
Aug 03 03:05:25 dom0 kernel: Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET72W(1.41 ) 06/27/2022
Aug 03 03:05:25 dom0 kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Aug 03 03:05:25 dom0 kernel: Call Trace:
Aug 03 03:05:25 dom0 kernel:  <TASK>
Aug 03 03:05:25 dom0 kernel:  dump_stack_lvl+0x45/0x5a
Aug 03 03:05:25 dom0 kernel:  amdgpu_reset_reg_dumps.isra.0+0x13/0x93 [amdgpu]
Aug 03 03:05:25 dom0 kernel:  amdgpu_do_asic_reset+0x27/0x3a6 [amdgpu]
Aug 03 03:05:25 dom0 kernel:  amdgpu_device_gpu_recover_imp.cold+0x524/0x65b [amdgpu]
Aug 03 03:05:25 dom0 kernel:  amdgpu_job_timedout+0x17a/0x1b0 [amdgpu]
Aug 03 03:05:25 dom0 kernel:  drm_sched_job_timedout+0x76/0x110 [gpu_sched]
Aug 03 03:05:25 dom0 kernel:  process_one_work+0x1e5/0x3b0
Aug 03 03:05:25 dom0 kernel:  worker_thread+0x49/0x2e0
Aug 03 03:05:25 dom0 kernel:  ? rescuer_thread+0x3a0/0x3a0
Aug 03 03:05:25 dom0 kernel:  kthread+0xe7/0x110
Aug 03 03:05:25 dom0 kernel:  ? kthread_complete_and_exit+0x20/0x20
Aug 03 03:05:25 dom0 kernel:  ret_from_fork+0x22/0x30
Aug 03 03:05:25 dom0 kernel:  </TASK>

There are some reports around this, specifically about cleaning up old jobs. Maybe that is worth investigating. 5.19-rc7 should work properly according to https://gitlab.freedesktop.org/drm/amd/-/issues/2050. I'll stop investigating for today though.

isodude commented 2 years ago

Tried with kernel 5.19, no change.

@marmarek Is Xen supposed to handle ccp/psp? If not, note that ccp/psp both fail on my system when initializing. The module gets 0xffffffff.. back when reading the registers. I assume this is because Xen does not allow access to those memory regions. So I tried dom0-iommu=passthrough=1, and it failed with a blank screen followed by a reboot. I did not manage to catch the output even though I ran with console=vga vga=keep noreboot. Running with dom0-iommu=map-inclusive=1 did not help either. According to the documentation, the ACPI tables should tell which registers devices talk on, but I guess this is another bug? psp obviously works via the amdgpu driver. Not sure if it's worth bothering to debug the ccp/psp combo.

Suspend works in console btw, with the same text-jitter as I've experienced before. I would say this is on par with where my Arch Linux install was. I'm unable to start X properly after suspend, same effect as suspending while inside X.

DemiMarie commented 2 years ago

@isodude would you be able to do a git bisect between Xen versions to figure out what exact Xen patch fixed the problem? @marmarek any suggestions for doing so without breaking dom0 in the process?

isodude commented 2 years ago

@isodude would you be able to do a git bisect between Xen versions to figure out what exact Xen patch fixed the problem?

Oh right, actually finding the commit between xen 4.14.5 and 4.14.4 you mean, that would be easier! That I've done a couple of times so no problems.

marmarek commented 2 years ago

@marmarek Is Xen supposed to handle ccp/psp?

I don't think so, it should be all up to dom0.

marmarek commented 2 years ago

FWIW, I think the same issue applies to one system in our CI, in the suspend test: https://openqa.qubes-os.org/tests/44851#downloads (see suspend-journalctl.log). I can review historical logs, but AFAIR it never worked there. This one has AMD Ryzen 5 4500U.

I did not manage to bisect xen 4.15 - 4.14,

Bisecting Xen between major versions is complicated because of the unstable ABI. Without rebuilding several more parts (toolstack, libvirt etc.) no VM will start.

Does the issue happen if no VM is running at all too? If so, testing different Xen versions will be much easier. As for Xen patches used in Qubes, most of them are for libxl; the hypervisor itself should work without any patches.

isodude commented 2 years ago

Does the issue happen if no VM is running at all too? If so, testing different Xen versions will be much easier. As for Xen patches used in Qubes, most of them are for libxl; the hypervisor itself should work without any patches.

You could run Xen, suspend, start X, profit. I guess if I don't start lightdm I can run it without entering X and run the programs from console. Earlier I could trigger the problem by just starting Xen and then suspending without entering X. That now works.

FWIW, I think the same issue applies to one system in our CI, in the suspend test: https://openqa.qubes-os.org/tests/44851#downloads (see suspend-journalctl.log). I can review historical logs, but AFAIR it never worked there. This one has AMD Ryzen 5 4500U.

This should be correct. Well that is awesome!

isodude commented 2 years ago

@marmarek Is Xen supposed to handle ccp/psp?

I don't think so, it should be all up to dom0.

Ok, so. Do we have any prior art on how to use rmrr in Xen to allow Dom0 to write/read to ccp? I tried to fiddle around with it yesterday but I really have zero clue :)

marmarek commented 2 years ago

how to use rmrr in Xen to allow Dom0 to write/read to ccp?

IIUC dom0 should be able to write to any device by default (unless assigned to another domain). You can check the assignments with xl debug-keys Q.

isodude commented 2 years ago

ccp is 0000:07:00.2.

[2022-08-04 13:21:39] (XEN) 0000:07:00.6 - d0 - node -1  - MSIs < 103 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.5 - d0 - node -1  - MSIs < 104 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.4 - d1 - node -1  - MSIs < 78 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.3 - d1 - node -1  - MSIs < 76 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.2 - d0 - node -1  - MSIs < 65 66 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.1 - d0 - node -1  - MSIs < 102 >
[2022-08-04 13:21:39] (XEN) 0000:07:00.0 - d0 - node -1  - MSIs < 101 >
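
For anyone repeating this check on another machine, the owner of a single device can be pulled out of that listing with a small filter. This is only a sketch: pci_owner is a made-up helper name, and it assumes the line layout shown above (device address, a dash, then the owning domain).

```shell
# Hypothetical helper: print the domain that owns one PCI device.
# Reads a Xen PCI device dump (as printed above) on stdin.
pci_owner() {
    awk -v bdf="$1" '{
        # Find the field holding the device address; the owner is two
        # fields later (address, "-", owner).
        for (i = 1; i <= NF - 2; i++)
            if ($i == bdf) { print $(i + 2); exit }
    }'
}
```

On a live system the input would presumably come from the Xen console log, for example xl dmesg | pci_owner 0000:07:00.2 after sending the Q debug key.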

Sending xl debug-keys h revealed more options, so I also ran the MSI dump (M):

[2022-08-04 13:35:27] (XEN)  MSI     62 vec=41  fixed  edge   assert phys    cpu dest=00000000 mask=0/  /?
[2022-08-04 13:35:27] (XEN)  MSI     63 vec=49  fixed  edge   assert phys    cpu dest=00000000 mask=0/  /?
[2022-08-04 13:35:27] (XEN)  MSI-X   64 vec=2a  fixed  edge   assert phys    cpu dest=0000000c mask=1/  /0
[2022-08-04 13:35:27] (XEN)  MSI-X   65 vec=91  fixed  edge   assert phys    cpu dest=00000000 mask=1/HG/1
[2022-08-04 13:35:27] (XEN)  MSI-X   66 vec=99  fixed  edge   assert phys    cpu dest=00000000 mask=1/HG/1
[2022-08-04 13:35:27] (XEN)  MSI-X   67 vec=d9  fixed  edge   assert phys    cpu dest=0000000e mask=1/  /0
[2022-08-04 13:35:27] (XEN)  MSI-X   68 vec=47  fixed  edge   assert phys    cpu dest=00000000 mask=1/  /0
[2022-08-04 13:35:27] (XEN)  MSI-X   69 vec=4e  fixed  edge   assert phys    cpu dest=0000000e mask=1/  /0

Looking through the code H means host masked and G guest masked.
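
To pick out only the entries that carry those flags, the dump can be filtered. A rough sketch, assuming the mask=<n>/<flags>/<n> layout shown above and the H/G reading described here; masked_msis is a hypothetical name.

```shell
# Hypothetical helper: list MSI numbers whose mask field carries H and/or G.
# Reads a Xen MSI dump (as printed above) on stdin; prints "<msi> <flags>".
masked_msis() {
    awk '
        match($0, /mask=[01]\/[H ][G ]\//) {
            # The two flag characters sit right after "mask=<n>/".
            flags = substr($0, RSTART + 7, 2)
            gsub(/ /, "", flags)
            if (flags == "") next
            # The MSI number is the field following "MSI" or "MSI-X".
            for (i = 1; i < NF; i++)
                if ($i == "MSI" || $i == "MSI-X") { print $(i + 1), flags; break }
        }'
}
```

Fed the dump above, this should leave only entries 65 and 66, the two vectors that the PCI listing assigns to 0000:07:00.2.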

lspci -vvvv tells us

07:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
        Subsystem: Lenovo Device 5081
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin C routed to IRQ 36
        Region 2: Memory at fd200000 (32-bit, non-prefetchable) [size=1M]
        Region 5: Memory at fd3cc000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [64] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable- Count=1/2 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] MSI-X: Enable+ Count=2 Masked-
                Vector table: BAR=5 offset=00000000
                PBA: BAR=5 offset=00001000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Kernel driver in use: ccp
        Kernel modules: ccp

It's MSI-X enabled, but not maskable by us, which I assume is correct since Xen handles the masking. I don't understand why only those MSIs have both H and G while none of the others do.

k4z4n0v4 commented 2 years ago

Can I jump in and provide any help? I have a Ryzen 5xxx series that also exhibits the same behaviour.

isodude commented 2 years ago

I'm trying out the xen cmdline rmrr=0xfd200000-0xfd2fffff=0:7:0.2 now, but it isn't recognized anywhere.

Can I jump in and provide any help? I have a Ryzen 5xxx series that also exhibits the same behaviour.

What I'm about to do is to bisect Xen to see where between 4.14.X and 4.14.5 the suspend in console started working properly. What versions are you running now and how does suspend work if you stop lightdm (or w/e login manager you have) and run systemctl suspend in console?

k4z4n0v4 commented 2 years ago

Running Xen 4.14.5, after qvm-shutdown --all and systemctl stop lightdm, resuming from a suspended state freezes the TTY. Can't switch to another, won't accept any input.

k4z4n0v4 commented 2 years ago

Actually, on my second attempt it resumed properly the way it was intended. I can send the relevant parts of the two boot logs if needed, for diffing.

isodude commented 2 years ago

Actually, on my second attempt it resumed properly the way it was intended. I can send the relevant parts of the two boot logs if needed, for diffing.

Great, so we would need to check whether the behavior is the same in Xen 4.14.1 through 4.14.4. If you would like to, you could check that. qubes-dom0-update --action downgrade can be used to downgrade. I think we also need to downgrade all related xen packages (rpm -qa | grep xen | grep 4.14.5). I don't remember that well, but I think so anyhow :)

isodude commented 2 years ago

I just did a batch job and tested all the way down to Xen 4.14.1-1 (couldn't install 4.14.0-X). Suspend in console works, so I guess it's a kernel issue.

isodude commented 2 years ago

Just tested the latest firmware + 5.19.0 (with buddy patches applied), and X does not lock up the whole desktop in a hard state. It still doesn't work, but I would say that's progress. I may need to do a run later to see if it's the same if I change Xen version. Maybe there's something.

With that said, here are the errors: dmesg.txt

isodude commented 2 years ago

Issue opened at freedesktop: https://gitlab.freedesktop.org/drm/amd/-/issues/2114

isodude commented 2 years ago

I tried out linux-firmware-20220804 (with amd firmware 22.20) and it still doesn't work, but the errors are different.

I also discovered that since I chose to compile with PMC=y my dracut failed. I fixed that, which might have some effect as well. journal-20220804.txt

marmarek commented 2 years ago

Does the issue happen if no VM is running at all too? If so, testing different Xen versions will be much easier.

I did some tests. Here are results:

It does happen without any VM running at all, so it should ease debugging. I can reproduce it with running just Xorg in dom0 + suspend
It still happens on Xen master branch, sometimes only on a second suspend attempt, but unfortunately it's still the case there.

I have prepared a reliable test case (2 suspend attempts + dmesg | grep "PSP resume failed") so I can launch automated bisection between arbitrary Xen or Linux versions (it will take time, but it's fully automated so it isn't my time :) ). But for that I need at least one "good" version, which apparently we don't have...
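
A check along those lines could be wired into git bisect run roughly as follows. This is not the actual script used here, only a sketch: psp_verdict is a hypothetical name, and it relies on the git bisect run convention that exit status 0 marks a commit good and 1 marks it bad. It also matches the "PSP load tmp failed" message from the original report, which is an assumption about what counts as a failure.

```shell
# Hypothetical helper: classify a kernel log read from stdin.
# Returns 1 (bad) if a PSP failure message is present, 0 (good) otherwise.
psp_verdict() {
    if grep -q -e 'PSP resume failed' -e 'PSP load tmp failed'; then
        return 1
    fi
    return 0
}
```

In a bisect script this would run after the two suspend/resume cycles, for example as dmesg | psp_verdict.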

marmarek commented 2 years ago

I'm trying out the xen cmdline rmrr=0xfd200000-0xfd2fffff=0:7:0.2 now, but it isn't recognized anywhere.

This is AMD, so it's called ivmd=...

isodude commented 2 years ago

I'm trying out the xen cmdline rmrr=0xfd200000-0xfd2fffff=0:7:0.2 now, but it isn't recognized anywhere.

This is AMD, so it's called ivmd=...

Right, so that was introduced in xen 4.16. I'll see what I can do here.

isodude commented 2 years ago

I'm trying out the xen cmdline rmrr=0xfd200000-0xfd2fffff=0:7:0.2 now, but it isn't recognized anywhere.

This is AMD, so it's called ivmd=...

Right, so that was introduced in xen 4.16. I'll see what I can do here.

@marmarek I tried to backport the ivmd patches, but there are a lot of them, so I'm giving up for now.

xen_rm_opts="edd=off ivmd=fd200-fd2ff=0:07:0.2;fd3cc-fd3cd=0:07:0.2 loglvl=debug"

Maybe try out the test bench and xen master with the above?
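
The frame numbers in that ivmd= value can be derived from the BAR addresses and sizes that lspci reported. A minimal sketch, assuming 4 KiB pages and the start-end=device layout used in the line above; ivmd_range is a hypothetical helper, not part of any Xen tooling.

```shell
# Hypothetical helper: turn a BAR base address and size into one ivmd= range.
# Frame numbers are the addresses shifted right by 12 bits (4 KiB pages assumed).
ivmd_range() {
    # $1 = base address, $2 = size in bytes, $3 = device address
    printf '%x-%x=%s\n' "$(( $1 >> 12 ))" "$(( ($1 + $2 - 1) >> 12 ))" "$3"
}
```

ivmd_range 0xfd200000 0x100000 0:07:0.2 gives fd200-fd2ff=0:07:0.2, matching the first range above. Because the ranges are joined with a semicolon, the whole value most likely needs quoting in the grub configuration.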

marmarek commented 2 years ago

I tried, but the issue is still there. I also got this warning:

(XEN) AMD-Vi: Warning: IVMD: [fd200000,fd300000) is not (entirely) in reserved memory

EDIT: after fixing quoting in grub, I get it for both ranges

isodude commented 2 years ago

I tried, but the issue is still there. I also got this warning:
(XEN) AMD-Vi: Warning: IVMD: [fd200000,fd300000) is not (entirely) in reserved memory
EDIT: after fixing quoting in grub, I get it for both ranges

Do you see any change in the output of journalctl -b | grep ccp?

marmarek commented 2 years ago

Interestingly, adding drm.debug=1 to Linux dom0 makes S3 succeed more times (3-4 successful), but it still eventually fails. Some race condition then? Lack of synchronization between gpu and ccp resume?

isodude commented 2 years ago

Would not even surprise me. It's a job that is getting stuck in the amdgpu driver somewhere. That thing is a big mess and their fixes are tryhard at best.

What kernel are you using now?

marmarek commented 2 years ago

5.10.61 right now, but AFAIR same happened on newer too. openQA tested at least 5.15.57. I'll test newer one again to be sure.

isodude commented 2 years ago

5.10.61 right now, but AFAIR same happened on newer too. openQA tested at least 5.15.57. I'll test newer one again to be sure.

Interesting things happen in 5.17, so kernel-latest-5.18 is quite interesting.

isodude commented 2 years ago

I'm trying to build linux kernel with gcc 12 to get KCSAN.

marmarek commented 2 years ago

I've tried 5.19, the issue is still there, looks exactly the same

isodude commented 2 years ago

Booting kernel compiled with gcc 12.1 and KCSAN support made Xen crash with page fault in x86_64/entry.S#create_bounce_frame+0x135/0x157. Saw that there's one issue in 2017 in grub with almost the same problem but other than that seems like I'm doing something wrong :)

isodude commented 2 years ago

This might be very related. Regarding tee init command failed

https://lists.xenproject.org/archives/html/xen-devel/2022-06/msg01383.html

marmarek commented 2 years ago

I tried this quick workaround but it still failed to init:

Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: enabling device (0000 -> 0002)
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: ccp: unable to access the device: you might be running a broken BIOS.
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: tee: ring init command failed (0x00000005)
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: tee: failed to init ring buffer
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: tee initialization failed
Aug 10 12:25:44 localhost kernel: ccp 0000:04:00.2: psp initialization failed

I also added dom0-iommu=strict iommu=verbose to Xen, hoping to get an IOMMU page fault if this is really about the address given to the device, but no failure got logged. As for the "broken BIOS" message, it can be relevant. I'll try newer firmware, but I don't have high hopes. I haven't found any related option in firmware settings. I see the psp driver also has a similar check that IIUC happens before the ccp one, and it doesn't fire on this machine, so at least one thing appears to work...

marmarek commented 2 years ago

When I run the same Linux without Xen, I still get the "broken BIOS" message but then tee enabled and psp enabled. (still old firmware)

isodude commented 2 years ago

I'm trying to patch tee-dev.c such that ring init does not fail, will probably spend some time with it tonight. Join me if you would like :)

marmarek commented 2 years ago

So, I found one more place that needs similar patch, https://gist.github.com/marmarek/0d14a340d7045f21d8c1c35ccad4c6c4. Now I get:

[   12.775538] ccp 0000:04:00.2: enabling device (0000 -> 0002)
[   12.775681] xen: registering gsi 36 triggering 0 polarity 1
[   12.775692] Already setup the GSI :36
[   12.776025] ccp 0000:04:00.2: ccp: unable to access the device: you might be running a broken BIOS.
[   12.786222] ccp 0000:04:00.2: tee enabled
[   12.786243] ccp 0000:04:00.2: psp enabled

But the resume issue is still there
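
When comparing boots like the two above, the ccp lines can be reduced to a single status word. A small sketch, assuming the message texts shown in these logs; ccp_status is a hypothetical name.

```shell
# Hypothetical helper: summarise ccp/psp/tee initialisation from a kernel log on stdin.
# Prints "failed", "enabled" or "unknown".
ccp_status() {
    awk '
        /ccp .*tee enabled/           { tee = 1 }
        /ccp .*psp enabled/           { psp = 1 }
        /ccp .*initialization failed/ { failed = 1 }
        END {
            if (failed)          print "failed"
            else if (tee && psp) print "enabled"
            else                 print "unknown"
        }'
}
```

This could be fed from journalctl -b | grep ccp. Note that "enabled" only says the driver initialised; as seen above, it says nothing about whether resume works.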

DemiMarie commented 2 years ago

@marmarek looks like __psp_pa() is broken on Xen. Better to change it once than everywhere.

isodude commented 2 years ago

@marmarek looks like __psp_pa() is broken on Xen. Better to change it once than everywhere.

Oh, better to override the #define you mean. Well, it worked on my system rocking Xen 4.14.5, so it will make it better for folks anyhow. No resume still though. I'm trying to get my kernel to work properly now with some more kernel options. I'm running a much later linux-firmware so I hope that will yield some different results.

k4z4n0v4 commented 2 years ago

(Thanks a lot for your efforts on this guys.)

isodude commented 2 years ago

Compiled the patch I posted in the AMD issue and managed to boot it such that psp: enabled. I also compiled with CONFIG_AMDTEE. Same problem.

I will try with drm.debug to see if I can reproduce what marmarek noticed. I figure that we may have some problem in dma_fence in amdgpu.

isodude commented 2 years ago

Could this be related?

+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -34,6 +34,7 @@
 #include <linux/mman.h>
 #include <linux/file.h>
 #include <linux/pm_runtime.h>
 #include "amdgpu_amdkfd.h"
 #include "amdgpu.h"

@@ -1969,7 +1970,7 @@ int kfd_reserved_mem_mmap(struct kfd_dev *dev, struct kfd_process *process,
                | VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP;
        /* Mapping pages to user process */
        return remap_pfn_range(vma, vma->vm_start,
                              PFN_DOWN(__pa(qpd->cwsr_kaddr)),

QubesOS / qubes-issues