Open unclejack opened 11 months ago
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1880004592
Report this on the AMD bugzilla above, then. Their engineers seem to be struggling to reproduce this issue. The more reproducers they get, the quicker we can get a fix from them.
https://www.phoronix.com/news/Radeon-Gallium3D-SDMA-Dropped
https://www.phoronix.com/news/RadeonSI-Disables-Polaris-SDMA
https://www.phoronix.com/news/RadeonSI-SDMA-CIK-CZ-Again
https://www.phoronix.com/news/AMDGPU-LSDMA-Light-SDMA
https://www.phoronix.com/news/RadeonSI-New-SDMA-Tex-Copy
I decided to google SDMA for the heck of it. The feature has so many issues.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/1889
AMD_DEBUG=nodma
I cannot reproduce this bug reliably, but I wonder whether this flag will help.
If anyone who can reproduce this easily enough can test it, that would be useful:
https://gitlab.freedesktop.org/drm/amd/-/issues/2220#note_2229270
I doubt it will change much, but you can always try.
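For anyone who wants to try the flag, here is a minimal sketch of a wrapper that launches a command with AMD_DEBUG=nodma set. It assumes the installed Mesa build still honors the nodma flag and that AMD_DEBUG takes a comma-separated list; the helper names env_with_nodma and run_with_nodma are made up for this example.

```python
import os
import subprocess
import sys


def env_with_nodma(base=None):
    """Return a copy of the environment with 'nodma' added to AMD_DEBUG.

    Assumption: AMD_DEBUG is a comma-separated list of flags, so an existing
    value is kept and 'nodma' is appended to it.
    """
    env = dict(os.environ if base is None else base)
    flags = [f for f in env.get("AMD_DEBUG", "").split(",") if f]
    if "nodma" not in flags:
        flags.append("nodma")
    env["AMD_DEBUG"] = ",".join(flags)
    return env


def run_with_nodma(command):
    """Run the command (a list of arguments) with the modified environment."""
    return subprocess.run(command, env=env_with_nodma()).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 nodma.py /path/to/game --some-arg
    raise SystemExit(run_with_nodma(sys.argv[1:]))
```

In Steam, the same effect should be achievable by putting AMD_DEBUG=nodma %command% in the game's launch options.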
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1880143149
Additional details found here: https://rocm.docs.amd.com/en/develop/conceptual/gpu-memory.html#system-direct-memory-access
It seems that my hypothesis about its function, based on the name, wasn't that far off.
Completely forgot about my desktop crashes on idle, but it was an sdma0 ring crash again.
@unclejack if you can easily test stuff, can you also try this test patch here? I never managed to test it, especially on an immutable distro, but it might be worth testing with people who can repro this much more quickly.
Thanks,
Marco.
Another data point I've found while roaming around is that forcing amdgpu.vm_update_mode=3 seems to resolve the issue. According to the driver docs:
vm_update_mode (int)
Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics only, 2 = Compute only, 3 = Both). The default is -1 (Only in large BAR(LB) systems Compute VM tables will be updated by CPU, otherwise 0, never).
This always forces the CPU to do the virtual memory updates (which likely means the sdma ring is no longer used for that job). It has a performance hit, but it might be worth it temporarily for testing while AMD wakes up. The patch in my previous post should be tested before this, though.
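To check which mode is actually in effect after a reboot, something like the sketch below might help. It assumes the parameter is readable at the usual sysfs location for module parameters (/sys/module/amdgpu/parameters/vm_update_mode); the descriptions are taken from the driver docs quoted above, and the function names are made up for this example.

```python
from pathlib import Path

# Assumption: amdgpu exposes the parameter at the standard sysfs path.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/vm_update_mode")

# Meanings as given in the driver docs quoted in this thread.
MODES = {
    -1: "default (CPU updates compute VM tables only on large BAR systems, otherwise never)",
    0: "never",
    1: "graphics only",
    2: "compute only",
    3: "both graphics and compute",
}


def describe_vm_update_mode(value):
    """Translate the integer parameter value into the documented meaning."""
    return MODES.get(value, f"unknown value {value}")


def read_vm_update_mode(path=PARAM_PATH):
    """Return the current value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_vm_update_mode()
    if current is None:
        print("vm_update_mode could not be read (is amdgpu loaded?)")
    else:
        print(f"vm_update_mode = {current}: CPU VM updates for {describe_vm_update_mode(current)}")
```

If it reports 3, the kernel command line option was picked up and the CPU is doing the VM updates for both graphics and compute.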
@RodoMa92: I don't have a simple way to reproduce the crash. It happens randomly when I'm in game. Perhaps it might be a good idea to find a way to reproduce the crash in a reliable way. That's likely to be a better idea.
Regardless, I'll build a new kernel with all the stab-in-the-dark kernel patches later.
update: I've managed to crash the driver again. Starting the same game which plays video before starting is what crashed it on a fresh boot. The Steam Deck was on battery. The newly built kernel is based on Valve's 6.1.52-valve14 kernel tree. It has the patch https://gitlab.freedesktop.org/drm/amd/uploads/ecfb67b0ae46e95d7ab30c49c932c95f/0001-drm-amdgpu-add-wmb-barrier-for-sdma-timeout-issue-te.patch applied on top. The patched kernel hasn't crashed yet. That probably doesn't mean much since it doesn't always crash.
Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1881479445
Did you get a GPU crash that recovered, or did only the game crash? Does the kernel log look identical to before?
Sadly mine has been quite decent since I got it, so reproducing this on my end is extremely difficult.
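To make comparing kernel logs between crashes easier, here is a rough sketch of a filter that pulls out amdgpu ring timeout lines and names the ring involved. The regular expression is an assumption about how those lines are worded (an amdgpu message containing "ring <name> timeout"); adjust it if your log looks different. The function name ring_timeouts is made up for this example.

```python
import re
import sys

# Assumption: timeout messages mention amdgpu and contain "ring <name> timeout".
RING_TIMEOUT = re.compile(r"amdgpu.*\bring\s+([\w.]+)\s+timeout", re.IGNORECASE)


def ring_timeouts(lines):
    """Return (ring_name, line) pairs for every line that looks like a ring timeout."""
    hits = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            hits.append((match.group(1), line.rstrip("\n")))
    return hits


if __name__ == "__main__" and not sys.stdin.isatty():
    # Reads a kernel log from standard input and prints the matching lines.
    for ring, text in ring_timeouts(sys.stdin):
        print(f"{ring}: {text}")
```

Piping the previous boot's kernel log into it, for example journalctl -k -b -1 | python3 ring_timeouts.py, should show whether the crash was on sdma0 again or on a different ring.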
Ok, this is what I've done so far:
I've also posted a comment on GitLab to let the people from AMD know. I have a semi-reliable way to trigger this bug now.
update: Since the amdgpu driver still crashes with the patch which attempts to work around potential cache coherence issues and the patches are provided on GitLab, I'll only post updates there. There have been no updates from Valve and no guidance was provided either after testing. I'll stop testing and debugging once the full set of existing patches is tested.
Can confirm mine also has the same sort of issues; the logs are the same too, save for the temp warnings. Setting the UMA frame buffer to 4 GB seems to offer longer times between crashes, but doesn't stop them entirely.
edit: I did have some success. I had a Starfield save in a shop where firing would instantly cause the crash. After doing an APU reset, which I didn't even realize was an option, it didn't crash and I was able to follow through. I'll update if the issue seems to return, but in the meantime, if anyone finds this post and is otherwise out of options: power off the device, hold down the vol - button and the quick access (...) button, and then hold the power button. After it beeps, let go of the power button and wait for it to boot. It'll take a while.
edit 2: Unfortunately I still got it, it just took a while :(
https://store.steampowered.com/news/app/1675200/view/4064004735511926127 has become available. It includes some changes which may help avoid crashes in some cases. The relevant commits added since valve14 are here https://gitlab.com/evlaV/linux-integration/-/commits/6.1.52-valve16?ref_type=tags.
Testing that 3.5.15 preview release might be a good idea if your Steam Deck crashes.
Voltage offsets for undervolting should be disabled in the firmware or in any tool if you have something like that. The silicon might not be good enough to work properly with those voltage offsets, and they may cause instability under load. It's something to rule out anyway.
I believe I'm seeing the same issues described here (with similar GPU / gamescope logs and without any temperature warnings).
Testing that 3.5.15 preview release might be a good idea if your Steam Deck crashes.
Am I right in thinking these changes have made it to the stable channel now? I'm on OS Version 3.5.17, Kernel Version 6.1.52-valve16, where it sounds like the potential fixes were released, so it seems they haven't helped in my case.
I'm also experiencing this error in several games. I even have a support ticket open at steampowered.com (HT-6PJV-6T7D-XJD6).
They asked me to send the Steam Deck to the Repair Center, even after I told them about this GitHub issue. Although, in my opinion, if this issue is in fact caused by a driver, sending it to the Repair Center is both a waste of time and resources for both Steam and myself.
But since they didn't provide me with an alternative I will send it anyway. I'm writing this in case it can help with fixing the issue.
To easily reproduce the issue, try playing "Headsnatchers" in single player mode (aka Zombie Castle). Every time I tried, it crashed in less than 15 min (even if you leave the game open without playing, it tends to crash eventually).
I have always used stable channel, my Steam Deck specs are as follows:
OS Name: "SteamOS Holo"
OS Codename: holo
OS Variant: steamdeck
OS Version: 3.5.19
OS Build: 20240422.1
Kernel Version: 6.1.52-valve16-1-neptune-61
Steam Deck Controller FW Build Date: Sun, Mar 3 11:54 PM UTC +01:00
BIOS Version: F7A0120
Steam Version: 1716584667
Steam Client Build Date: Fri, May 24 10:48 PM UTC +01:00
Steam Web Build Date: Fri, May 24 10:31 PM UTC +01:00
Steam API Version: SteamClient021
CPU Vendor: AuthenticAMD
CPU Name: AMD Custom APU 0405
CPU Frequency: 2.8 GHz
CPU Physical Cores: 4
CPU Logical Cores: 8
RAM Size: 14.47 GB
Video Card: AMD AMD Custom GPU 0405 (vangogh, LLVM 15.0.7, DRM 3.5.4, 6.1.52-valve16-1-neptune-61)
Video Driver: 4.6 (Compatibility Profile) Mesa 23.1.3 (git-87ebaf765d)
VRAM Size: 1,024 MB
It's the LCD 512GB SSD model. I bought it refurbished directly from Steam in November 2023. It has been presenting this issue from day 1.
In case this helps, I have been able to play the 3D game "Prey" without any issues.
Those who still run into this issue should send their Steam Deck's serial number to Mario Limonciello from AMD: https://gitlab.freedesktop.org/drm/amd/-/issues/3111#note_2438007. The goal is to figure out whether all of the affected units are from the same batch. This could help sort out this issue.
https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-2148121815
He didn't say what his email is?
For those who can still reproduce this issue, please post your details here https://gitlab.freedesktop.org/drm/amd/-/issues/3111. This will help avoid having the issue closed there. The issue should stay open since the root cause of the issue hasn't been discovered. My unit doesn't seem to exhibit the issue anymore.
@emcy849: You can find that easily on the Internet. It might not be a good idea to post it to avoid spam.
Just obfuscate the email and post it?
Your system information
Please describe your issue in as much detail as possible:
I expected gamescope and the gpu driver to not crash.
What happened:
Steps for reproducing this issue: