GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

Broken Linux Vulkan support on Ryzen 5 7640U APU with Radeon 760M RDNA3 graphics #352

Closed by aufkrawall 2 months ago

aufkrawall commented 5 months ago

Framework Laptop 13, BIOS version 3.03
Ryzen 5 7640U APU
amdvlk 2024.Q1.1
Arch Linux 6.7

Vulkan support works in e.g. Strange Brigade, game seems to render fine. However, there is a huge number of games that are broken with this GPU.

Left 4 Dead 2 (native Linux version, start with -vulkan): crashes instantly or after a few seconds
Serious Sam Fusion (native Vulkan): instantly crashes
Lots of games with DXVK / VKD3D-Proton: some work, but many instantly crash

On Windows, Left 4 Dead 2 with DXVK doesn't crash (tested with the 24.1.1 driver). On my other Linux system with a dedicated Radeon 7800 XT graphics card, the crash issue also doesn't exist with amdvlk. So apparently RDNA3 APU GPU support is broken in some special way.

The issue does not seem to be limited to amdvlk; RADV doesn't work correctly either: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10427

It's very frustrating that this APU still doesn't work correctly on Linux. If you need any further information like some logs, I'll gladly be of help. But there really needs to be something done about this.

perlfu commented 5 months ago

@aufkrawall is there anything amdgpu related in the kernel dmesg log? i.e. page faults, timeouts, etc

aufkrawall commented 5 months ago

@perlfu Doesn't look like it: dmesg.log

Some levels in Left 4 Dead 2 seem to work (Left 4 Dead 1 ones with different shaders?), but e.g. "Dead Center" crashes pretty much instantly (on RADV as well).

Edit: In case you wonder about amdgpu.sg_display=0 as kernel boot parameter: I've tested without it and Linux 6.8-rc as well, with the same result.
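
For anyone repeating the test with and without that boot parameter, a minimal sketch for checking whether it is set is shown below. It assumes the usual space-separated `key=value` layout of the kernel command line; the helper names and the example command line are hypothetical, and only `amdgpu.sg_display=0` comes from this thread.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` in a kernel command line, or None if absent.

    A bare flag without '=' yields an empty string. If the parameter appears
    more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line for illustration; on a real system use read_cmdline().
example = "BOOT_IMAGE=/vmlinuz-linux root=/dev/nvme0n1p2 rw amdgpu.sg_display=0"
print(kernel_param(example, "amdgpu.sg_display"))  # prints 0
```

On a live system, `kernel_param(read_cmdline(), "amdgpu.sg_display")` returns `None` when the parameter was not passed at boot.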

Purpursarkans commented 4 months ago

Arch, GPU: 6700 non-XT

I think ambient occlusion does not work. I checked in No Man's Sky (the kernel freezes when opening any inventory or menu) and in Godot 4.2.1, Steam version (the kernel freezes when AO is turned on). Everything works as it should on amdvlk 2023.Q4.1-1. Here are the logs:

journalctl -b -1 -p 2 https://pastebin.com/rWy7GzTr

journalctl -b -1 -p 3 https://pastebin.com/RyhE6tNN

aufkrawall commented 4 months ago

This issue is confirmed by another user. It seems 760M GPU is affected, whereas 780M GPU is not affected: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10427#note_2313371

  1. Why did AMD apparently never test their own 760M APU on Linux for over a year?
  2. How can such a discrepancy between the Windows and Linux drivers happen in the first place?
  3. Why is nothing happening now? Where are the inquiries by AMD devs to track this issue down?
  4. Is Linux support for the 760M GPU just fake? Because lots of applications that don't violate the Vulkan spec SIMPLY DON'T WORK.

perlfu commented 4 months ago

@aufkrawall Thank you for the link to investigations on Mesa. To confirm, the issue seems to be specifically 760M on Linux with either RADV or AMDVLK? 780M and Windows are unaffected? If so, this seems likely a Linux kernel driver (or firmware) issue.

@jinjianrong do you have an appropriate Linux KMD contact to send this for further investigation?

aufkrawall commented 4 months ago

> @aufkrawall Thank you for the link to investigations on Mesa. To confirm, the issue seems to be specifically 760M on Linux with either RADV or AMDVLK? 780M and Windows are unaffected?

Thanks for your response. Yes. Mesa dev Samuel Pitoiset ( @hakzsam ) has a 780M GPU and wasn't able to reproduce (apart from an issue with UE5 Nanite, which was fixed). The user Roy Shapiro ( @royshapiro ) initially tested with a 760M GPU and could reproduce all issues (both crashes in some games and visual corruption in others). He then switched to a 780M APU/GPU with an otherwise unchanged system and the issues went away. Only Linux seems to be affected; the Windows driver seems to behave as expected (e.g. Left 4 Dead 2 Vulkan works and doesn't crash, unlike on Linux with both amdvlk and radv).

RoyShapiro commented 4 months ago

Hi! To be specific, I was able to reproduce and confirm that the issue affects 760M, but not 780M, on RADV. AMDVLK still barely works on 760M at all. Before the linux-firmware update around 20240208.fbef4d38-1, basically neither AMDVLK nor RADV worked for me on 760M.

More specifically, I was getting

[ 1296.534448] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=312160, emitted seq=312161
[ 1296.534936] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process (_insert game executable name here_) pid 3071 thread (_insert game executable name here_).exe pid 3071
[ 1296.535335] amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
[ 1296.698142] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1296.698283] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue

in dmesg.
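
To spot these entries in a longer log, a small filter can pull out the ring name and the two sequence numbers. This is only a sketch: the regular expression is modeled on the timeout line quoted above, and other kernel versions may word the message differently. The function name is hypothetical.

```python
import re

# Pattern modeled on the timeout line quoted in this thread (assumption:
# other kernel versions may format the message differently).
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return a (ring, signaled_seq, emitted_seq) tuple per timeout line."""
    hits = []
    for line in log_text.splitlines():
        m = RING_TIMEOUT.search(line)
        if m:
            hits.append((m["ring"], int(m["signaled"]), int(m["emitted"])))
    return hits


sample = (
    "[ 1296.534448] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=312160, emitted seq=312161"
)
print(find_ring_timeouts(sample))  # [('gfx_0.0.0', 312160, 312161)]
```

Feeding it the saved output of `dmesg` gives one tuple per timeout, which makes it easier to compare runs before and after a firmware or kernel change.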

After that update, literally a couple of games started working with AMDVLK (Shadow of the Tomb Raider being one of them); the rest still crash. On RADV most games at least started trying to work, but some crash and some have the visual glitches mentioned. All of them work perfectly fine on 780M with RADV. I have not yet tested 780M with AMDVLK, though I presume it will also work.

Considering @hakzsam said "That's super weird"(c) about the issue when presented with renderdoc capture by @aufkrawall , and the aforementioned heavy positive effect linux-firmware amdgpu blob update had on the issue, this leads me to believe that the problem has to do with that AMD firmware somehow.

RoyShapiro commented 4 months ago

Okay, did some more tests. Neither Serious Sam Fusion 2017 nor Hogwarts Legacy works with AMDVLK on 760M; both fail with *ERROR* ring gfx_0.0.0 timeout, signaled seq=some_number, emitted seq=other_number. However, on 780M both work as expected, no issues. AMDVLK used is 2024.Q1.1.

jinjianrong commented 4 months ago

Thanks all for reporting the issue. We are trying to reproduce the issue internally.

jinjianrong commented 3 months ago

We are able to reproduce the crash issue with Serious Sam Fusion on 760M. However, after installing the basekit (including KMD and firmware) from https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-23-40-2-0, the game can run although there are some other issues.

aufkrawall commented 3 months ago

Thanks, glad you can confirm internally. If an updated firmware binary or kernel patch becomes ready to fix things in upstream linux-firmware or the kernel, I'd gladly test it and report back.

teschnei commented 3 months ago

Hi, we noticed issues with occlusion queries in some of the games listed here, and made a fix here. I don't know if it'll fix all the issues you were seeing, but since it's occlusion query related, I guess it will have an effect on the flickering geometry that was being seen. I had tested only on SS Fusion from your list, so if you could test a kernel patched with this to see if it fixes your issues, that'd be great, thanks.

RoyShapiro commented 2 months ago

@teschnei, @aufkrawall This works. Tested Serious Sam Fusion 2017 in all modes (DXVK (D3D11), VKD3D-Proton (D3D12), Vulkan) all of which work with this fix now, and Hogwarts Legacy which with this fix no longer has any visible graphical issues, Shadow of the Tomb Raider also plays fine. I will try to test more games soon and post if any don't work, but these definitely work now.

Now for the mainline distro kernels to receive this fix too... :grin:

aufkrawall commented 2 months ago

Wow, it really looks like a 100% fix. I've tested Left 4 Dead 2 with amdvlk (but had to resort to Proton with -vulkan; imho it looks like the newest amdvlk version with graphics pipeline library support causes issues for the native Linux version with -vulkan), Left 4 Dead 2 with radv, Hogwarts Legacy with radv and Borderlands 2 with radv, and they all seem to work now without visual corruption and without crashes. Crazy that a patch that changes one symbol can make such a dramatic difference.

Thanks, everyone! I guess we can close this once it lands in a stable kernel, which hopefully is soon, as it's already in 6.9-rc4.

aufkrawall commented 2 months ago

Fix is in 6.8.7, closing.
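
For readers wondering whether their kernel already carries the fix, here is a rough sketch that compares a kernel release string against the versions named in this thread (6.8.7 stable, 6.9-rc4). It rests on assumptions: it only looks at the version number, so it cannot detect distro kernels that backported the patch to an older series, and the function names are hypothetical.

```python
import re

FIXED_IN_STABLE = (6, 8, 7)  # stable release named in this thread
FIXED_IN_RC = 4              # the fix was reported to be in 6.9-rc4


def parse_release(release):
    """Extract (major, minor, patch) from a string like '6.8.7-arch1-1'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in m.groups())


def has_fix(release):
    """Best-effort guess based on the version number alone."""
    version = parse_release(release)
    rc = re.search(r"-rc(\d+)", release)
    # 6.9 release candidates before rc4 predate the fix.
    if version == (6, 9, 0) and rc:
        return int(rc.group(1)) >= FIXED_IN_RC
    return version >= FIXED_IN_STABLE


for release in ("6.7.9-arch1-1", "6.8.7-arch1-1", "6.9.0-rc3", "6.9.0-rc4"):
    print(release, has_fix(release))
```

On a running system the release string can be obtained with `platform.release()` from the standard library, which on Linux matches the output of `uname -r`.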