HansKristian-Work / vkd3d-proton

Fork of VKD3D. Development branches for Proton's Direct3D 12 implementation.
GNU Lesser General Public License v2.1

Monster Hunter World (DX12 High) Artifacting, Render Issues, then GPU Hang/Crash -> reset #2012

Open robobenklein opened 3 weeks ago

robobenklein commented 3 weeks ago

MHW gradually gets worse over a play session: first minor graphical artifacting, then game-breaking visibility glitches, and eventually a GPU hang/reset. (This is easily repeatable.)

It seems DX12 on High settings has been a problem in the past, but I haven't seen anything this bad yet (screenshot from a few minutes before the GPU hangs):

[Screenshots: 20240607021221_1, 20240607021317_1]

No issues with DX11, but DX12 in general performs far better until the problems start occurring.

I have video recordings of ~45 min of gameplay, or of specific artifact occurrences, if wanted. It took less than 2 hours of runtime to reach a complete system freeze / GPU hang and recovery, after which the game was graphically frozen but still running on the CPU.

Software information

Pop!_OS 22.04, Steam, Monster Hunter: World

System information

Launch options: PROTON_LOG=1 ionice -c 2 -n 0 gamemoderun %command%

Log files

Proton logs: steam-582010.log.gz (18.9MB compressed, ~880MB uncompressed)

Size-reduced / deduplicated logs: steam-582010.log.reduced.txt.gz (27MB uncompressed), produced with `cat steam-582010.log | tr ':' '\t' | uniq -f 3 | tr '\t' ':' > steam-582010.log.reduced`
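The pipeline above collapses adjacent log lines that differ only in their first three fields. A rough Python sketch of the same idea follows. It assumes those three fields are colon-separated (for example timestamp, process ID, thread ID), and the function name `dedupe_adjacent` is hypothetical. It only approximates `uniq -f 3`, which also treats spaces as field separators.

```python
from typing import Iterable, Iterator


def dedupe_adjacent(lines: Iterable[str], skip_fields: int = 3) -> Iterator[str]:
    """Yield lines, dropping any line whose content after the first
    `skip_fields` colon-separated fields equals that of the previous line."""
    previous_key = None
    for line in lines:
        # Split off the leading fields; everything after them is the comparison key.
        parts = line.rstrip("\n").split(":", skip_fields)
        key = parts[skip_fields] if len(parts) > skip_fields else ""
        if key != previous_key:
            yield line
            previous_key = key


if __name__ == "__main__":
    import sys

    # Usage: python dedupe.py < steam-582010.log > steam-582010.log.reduced
    sys.stdout.writelines(dedupe_adjacent(sys.stdin))
```

Like `uniq`, this only removes consecutive repeats, so the order of distinct messages is preserved and interleaved repeats are kept.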

GPU Timeout trigger:

Syslog excerpt

```
Jun 07 02:14:30 robo-triangulum kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.3.0 timeout, signaled seq=816585, emitted seq=816586
Jun 07 02:14:30 robo-triangulum kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process MonsterHunterWo pid 188420 thread vkd3d_queue pid 188583
Jun 07 02:14:30 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: IP block:gfx_v11_0 is hung!
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040B53
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x1
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process pid 0 thread pid 0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Jun 07 02:14:31 robo-triangulum kernel: [drm] kiq ring mec 3 pipe 1 q 0
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
Jun 07 02:14:31 robo-triangulum kernel: [drm] Skip scheduling IBs!
Jun 07 02:14:31 robo-triangulum kernel: [drm] Skip scheduling IBs!
Jun 07 02:14:31 robo-triangulum kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
Jun 07 02:14:31 robo-triangulum kernel: [drm] Skip scheduling IBs!
...
```

Even after quitting all game processes, wine, steam, etc, the logs continued spamming:

```
Jun 07 02:18:14 robo-triangulum kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jun 07 02:18:14 robo-triangulum kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jun 07 02:18:14 robo-triangulum kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jun 07 02:18:14 robo-triangulum kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jun 07 02:18:16 robo-triangulum kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jun 07 02:18:16 robo-triangulum kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jun 07 02:18:16 robo-triangulum kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
Jun 07 02:18:16 robo-triangulum kernel: [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Jun 07 02:18:16 robo-triangulum kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
```

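
For readers unfamiliar with the fault lines in the syslog excerpt: the kernel prints the raw `GCVM_L2_PROTECTION_FAULT_STATUS` value and then the same value broken into fields. The sketch below shows how the first status value, `0x00040B53`, maps onto those fields. The bit layout is an assumption inferred from the decoded values printed in this log, not taken from register documentation, and the `FIELDS` table and `decode_fault_status` name are hypothetical.

```python
# Assumed layout (name, bit offset, bit width), inferred from the log output.
FIELDS = (
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),  # the "Faulty UTCL2 client ID" value
    ("RW", 18, 1),
)


def decode_fault_status(status: int) -> dict:
    """Split a raw status value into the fields listed in FIELDS."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x00040B53).items():
        print(f"{name}: {value:#x}")
```

Under this assumed layout, `0x00040B53` decodes to the same values the kernel printed for the first fault (client ID 0x5, PERMISSION_FAULTS 0x5, and 0x1 for the remaining fields). The later faults with status `0x00000000` carry no further information.
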
doitsujin commented 3 weeks ago

Is this game even fully stable on Windows with D3D12? Everything about this screams app bug (probably memory management or descriptor related) but debugging something that takes several hours to reproduce isn't really viable for us.

robobenklein commented 3 weeks ago

If this is just an app bug then my apologies for the useless issue, it can be closed.

I did not think it should be possible for an application's bug to completely crash/hang the GPU; I would have expected the game to crash or freeze first while the hardware/desktop continued operating.

I can still repro this quite reliably, so if you have any further suggestions for what I should do to debug or help improve the drivers (to keep them from hanging the system?) that would be appreciated!

mbriar commented 3 weeks ago

GPU hangs that take down the whole system due to app or driver bugs aren't exactly rare with AMD on Linux, and GPU recovery still sometimes fails or at least also crashes the desktop environment, although the situation has improved somewhat.

Maybe you can get a hang report as described here: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/docs/drivers/amd/hang-debugging.rst?ref_type=heads or at least find a way to reproduce the hang that doesn't take hours of play every time.
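
When comparing reproduction attempts, it may help to pull the timeout details out of the kernel log automatically rather than by eye. The following is a minimal sketch that assumes the `amdgpu_job_timedout` line format shown in the syslog excerpt earlier in this thread. The function name `summarize_timeout` and the regular expressions are hypothetical and may need adjusting for other kernel versions.

```python
import re
from typing import Dict, Iterable, Union

# Patterns modelled on the two amdgpu_job_timedout lines in the excerpt.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_INFO = re.compile(
    r"Process information: process (?P<process>.+?) pid (?P<pid>\d+) "
    r"thread (?P<thread>.+?) pid (?P<tid>\d+)"
)


def summarize_timeout(log_lines: Iterable[str]) -> Dict[str, Union[str, int]]:
    """Collect ring, sequence numbers, and process/thread info from timeout lines."""
    summary: Dict[str, Union[str, int]] = {}
    for line in log_lines:
        if "amdgpu_job_timedout" not in line:
            continue
        for pattern in (RING_TIMEOUT, PROCESS_INFO):
            match = pattern.search(line)
            if match:
                for key, value in match.groupdict().items():
                    # Store numeric fields as integers, names as strings.
                    summary[key] = int(value) if value.isdigit() else value
    return summary


if __name__ == "__main__":
    import sys

    # Usage: journalctl -k | python summarize_timeout.py
    print(summarize_timeout(sys.stdin))
```

Run against the excerpt above, this would report the `comp_1.3.0` ring and the `vkd3d_queue` thread, which makes it easier to check whether each reproduction hits the same queue.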