GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

GPU driver crash on BeamNG experimental Linux support #276

Closed: UltraBlackLinux closed this issue 2 months ago

UltraBlackLinux commented 2 years ago

Hey there, I've encountered a GPU driver crash in the experimental Linux build of BeamNG (running on Vulkan), and I don't know whether this is a bug in the build or in the driver. Regardless, I would really appreciate some help, at least with debugging the error, since I have no idea how to find the relevant information in this log (reversed, newest entries first):

Jun 18 11:28:48 lolcat kernel: amdgpu_cs_ioctl: 1426 callbacks suppressed
Jun 18 11:28:43 lolcat kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jun 18 11:28:43 lolcat kernel: [drm] Skip scheduling IBs!
Jun 18 11:28:43 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset(2) succeeded!
Jun 18 11:28:43 lolcat kernel: [drm] Skip scheduling IBs!
Jun 18 11:28:43 lolcat kernel: [drm] Skip scheduling IBs!
Jun 18 11:28:43 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: recover vram bo from shadow done
Jun 18 11:28:43 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: recover vram bo from shadow start
Jun 18 11:28:43 lolcat kernel: [drm] VCE initialized successfully.
Jun 18 11:28:43 lolcat kernel: [drm] UVD and UVD ENC initialized successfully.
Jun 18 11:28:43 lolcat kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.2 test failed (-110)
Jun 18 11:28:43 lolcat kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jun 18 11:28:43 lolcat kernel: amdgpu_cs_ioctl: 1640 callbacks suppressed
Jun 18 11:28:43 lolcat kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
Jun 18 11:28:42 lolcat kernel: [drm] VRAM is lost due to GPU reset!
Jun 18 11:28:42 lolcat kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F4006BB000).
Jun 18 11:28:42 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset succeeded, trying to resume
Jun 18 11:28:42 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: BACO reset
Jun 18 11:28:42 lolcat kernel:  </TASK>
Jun 18 11:28:42 lolcat kernel:  ret_from_fork+0x22/0x30
Jun 18 11:28:42 lolcat kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 18 11:28:42 lolcat kernel:  kthread+0x13f/0x160
Jun 18 11:28:42 lolcat kernel:  ? process_one_work+0x410/0x410
Jun 18 11:28:42 lolcat kernel:  worker_thread+0x55/0x4d0
Jun 18 11:28:42 lolcat kernel:  process_one_work+0x255/0x410
Jun 18 11:28:42 lolcat kernel:  drm_sched_job_timedout+0x76/0x100 [gpu_sched e1ad176e079cc4741fc567ca39affb7fff944b55]
Jun 18 11:28:42 lolcat kernel:  amdgpu_job_timedout+0x18c/0x1c0 [amdgpu b7b4a8b02712d7291d126b834fe1e40fca4fc677]
Jun 18 11:28:42 lolcat kernel:  amdgpu_device_gpu_recover_imp.cold+0x6a9/0xa02 [amdgpu b7b4a8b02712d7291d126b834fe1e40fca4fc677]
Jun 18 11:28:42 lolcat kernel:  amdgpu_do_asic_reset+0x2a/0x470 [amdgpu b7b4a8b02712d7291d126b834fe1e40fca4fc677]
Jun 18 11:28:42 lolcat kernel:  dump_stack_lvl+0x48/0x5d
Jun 18 11:28:42 lolcat kernel:  <TASK>
Jun 18 11:28:42 lolcat kernel: Call Trace:
Jun 18 11:28:42 lolcat kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jun 18 11:28:42 lolcat kernel: Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming 3/AX370-Gaming 3-CF, BIOS F50a 11/27/2019
Jun 18 11:28:42 lolcat kernel: CPU: 8 PID: 148 Comm: kworker/u64:4 Not tainted 5.18.5-zen1-1-zen #1 1b0f14670b06387fbfac0d7e749656f6285dc8ca
Jun 18 11:28:42 lolcat kernel: amdgpu: rlc is busy, skip halt rlc
Jun 18 11:28:42 lolcat kernel: amdgpu: cp is busy, skip halt cp
Jun 18 11:28:42 lolcat kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Jun 18 11:28:42 lolcat kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jun 18 11:28:41 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Jun 18 11:28:41 lolcat kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BeamNG.drive.x6 pid 14825 thread BeamNG.drive.x6 pid 14825
Jun 18 11:28:41 lolcat kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=50033, emitted seq=50036
Jun 18 11:28:41 lolcat kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jun 18 11:28:41 lolcat kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jun 18 11:28:41 lolcat kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jun 18 11:28:36 lolcat kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!

...

Jun 18 11:28:22 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: VM fault (0x01, vmid 4, pasid 32779) at page 219863367, read from 'TC0' (0x54433000) (72)
Jun 18 11:28:22 lolcat kernel: amdgpu 0000:08:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048001
Jun 18 11:28:22 lolcat kernel: amdgpu 0000:08:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0D1AD947
Jun 18 11:28:22 lolcat kernel: amdgpu 0000:08:00.0: amdgpu: GPU fault detected: 147 0x0a384801 for process BeamNG.drive.x6 pid 14825 thread BeamNG.drive.x6 pid 14825

(Complete journal) The GPU driver crashed and reloaded, but with the whole screen being one giant mess of artifacts and glitched colors. Is this a problem in BeamNG, in my driver configuration, or something else? I'm pretty new to debugging graphics drivers, so I have little idea what I am doing. Thanks!
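For readers wondering how to pick the relevant entries out of a journal like the one above, here is a minimal Python sketch. It is not part of the original report. It puts the newest-first lines back in chronological order, keeps only fault, ring-timeout, and reset messages, and names the two negative error codes that appear in the log. The function names `triage` and `errno_name` and the choice of patterns are hypothetical and illustrative. The errno table is hard-coded with the Linux values for 110 and 125.

```python
import re

# Linux errno numbers that appear in the log above, hard-coded so the
# sketch gives the same answer on any operating system.
LINUX_ERRNO = {110: "ETIMEDOUT", 125: "ECANCELED"}

# Illustrative selection of message patterns; not an exhaustive list.
PATTERNS = {
    "fault": re.compile(r"GPU fault detected|VM fault"),
    "timeout": re.compile(
        r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
    ),
    "reset": re.compile(r"GPU reset"),
}

# A negative number at the very end of a line, e.g. "(-110)" or "-125!".
ERRNO_RE = re.compile(r"\(?-(\d+)\)?!?\s*$")


def triage(lines, newest_first=True):
    """Return (kind, note, line) tuples in chronological order."""
    if newest_first:
        lines = list(reversed(lines))
    events = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            note = ""
            if kind == "timeout":
                ring, signaled, emitted = match.groups()
                pending = int(emitted) - int(signaled)
                note = f"{pending} fence(s) emitted on ring {ring} but never signaled"
            events.append((kind, note, line))
            break  # one category per line is enough
    return events


def errno_name(line):
    """Name the trailing negative errno on a line, if it is one we know."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    return LINUX_ERRNO.get(int(match.group(1)))
```

Fed the journal above, this puts the `GPU fault detected` and `VM fault` lines at 11:28:22 first, followed by the `ring gfx timeout` at 11:28:41 and then the reset messages. That ordering suggests, but does not prove, that the memory fault came before the hang.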

Flakebi commented 2 years ago

Judging from the output

amdgpu: GPU fault detected: 147 0x0a384801 for process BeamNG.drive.x6 pid 14825 thread BeamNG.drive.x6 pid 14825

it’s a segfault on the GPU (the GPU accessed an address it was not allowed to), but I’m not familiar enough with the kernel part to know what the TC0 means. It could be caused by either BeamNG or the driver.
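As a rough aid for reading the `VM fault` line, the sketch below splits the two fault registers from the log into the fields the kernel prints next to them. The bit positions are an assumption: they were chosen because they reproduce the logged values (`0x01`, `vmid 4`, `read`, `72`, page 219863367), and they should be checked against the kernel's register headers before being relied on. The 4 KiB page size and the helper name `decode_vm_fault` are assumptions too. The sketch does show that `0x54433000` is just the ASCII bytes `TC0` followed by a zero byte, so `'TC0'` is a tag for the hardware block that issued the access. On AMD GPUs "TC" commonly refers to the texture cache, though nothing in this thread confirms that reading.

```python
def decode_vm_fault(status, addr, mc_client):
    """Split the fault registers from the log into readable fields.

    Bit positions are an assumption that happens to reproduce the values
    printed in the log; verify against the kernel's register headers.
    """
    return {
        "protections": status & 0xFF,
        "mc_id": (status >> 12) & 0x1FF,
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
        "page": addr,
        "gpu_address": addr << 12,  # assuming 4 KiB GPU pages
        # The client ID is four ASCII bytes, padded with NUL.
        "client": mc_client.to_bytes(4, "big").rstrip(b"\0").decode("ascii"),
    }


# Values taken from the log lines in the original report.
fault = decode_vm_fault(
    status=0x08048001,     # VM_CONTEXT1_PROTECTION_FAULT_STATUS
    addr=0x0D1AD947,       # VM_CONTEXT1_PROTECTION_FAULT_ADDR
    mc_client=0x54433000,  # printed after 'TC0' in the VM fault line
)
```

Under these assumptions the decoded fields read as a read access by client `TC0` in VM context 4 to page 219863367. That identifies which hardware path faulted, not whether the application or the driver supplied the bad address.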

The GPU driver crashed and reloaded, but with the whole screen being one giant mess of artifacts and glitched colors.

That part is expected after a segfault. Linux cannot recover from a GPU reset (which also resets VRAM), so one needs to restart at least the GUI afterwards. Restarting the whole system is advisable, though, since there can be weird issues after a GPU reset.