GPUOpen-Drivers / AMDVLK

AMD Open Source Driver For Vulkan
MIT License

RenderDoc Capture with Mesh shader payload causes GPU resets and system freezes #363

Open Firestar99 opened 1 month ago

Firestar99 commented 1 month ago

Opening a RenderDoc capture that contains task and mesh shaders which use a payload to pass data between them causes GPU resets and system freezes. I have found two very different ways of triggering it, both of them somehow related to AMDVLK. These repro instructions assume a clean Ubuntu 24.04 system to start with, so only RADV and no AMDVLK is installed.

Standard AMDVLK Capture

  1. Install AMDVLK deb package on your system
  2. Open the AMDVLK capture
  3. Rarely, RenderDoc will already freeze at this point
  4. Select the single vkCmdDrawMeshTasksEXT call
  5. Observe the RenderDoc window freezing, followed by a system freeze (5 out of 5 attempts) up to ~1 min later

Opening a RADV Capture while AMDVLK is just present but unused

  1. Open the RADV capture and observe RenderDoc working as expected
  2. Install AMDVLK deb package on your system
  3. Delete /etc/vulkan/implicit_layer.d/amd_icd64.json to remove the VK_LAYER_AMD_switchable_graphics_64 implicit layer, which forces you to always use the amdvlk driver
  4. Verify that vulkanCapsViewer can see both drivers: RADV as AMD Radeon Graphics (RADV REMBRANDT) and AMDVLK as AMD Radeon Graphics (I wish AMDVLK had a more identifiable name)
  5. Open the same RADV capture again, but this time observe RenderDoc freezing, likely followed by a GPU reset (5 out of 5 attempts) or, rarely, a system freeze (1 out of 5 attempts)
  6. With RADV you don't even need to select the draw itself; loading the capture is almost always enough.
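As a command-line complement to step 4, the following minimal sketch lists the ICD manifests that the Vulkan loader could pick up. The helper name `list_icd_manifests` is hypothetical, and the default directories are the usual Linux loader search paths, assumed here rather than taken from this report.

```shell
# Hypothetical helper: print every ICD manifest (*.json) found in the given
# directories. With no arguments it falls back to the usual Linux loader
# search paths (an assumption; your distribution may use others).
list_icd_manifests() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/vulkan/icd.d /usr/share/vulkan/icd.d
    fi
    for dir in "$@"; do
        for manifest in "$dir"/*.json; do
            # Skip the unexpanded glob when a directory has no manifests.
            if [ -e "$manifest" ]; then
                printf '%s\n' "$manifest"
            fi
        done
    done
    return 0
}
```

Running `list_icd_manifests` after step 2 should show a manifest for each installed driver if both RADV and AMDVLK are registered with the loader.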

=> My current conclusion is that opening a RADV capture while an AMDVLK device is available, even though it is unused, is enough to cause RenderDoc to freeze and a GPU reset to follow.

RenderDoc log, in case you want to confirm that RenderDoc indeed uses RADV as the replay device, with AMDVLK merely being present.

Related issues

https://github.com/baldurk/renderdoc/issues/3309
https://gitlab.freedesktop.org/mesa/mesa/-/issues/11156

WenqingLiAMD commented 1 day ago

Hello @Firestar99, thanks for the bug report.

I tried to reproduce the issue with new_amdvlk_freeze.rdc and new_radv_with_amdvlk_installed_gpu_reset.rdc on NAVI21, but observed neither GPU resets nor system freezes.

> => My current conclusion is that opening a RADV capture and an amdvlk device being available, even though it is unused, is enough to cause the Renderdoc to freeze and a gpu reset to follow.

Could you please confirm the above conclusion by exporting VK_DRIVER_FILES to force the loader to use either RADV or AMDVLK? For example, if the variable is not set, we can see both the RADV and AMDVLK devices:

INFO | DRIVER:    linux_read_sorted_physical_devices:
INFO | DRIVER:         Original order:
INFO | DRIVER:               [0] AMD Radeon RX 6800 (RADV NAVI21)
INFO | DRIVER:               [1] llvmpipe (LLVM 15.0.7, 256 bits)
INFO | DRIVER:               [2] AMD Radeon RX 6800
INFO | DRIVER:         Sorted order:
INFO | DRIVER:               [0] AMD Radeon RX 6800 (RADV NAVI21)
INFO | DRIVER:               [1] AMD Radeon RX 6800
INFO | DRIVER:               [2] llvmpipe (LLVM 15.0.7, 256 bits)

Once you specify the RADV JSON with export VK_DRIVER_FILES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json, only the RADV device is visible:

INFO | DRIVER:    linux_read_sorted_physical_devices:
INFO | DRIVER:         Original order:
INFO | DRIVER:               [0] AMD Radeon RX 6800 (RADV NAVI21)
INFO | DRIVER:         Sorted order:
INFO | DRIVER:               [0] AMD Radeon RX 6800 (RADV NAVI21)
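The two configurations could be switched with a small wrapper like the sketch below. The function name `select_vulkan_driver` is hypothetical. The RADV manifest path is the one quoted above; the AMDVLK manifest path is an assumption about where the deb package installs it and should be checked on the affected system. `VK_LOADER_DEBUG=driver` asks the loader to print driver-related messages.

```shell
# Hypothetical helper: restrict the Vulkan loader to a single driver by
# pointing VK_DRIVER_FILES at that driver's ICD manifest.
select_vulkan_driver() {
    case "$1" in
        radv)
            manifest=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json
            ;;
        amdvlk)
            # Assumed install location of the AMDVLK manifest; verify locally.
            manifest=/etc/vulkan/icd.d/amd_icd64.json
            ;;
        *)
            echo "usage: select_vulkan_driver radv|amdvlk" >&2
            return 1
            ;;
    esac
    if [ ! -e "$manifest" ]; then
        echo "warning: $manifest not found" >&2
    fi
    export VK_DRIVER_FILES="$manifest"
    # Have the loader log driver-related messages.
    export VK_LOADER_DEBUG=driver
}
```

For example, `select_vulkan_driver radv && qrenderdoc new_radv_with_amdvlk_installed_gpu_reset.rdc` would replay the RADV capture with only RADV visible to the loader. If the freeze disappears in that configuration, that would support the conclusion that the mere presence of the AMDVLK device is the trigger.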