Open lisyarus opened 9 months ago
I'd love to help with debugging this issue, but making a minimal reproducible example feels shaky, since the less work is happening on the GPU, the less frequently the issue reproduces.
Is there anything else I can do to help?
If you haven't already, install the Vulkan SDK, make sure you've hooked up a logger to wgpu, and run with the validation layers enabled; this might catch the error if it's particularly egregious.
Otherwise it might be some UB in the usage of wgpu-native.
If it's not that, it might necessitate the use of NVIDIA Aftermath.
What am I doing
I am generating texture mipmaps using a simple compute shader. The texture loading process is asynchronous and happens in a separate thread, which loads the texture from file, creates the texture object, writes data to it (queue.write_texture), and submits compute passes that generate mipmap levels.
The relevant code of the project is here.
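For context, here is a minimal sketch of the per-level bookkeeping that this kind of mipmap generation involves: how many levels a texture has, and how many workgroups each level's dispatch needs. This is not the project's code; the 8x8 workgroup size and the names `MipDispatch`, `mip_level_count` and `plan_mip_dispatches` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One compute dispatch per generated mip level
// (levels 1..N-1; level 0 is the image uploaded from the file).
struct MipDispatch {
    std::uint32_t level;    // destination mip level
    std::uint32_t width;    // destination width in texels
    std::uint32_t height;   // destination height in texels
    std::uint32_t groups_x; // workgroups along x
    std::uint32_t groups_y; // workgroups along y
};

// Assumed workgroup size of the compute shader (hypothetical: 8x8 threads).
constexpr std::uint32_t kWorkgroupSize = 8;

// Length of the full mip chain: floor(log2(max(width, height))) + 1.
inline std::uint32_t mip_level_count(std::uint32_t width, std::uint32_t height) {
    std::uint32_t levels = 1;
    for (std::uint32_t size = std::max(width, height); size > 1; size /= 2) {
        ++levels;
    }
    return levels;
}

inline std::vector<MipDispatch> plan_mip_dispatches(std::uint32_t width,
                                                    std::uint32_t height) {
    std::vector<MipDispatch> plan;
    const std::uint32_t levels = mip_level_count(width, height);
    for (std::uint32_t level = 1; level < levels; ++level) {
        // Each level halves the previous one, clamped to at least one texel.
        const std::uint32_t w = std::max<std::uint32_t>(1, width >> level);
        const std::uint32_t h = std::max<std::uint32_t>(1, height >> level);
        // Round up so partially covered tiles still get a workgroup.
        plan.push_back({level, w, h,
                        (w + kWorkgroupSize - 1) / kWorkgroupSize,
                        (h + kWorkgroupSize - 1) / kWorkgroupSize});
    }
    return plan;
}
```

Each entry in the returned plan corresponds to one compute dispatch that reads from level `level - 1` and writes to `level`.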
Observed behavior
Usually, everything works fine, but occasionally the GPU hangs, after which either I have to kill the process manually, or the process crashes once I switch to a different window, with the following error (with `RUST_BACKTRACE=full`):
(I'm guessing that `Parent device is lost` indicates that most of this stacktrace is pretty much irrelevant?)
Expected behavior
The compute pass should finish correctly and always generate the mipmaps without crashing or hanging.
Tech stack
I am using a trunk build of wgpu-native in order to access better synchronization and filtering of floating-point textures, i.e. features that were merged not long ago, as I understand it. The project is in C++ (compiled with `gcc 13.2.1 20230826`), uses SDL2 for window creation, and the Vulkan backend for wgpu.
The wgpu device is requested with the `float32-filterable` feature enabled.
Additional notes
I observed frequent hangs earlier, and noticed that I never ended the compute pass. After adding the appropriate `wgpuComputePassEncoderEnd` call, the hangs became much less frequent, and now occur once every 5-10 runs of the program.
I also observed that if the compute pass does literally nothing (is created, then immediately ended, turned into a command buffer, and submitted to the queue), the hangs still occur, though even less frequently.
If the whole compute pass (from creating an encoder to submitting the command buffer to the queue) is moved to the main rendering thread, the hangs disappear, and everything works as expected.
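As a rough illustration of this kind of workaround, here is a minimal sketch of handing work from the loader thread over to the main rendering thread, so that only the main thread touches the queue. `MainThreadTasks`, `enqueue` and `drain` are hypothetical names and not part of wgpu-native; this is a generic pattern, not the project's actual code.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical helper: worker threads enqueue work, the main thread runs it.
class MainThreadTasks {
public:
    // May be called from any thread, e.g. from the texture loader once it
    // has something that should run on the main thread.
    void enqueue(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(task));
    }

    // Meant to be called once per frame from the main rendering thread only.
    // Returns the number of tasks that were run.
    std::size_t drain() {
        std::vector<std::function<void()>> tasks;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks.swap(pending_);
        }
        // Run outside the lock so that tasks may enqueue follow-up work.
        for (auto& task : tasks) {
            task();
        }
        return tasks.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void()>> pending_;
};
```

With something like this, the loader thread could call `tasks.enqueue([=] { wgpuQueueSubmit(queue, 1, &commands); });`, capturing the `queue` and `commands` handles by value, and the main loop would call `tasks.drain()` once per frame. The closure can just as well contain the whole compute pass.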
Update: curiously, if I move just the `wgpuQueueSubmit` call to the main rendering thread (while the texture creation and compute pass encoding stay in the separate thread), the hangs go away as well.
System info
Operating system: 6.1.19-gentoo
CPU info:
GPU: NVIDIA GeForce GTX 1060 6GB
Wgpu limits, as returned by `wgpuAdapterGetLimits`: