SludgePhD opened this issue 2 months ago
CC @jimblandy, who's been working on issues like this recently.
Could you pull out from those stacks the locks each thread is holding, if any?
Actually, it might suffice simply to know which lock each thread is trying to acquire, and I could figure out which other ones it must be holding.
The deadlock appears to be caused by:

- command_encoder_end_compute_pass acquires the buffer read lock before the bind group read lock here: https://github.com/gfx-rs/wgpu/blob/edf1a86148d1a62da857633fb224aa569f21ce4e/wgpu-core/src/command/compute_command.rs#L82-L83
- command_encoder_end_render_pass acquires the bind group read lock before the buffer read lock here: https://github.com/gfx-rs/wgpu/blob/ad6774f7bb9c327238322d9e5beeb1c9a0c6e89d/wgpu-core/src/command/render.rs#L1385-L1389

In the backtraces above, there is one thread in the first location holding the buffers lock and trying to acquire the bind_groups lock, and one thread in the second location holding most locks (including the bind_groups one) and trying to acquire the buffers lock.
While these are all RwLocks, and both of the acquisitions above are read locks, there are also several threads trying to acquire write locks on both the bind_groups and buffers storages. Because parking_lot's RwLock implementation is fair, a pending write lock blocks any new attempts to acquire read locks until the writer gets its turn, which completes the deadlock.
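To illustrate, here is a minimal, self-contained sketch of that interleaving. This is not wgpu code: the buffers and bind_groups locks below are stand-ins for the Hub storages, and the sleeps just make the race reliable enough to reproduce. Running it is expected to hang, since each reader's second read() waits behind a queued writer, and each writer waits on the other reader's first guard.

```rust
use std::sync::Arc;
use std::thread;
use std::time::Duration;

use parking_lot::RwLock;

fn main() {
    // Stand-ins for the buffer and bind group storages.
    let buffers = Arc::new(RwLock::new(()));
    let bind_groups = Arc::new(RwLock::new(()));

    // Thread 1: compute-pass order (buffers first, then bind_groups).
    let (b, g) = (Arc::clone(&buffers), Arc::clone(&bind_groups));
    let t1 = thread::spawn(move || {
        let _buf = b.read();
        thread::sleep(Duration::from_millis(100)); // let the writers queue up
        let _bg = g.read(); // blocks behind the queued bind_groups writer
    });

    // Thread 2: render-pass order (bind_groups first, then buffers).
    let (b, g) = (Arc::clone(&buffers), Arc::clone(&bind_groups));
    let t2 = thread::spawn(move || {
        let _bg = g.read();
        thread::sleep(Duration::from_millis(100)); // let the writers queue up
        let _buf = b.read(); // blocks behind the queued buffers writer
    });

    // Give both readers time to take their first read guard.
    thread::sleep(Duration::from_millis(20));

    // Writers now queue on both locks while the first read guards are
    // held. parking_lot's fair policy makes any *new* read() wait behind
    // these writers, closing the cycle:
    // t1 -> bind_groups writer -> t2 -> buffers writer -> t1.
    let b = Arc::clone(&buffers);
    thread::spawn(move || drop(b.write()));
    let g = Arc::clone(&bind_groups);
    thread::spawn(move || drop(g.write()));

    t1.join().unwrap(); // never returns: the program hangs here
    t2.join().unwrap();
}
```

Note that with an unfair RwLock the two read-only paths alone could not deadlock; it is the queued writers plus the fair policy that close the cycle.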
It sounds like rank::REGISTRY_STORAGE should be split into one rank per resource to catch mistakes like this, maybe?
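For comparison, here is a minimal sketch of what rank-checked locking could look like. RankedRwLock, HELD_RANK, and the rank constants are hypothetical names for illustration, not wgpu's actual rank machinery; the point is that an out-of-order acquisition panics immediately instead of deadlocking only under an unlucky schedule.

```rust
use std::cell::Cell;
use std::ops::Deref;

use parking_lot::{RwLock, RwLockReadGuard};

thread_local! {
    // Highest lock rank currently held by this thread (0 = none held).
    static HELD_RANK: Cell<u32> = Cell::new(0);
}

// Hypothetical per-resource ranks; a strict global order means every
// code path must take these locks in increasing rank.
pub const BUFFERS_RANK: u32 = 1;
pub const BIND_GROUPS_RANK: u32 = 2;

pub struct RankedRwLock<T> {
    rank: u32,
    inner: RwLock<T>,
}

pub struct RankedReadGuard<'a, T> {
    guard: RwLockReadGuard<'a, T>,
    prev_rank: u32,
}

impl<T> RankedRwLock<T> {
    pub fn new(rank: u32, value: T) -> Self {
        Self { rank, inner: RwLock::new(value) }
    }

    pub fn read(&self) -> RankedReadGuard<'_, T> {
        let prev = HELD_RANK.with(|r| r.get());
        // Panic on the first out-of-order acquisition, regardless of
        // whether the other thread happens to be racing us right now.
        assert!(
            self.rank > prev,
            "lock order violation: acquiring rank {} while holding rank {}",
            self.rank, prev
        );
        HELD_RANK.with(|r| r.set(self.rank));
        RankedReadGuard { guard: self.inner.read(), prev_rank: prev }
    }
}

impl<T> Deref for RankedReadGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.guard
    }
}

impl<T> Drop for RankedReadGuard<'_, T> {
    // Simplification: assumes guards are dropped in reverse acquisition
    // order, which Rust scoping normally gives you.
    fn drop(&mut self) {
        HELD_RANK.with(|r| r.set(self.prev_rank));
    }
}
```

With one rank per resource, whichever of the two code paths disagrees with the global order panics on its very first run, independent of thread timing and of whether a writer happens to be queued.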
Description
Might be a duplicate of one of the known deadlock issues in https://github.com/gfx-rs/wgpu/issues/5572; I'm not sure yet.
Repro steps
Closed source project, so not available.

Expected vs observed behavior
Expected: no deadlock. Observed: a deadlock.
Platform
Linux, Vulkan. wgpu 0.20.0 is affected (and is where the backtraces are from), but trunk also deadlocks in a similar way.