**sagudev** opened this issue 1 month ago
One thread acquires the tracker lock here: https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/resource.rs#L3625 and then tries to acquire https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/resource.rs#L886

The other thread tries to acquire https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/life.rs#L518 while it already holds the snatchable lock: https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/global.rs#L2206

A similar deadlock also happens if one thread is destroying a buffer instead of a texture, in https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/resource.rs#L3615
Thanks for filing this, and for the analysis.
If you look at the analysis results posted in #5586, you'll see that there's no shortage of cycles in that lock acquisition ordering graph. There are lots of ways for wgpu to deadlock right now, unfortunately. We used to have static deadlock prevention until arcanization removed it, and things went downhill fast.
I have some security-sensitive issues I need to get through first. I'm expecting to have them done by the first week in June, and then I can turn my attention back to deadlocks. I definitely encourage you or anyone else to tackle these issues themselves if you need them addressed sooner than that.
A similar deadlock happens in webgpu:api,validation,state,device_lost,destroy:queue,writeTexture,2d,uncompressed_format:* on https://github.com/servo/servo/pull/32354/commits/302954983dde0d3aa6044cefe14c8cd5a649ddb4, this time between the queue_write_texture and device poll threads:
queue_write_texture thread:

```
thread backtrace
thread #51, name = 'WGPU'
frame #0: 0x00007895a3b2725d libc.so.6`syscall at syscall.S:38
frame #1: 0x00005d660d9ed6d5 servo`parking_lot::raw_mutex::RawMutex::lock_slow at linux.rs:112:13
frame #2: 0x00005d660d9ed6bb servo`parking_lot::raw_mutex::RawMutex::lock_slow [inlined] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park at linux.rs:66:13
frame #3: 0x00005d660d9ed6b5 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:635:36
frame #4: 0x00005d660d9ed657 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:207:5
frame #5: 0x00005d660d9ed657 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:600:5
frame #6: 0x00005d660d9ed657 servo`parking_lot::raw_mutex::RawMutex::lock_slow(self=0x0000789581cbe5a8, timeout=Instant>{...}) at raw_mutex.rs:262:17
frame #7: 0x00005d660cb56f16 servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture [inlined] <parking_lot::raw_mutex::RawMutex as lock_api::mutex::RawMutex>::lock(self=0x0000789581cbe5a8) at raw_mutex.rs:72:13
frame #8: 0x00005d660cb56efc servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture at mutex.rs:223:9
frame #9: 0x00005d660cb56efc servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture [inlined] wgpu_core::lock::vanilla::Mutex<T>::lock at vanilla.rs:29:27
frame #10: 0x00005d660cb56ef3 servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture(self=<unavailable>, queue_id=<unavailable>, destination=<unavailable>, data=<unavailable>, data_layout=<unavailable>, size=<unavailable>) at queue.rs:926:48
```
device poll thread:

```
thread backtrace
thread #52, name = 'WGPU poller'
frame #0: 0x00007895a3b2725d libc.so.6`syscall at syscall.S:38
frame #1: 0x00005d660d9eb7a9 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at linux.rs:112:13
frame #2: 0x00005d660d9eb78c servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers [inlined] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park at linux.rs:66:13
frame #3: 0x00005d660d9eb76a servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:635:36
frame #4: 0x00005d660d9eb731 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:207:5
frame #5: 0x00005d660d9eb731 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:600:5
frame #6: 0x00005d660d9eb731 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers(self=0x0000789581cbe5a0, timeout=Instant>{...}, prev_value=0) at raw_rwlock.rs:1017:17
frame #7: 0x00005d660d9e91f1 servo`parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow(self=0x0000789581cbe5a0, timeout=Instant>{...}) at raw_rwlock.rs:647:9
frame #8: 0x00005d660cc15377 servo`wgpu_core::resource::Texture<A>::destroy [inlined] <parking_lot::raw_rwlock::RawRwLock as lock_api::rwlock::RawRwLock>::lock_exclusive(self=0x0000789581cbe5a0) at raw_rwlock.rs:73:26
frame #9: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy at rwlock.rs:500:9
frame #10: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy at vanilla.rs:85:33
frame #11: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy [inlined] wgpu_core::snatch::SnatchLock::write(self=0x0000789581cbe5a0) at snatch.rs:154:40
frame #12: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy(self=<unavailable>) at resource.rs:878:32
frame #13: 0x00005d660cbd08ee servo`wgpu_core::device::resource::Device<A>::maintain at resource.rs:3649:21
frame #14: 0x00005d660cbd066f servo`wgpu_core::device::resource::Device<A>::maintain(self=0x0000789581cbc010, fence_guard=wgpu_core::lock::vanilla::RwLockReadGuard<core::option::Option<wgpu_hal::vulkan::Fence>> @ 0x00007894fc5f9a80, maintain=<unavailable>, snatch_guard=<unavailable>) at resource.rs:476:13
frame #15: 0x00005d660cc8ae72 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices at global.rs:2148:39
frame #16: 0x00005d660cc8ae52 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices at global.rs:2188:21
frame #17: 0x00005d660cc8acc4 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices(self=0x0000789581c63010, force_wait=<unavailable>) at global.rs:2213:17
frame #18: 0x00005d660c9bf813 servo`webgpu::poll_thread::poll_all_devices(global=<unavailable>, more_work=0x00007894fc5fa04f, force_wait=<unavailable>, lock=()) at poll_thread.rs:57:11
```
The queue_write_texture thread tries to acquire https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/device/queue.rs#L926, which the poller thread already holds, acquired at https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/device/resource.rs#L3628. Meanwhile the poller thread tries to acquire https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/resource.rs#L886, which the queue_write_texture thread acquired at https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/device/queue.rs#L838.
**Description**
More deadlocks between queue.submit and poll_all_devices (both threads are running device.maintain).
queue.submit thread:
poll_all_devices thread:
**Repro steps**
Servo https://github.com/servo/servo/pull/32354/commits/5ef507ea786af95705f62883e7148695a99bd2ee when running
webgpu:api,validation,state,device_lost,destroy:createTexture,2d,uncompressed_format:*

**Platform**
wgpu-core d0a5e48aa7e84683114c3870051cc414ae92ac03