gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

[core] Deadlock between device.tracker and device.snatchable_lock #5737

Open sagudev opened 1 month ago

sagudev commented 1 month ago

Description: More deadlocks between queue.submit and poll_all_devices (both threads are running device.maintain).

queue.submit thread:

  thread #107, name = 'WGPU'
    frame #0: 0x00007ffff6f2725d libc.so.6`syscall at syscall.S:38
    frame #1: 0x000055555bd3a8c2 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at linux.rs:112:13
    frame #2: 0x000055555bd3a8a5 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers [inlined] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park at linux.rs:66:13
    frame #3: 0x000055555bd3a886 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:635:36
    frame #4: 0x000055555bd3a70b servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:207:5
    frame #5: 0x000055555bd3a675 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:600:5
    frame #6: 0x000055555bd3a675 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers(self=0x00007ffeda27b5a8, timeout=Instant>{...}, prev_value=0) at raw_rwlock.rs:1017:17
    frame #7: 0x000055555bd37c8e servo`parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow(self=0x00007ffeda27b5a8, timeout=Instant>{...}) at raw_rwlock.rs:647:9
    frame #8: 0x000055555b1586d9 servo`wgpu_core::snatch::SnatchLock::write [inlined] <parking_lot::raw_rwlock::RawRwLock as lock_api::rwlock::RawRwLock>::lock_exclusive(self=0x00007ffeda27b5a8) at raw_rwlock.rs:73:26
    frame #9: 0x000055555b1586cb servo`wgpu_core::snatch::SnatchLock::write at rwlock.rs:500:9
    frame #10: 0x000055555b1586cb servo`wgpu_core::snatch::SnatchLock::write [inlined] wgpu_core::lock::vanilla::RwLock<T>::write at vanilla.rs:85:33
    frame #11: 0x000055555b1586cb servo`wgpu_core::snatch::SnatchLock::write(self=0x00007ffeda27b5a8) at snatch.rs:154:40
    frame #12: 0x000055555b0c0e89 servo`wgpu_core::resource::Texture<A>::destroy(self=<unavailable>) at resource.rs:878:32
    frame #13: 0x000055555b0131aa servo`wgpu_core::device::resource::Device<A>::maintain at resource.rs:3649:21
    frame #14: 0x000055555b012c12 servo`wgpu_core::device::resource::Device<A>::maintain(self=0x00007ffeda279010, fence_guard=wgpu_core::lock::vanilla::RwLockReadGuard<core::option::Option<wgpu_hal::vulkan::Fence>> @ 0x00007fff505f7760, maintain=<unavailable>, snatch_guard=<unavailable>) at resource.rs:476:13
    frame #15: 0x000055555b046a79 servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_submit(self=<unavailable>, queue_id=<unavailable>, command_buffer_ids=<unavailable>) at queue.rs:1494:23

poll_all_devices thread:

  thread #108, name = 'WGPU poller'
    frame #0: 0x00007ffff6f2725d libc.so.6`syscall at syscall.S:38
    frame #1: 0x000055555bd3c477 servo`parking_lot::raw_mutex::RawMutex::lock_slow at linux.rs:112:13
    frame #2: 0x000055555bd3c45a servo`parking_lot::raw_mutex::RawMutex::lock_slow [inlined] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park at linux.rs:66:13
    frame #3: 0x000055555bd3c454 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:635:36
    frame #4: 0x000055555bd3c3f9 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:207:5
    frame #5: 0x000055555bd3c3f9 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:600:5
    frame #6: 0x000055555bd3c3f9 servo`parking_lot::raw_mutex::RawMutex::lock_slow(self=0x00007ffeda27b5b0, timeout=Instant>{...}) at raw_mutex.rs:262:17
    frame #7: 0x000055555b148a84 servo`wgpu_core::device::life::LifetimeTracker<A>::triage_suspected [inlined] <parking_lot::raw_mutex::RawMutex as lock_api::mutex::RawMutex>::lock(self=0x00007ffeda27b5b0) at raw_mutex.rs:72:13
    frame #8: 0x000055555b148a76 servo`wgpu_core::device::life::LifetimeTracker<A>::triage_suspected at mutex.rs:223:9
    frame #9: 0x000055555b148a76 servo`wgpu_core::device::life::LifetimeTracker<A>::triage_suspected at vanilla.rs:29:27
    frame #10: 0x000055555b148a76 servo`wgpu_core::device::life::LifetimeTracker<A>::triage_suspected [inlined] wgpu_core::device::life::LifetimeTracker<A>::triage_suspected_render_bundles(self=<unavailable>, trackers=0x00007ffeda27b5b0) at life.rs:501:37
    frame #11: 0x000055555b148a76 servo`wgpu_core::device::life::LifetimeTracker<A>::triage_suspected(self=0x00007ffeda27b888, trackers=0x00007ffeda27b5b0) at life.rs:786:9
    frame #12: 0x000055555b16bfd7 servo`wgpu_core::device::resource::Device<A>::maintain(self=0x00007ffeda279010, fence_guard=wgpu_core::lock::vanilla::RwLockReadGuard<core::option::Option<wgpu_hal::vulkan::Fence>> @ r15, maintain=<unavailable>, snatch_guard=<unavailable>) at resource.rs:438:9
    frame #13: 0x000055555b156789 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices at global.rs:2148:39
    frame #14: 0x000055555b156769 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices at global.rs:2188:21
    frame #15: 0x000055555b15661f servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices(self=<unavailable>, force_wait=<unavailable>) at global.rs:2213:17

Repro steps: Servo at https://github.com/servo/servo/pull/32354/commits/5ef507ea786af95705f62883e7148695a99bd2ee, when running webgpu:api,validation,state,device_lost,destroy:createTexture,2d,uncompressed_format:*

Platform: wgpu-core d0a5e48aa7e84683114c3870051cc414ae92ac03

sagudev commented 1 month ago

queue.submit thread

Acquires the tracker lock here: https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/resource.rs#L3625 and then, while still holding it, tries to acquire the snatchable lock for writing here: https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/resource.rs#L886

poll_all_devices thread

Tries to acquire the tracker lock here: https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/life.rs#L518 while it already holds the snatchable lock, acquired here: https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/global.rs#L2206
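
To illustrate the inversion described above (one thread holds the tracker lock and wants the snatchable lock, the other holds the snatchable lock and wants the tracker lock), here is a minimal sketch using only `std::sync`. The `Device` struct and both functions are hypothetical stand-ins, not wgpu-core's real types, and the sketch only shows the general remedy of having every path take the two locks in the same order; it is not a proposed patch.

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Hypothetical stand-in for the device state; not wgpu-core's real types.
struct Device {
    snatchable_lock: RwLock<()>,
    trackers: Mutex<u32>,
}

/// Stand-in for the submit path: snatchable lock (write) first, then trackers.
fn maintain_like_submit(dev: &Device) {
    let _snatch = dev.snatchable_lock.write().unwrap();
    let mut trackers = dev.trackers.lock().unwrap();
    *trackers += 1;
}

/// Stand-in for the poll path: snatchable lock (read) first, then trackers.
/// Because both paths use the same order, neither can hold one lock while
/// waiting on a thread that holds the other.
fn maintain_like_poll(dev: &Device) {
    let _snatch = dev.snatchable_lock.read().unwrap();
    let mut trackers = dev.trackers.lock().unwrap();
    *trackers += 1;
}

/// Runs both paths concurrently and returns the final counter value.
fn run_with_consistent_order(iterations: u32) -> u32 {
    let dev = Arc::new(Device {
        snatchable_lock: RwLock::new(()),
        trackers: Mutex::new(0),
    });

    let submit = {
        let dev = Arc::clone(&dev);
        thread::spawn(move || {
            for _ in 0..iterations {
                maintain_like_submit(&dev);
            }
        })
    };
    let poll = {
        let dev = Arc::clone(&dev);
        thread::spawn(move || {
            for _ in 0..iterations {
                maintain_like_poll(&dev);
            }
        })
    };

    submit.join().unwrap();
    poll.join().unwrap();

    let total = *dev.trackers.lock().unwrap();
    total
}

fn main() {
    println!("increments: {}", run_with_consistent_order(1000));
}
```

Swapping the two lock acquisitions in just one of the functions would reproduce the shape of the reported deadlock, which is why the sketch keeps the order identical in both.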

sagudev commented 1 month ago

A similar deadlock also happens if one thread is destroying a buffer instead of a texture, at https://github.com/gfx-rs/wgpu/blob/9e0fd17726ecda0cc88e8a20f911de60f1017b1a/wgpu-core/src/device/resource.rs#L3615

jimblandy commented 1 month ago

Thanks for filing this, and for the analysis.

If you look at the analysis results posted in #5586, you'll see that there's no shortage of cycles in that lock acquisition ordering graph. There are lots of ways for wgpu to deadlock right now, unfortunately. We used to have static deadlock prevention until arcanization removed it, and things went downhill fast.
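
As a rough illustration of what a cycle in a lock acquisition ordering graph means, the sketch below records edges of the form "lock A was held while lock B was acquired" and checks them for a cycle with a depth-first search. The function names and the `other_lock` label are hypothetical; this is not the tool used for the analysis in #5586.

```rust
use std::collections::{HashMap, HashSet};

/// Depth-first search step: returns true if a cycle is reachable from `node`.
fn visit<'a>(
    node: &'a str,
    graph: &HashMap<&'a str, Vec<&'a str>>,
    on_stack: &mut HashSet<&'a str>,
    done: &mut HashSet<&'a str>,
) -> bool {
    if on_stack.contains(node) {
        return true; // reached a node that is still being explored: cycle
    }
    if !done.insert(node) {
        return false; // already fully explored from an earlier start
    }
    on_stack.insert(node);
    let found = graph[node]
        .iter()
        .any(|&next| visit(next, graph, on_stack, done));
    on_stack.remove(node);
    found
}

/// Each edge is (held, acquired): `acquired` was locked while `held` was held.
/// Returns true if the edges permit a circular wait.
fn has_cycle<'a>(edges: &[(&'a str, &'a str)]) -> bool {
    let mut graph: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &(held, acquired) in edges {
        graph.entry(held).or_default().push(acquired);
        graph.entry(acquired).or_default();
    }
    let nodes: Vec<&'a str> = graph.keys().copied().collect();
    let mut on_stack = HashSet::new();
    let mut done = HashSet::new();
    nodes
        .into_iter()
        .any(|n| visit(n, &graph, &mut on_stack, &mut done))
}

fn main() {
    // The two orders reported in this issue form a two-node cycle.
    let reported = [
        ("trackers", "snatchable_lock"),
        ("snatchable_lock", "trackers"),
    ];
    println!("cycle in reported orders: {}", has_cycle(&reported));
}
```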

I have some security-sensitive issues I need to get through first. I'm expecting to have them done by the first week in June, and then I can turn my attention back to deadlocks. I definitely encourage you or anyone else to tackle these issues themselves if you need them addressed sooner than that.

sagudev commented 3 weeks ago

A similar deadlock happens in webgpu:api,validation,state,device_lost,destroy:queue,writeTexture,2d,uncompressed_format:* on https://github.com/servo/servo/pull/32354/commits/302954983dde0d3aa6044cefe14c8cd5a649ddb4, this time between queue_write_texture and device poll:

queue_write_texture thread:

thread backtrace
  thread #51, name = 'WGPU'
    frame #0: 0x00007895a3b2725d libc.so.6`syscall at syscall.S:38
    frame #1: 0x00005d660d9ed6d5 servo`parking_lot::raw_mutex::RawMutex::lock_slow at linux.rs:112:13
    frame #2: 0x00005d660d9ed6bb servo`parking_lot::raw_mutex::RawMutex::lock_slow [inlined] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park at linux.rs:66:13
    frame #3: 0x00005d660d9ed6b5 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:635:36
    frame #4: 0x00005d660d9ed657 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:207:5
    frame #5: 0x00005d660d9ed657 servo`parking_lot::raw_mutex::RawMutex::lock_slow at parking_lot.rs:600:5
    frame #6: 0x00005d660d9ed657 servo`parking_lot::raw_mutex::RawMutex::lock_slow(self=0x0000789581cbe5a8, timeout=Instant>{...}) at raw_mutex.rs:262:17
    frame #7: 0x00005d660cb56f16 servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture [inlined] <parking_lot::raw_mutex::RawMutex as lock_api::mutex::RawMutex>::lock(self=0x0000789581cbe5a8) at raw_mutex.rs:72:13
    frame #8: 0x00005d660cb56efc servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture at mutex.rs:223:9
    frame #9: 0x00005d660cb56efc servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture [inlined] wgpu_core::lock::vanilla::Mutex<T>::lock at vanilla.rs:29:27
    frame #10: 0x00005d660cb56ef3 servo`wgpu_core::device::queue::<impl wgpu_core::global::Global>::queue_write_texture(self=<unavailable>, queue_id=<unavailable>, destination=<unavailable>, data=<unavailable>, data_layout=<unavailable>, size=<unavailable>) at queue.rs:926:48

device poll thread:

thread backtrace
  thread #52, name = 'WGPU poller'
    frame #0: 0x00007895a3b2725d libc.so.6`syscall at syscall.S:38
    frame #1: 0x00005d660d9eb7a9 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at linux.rs:112:13
    frame #2: 0x00005d660d9eb78c servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers [inlined] <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park at linux.rs:66:13
    frame #3: 0x00005d660d9eb76a servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:635:36
    frame #4: 0x00005d660d9eb731 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:207:5
    frame #5: 0x00005d660d9eb731 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers at parking_lot.rs:600:5
    frame #6: 0x00005d660d9eb731 servo`parking_lot::raw_rwlock::RawRwLock::wait_for_readers(self=0x0000789581cbe5a0, timeout=Instant>{...}, prev_value=0) at raw_rwlock.rs:1017:17
    frame #7: 0x00005d660d9e91f1 servo`parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow(self=0x0000789581cbe5a0, timeout=Instant>{...}) at raw_rwlock.rs:647:9
    frame #8: 0x00005d660cc15377 servo`wgpu_core::resource::Texture<A>::destroy [inlined] <parking_lot::raw_rwlock::RawRwLock as lock_api::rwlock::RawRwLock>::lock_exclusive(self=0x0000789581cbe5a0) at raw_rwlock.rs:73:26
    frame #9: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy at rwlock.rs:500:9
    frame #10: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy at vanilla.rs:85:33
    frame #11: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy [inlined] wgpu_core::snatch::SnatchLock::write(self=0x0000789581cbe5a0) at snatch.rs:154:40
    frame #12: 0x00005d660cc15369 servo`wgpu_core::resource::Texture<A>::destroy(self=<unavailable>) at resource.rs:878:32
    frame #13: 0x00005d660cbd08ee servo`wgpu_core::device::resource::Device<A>::maintain at resource.rs:3649:21
    frame #14: 0x00005d660cbd066f servo`wgpu_core::device::resource::Device<A>::maintain(self=0x0000789581cbc010, fence_guard=wgpu_core::lock::vanilla::RwLockReadGuard<core::option::Option<wgpu_hal::vulkan::Fence>> @ 0x00007894fc5f9a80, maintain=<unavailable>, snatch_guard=<unavailable>) at resource.rs:476:13
    frame #15: 0x00005d660cc8ae72 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices at global.rs:2148:39
    frame #16: 0x00005d660cc8ae52 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices at global.rs:2188:21
    frame #17: 0x00005d660cc8acc4 servo`wgpu_core::device::global::<impl wgpu_core::global::Global>::poll_all_devices(self=0x0000789581c63010, force_wait=<unavailable>) at global.rs:2213:17
    frame #18: 0x00005d660c9bf813 servo`webgpu::poll_thread::poll_all_devices(global=<unavailable>, more_work=0x00007894fc5fa04f, force_wait=<unavailable>, lock=()) at poll_thread.rs:57:11

The queue_write_texture thread tries to acquire https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/device/queue.rs#L926, which is held by the poller thread (acquired at https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/device/resource.rs#L3628). Meanwhile, the poller thread tries to acquire https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/resource.rs#L886, which the queue_write_texture thread acquired at https://github.com/gfx-rs/wgpu/blob/be4eabc71bbbf5ab89d8cb74f0b894d374793850/wgpu-core/src/device/queue.rs#L838
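
An inversion like the one above can in principle be caught at runtime, before it ever deadlocks, by giving each lock a rank and refusing to acquire a lock whose rank is not higher than every lock the current thread already holds. The sketch below shows that idea with a thread-local list of held ranks. The rank values, `acquire`, and `RankGuard` are all hypothetical and only track ranks; they do not wrap real locks or describe wgpu-core's own code.

```rust
use std::cell::RefCell;

// Hypothetical ranks: a lower rank must always be acquired first.
const SNATCHABLE_LOCK: u32 = 1;
const TRACKERS: u32 = 2;

thread_local! {
    /// Ranks of the locks the current thread holds right now.
    static HELD: RefCell<Vec<u32>> = RefCell::new(Vec::new());
}

/// Marks a rank as held; removes it again when dropped.
struct RankGuard(u32);

impl Drop for RankGuard {
    fn drop(&mut self) {
        HELD.with(|held| {
            let mut held = held.borrow_mut();
            if let Some(pos) = held.iter().rposition(|&r| r == self.0) {
                held.remove(pos);
            }
        });
    }
}

/// Records an acquisition of `rank`, or reports an ordering violation if the
/// thread already holds a lock of equal or higher rank.
fn acquire(rank: u32) -> Result<RankGuard, String> {
    HELD.with(|held| {
        let mut held = held.borrow_mut();
        if let Some(&max) = held.iter().max() {
            if rank <= max {
                return Err(format!(
                    "rank {rank} requested while holding rank {max}"
                ));
            }
        }
        held.push(rank);
        Ok(RankGuard(rank))
    })
}

fn main() {
    // Allowed order: snatchable lock, then trackers.
    {
        let _snatch = acquire(SNATCHABLE_LOCK).unwrap();
        let _trackers = acquire(TRACKERS).unwrap();
    }
    // Inverted order is reported instead of waiting forever.
    let _trackers = acquire(TRACKERS).unwrap();
    if let Err(msg) = acquire(SNATCHABLE_LOCK) {
        println!("ordering violation: {msg}");
    }
}
```

A check of this kind reports a bad order on the first thread that attempts it, even when no second thread is racing, which makes such bugs reproducible in single-threaded tests.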