BurtonQin / lockbud

Statically detect memory, concurrency bugs and possible panic locations for Rust.
BSD 3-Clause "New" or "Revised" License
445 stars 25 forks source link

Detect Rayon deadlocks #55

Open michaelsproul opened 10 months ago

michaelsproul commented 10 months ago

Rayon has this footgun for lock usage where it can deadlock if both of the following conditions are true:

  1. Lock L is held while calling into Rayon (e.g. calling rayon::join while holding a mutex).
  2. Lock L is accessed from within Rayon tasks (e.g. calling mutex.lock() within par_iter).

Deadlocks can occur in multiple ways: if the first thread that is holding the lock calls into Rayon and steals some of the jobs of type (2) which require the lock, then those jobs will not complete because of the circular dependency (they are waiting for the thread to release the lock, but the thread is waiting for them to complete because it is waiting for the Rayon call to return). There's a similar variant if the thread holding the lock is itself a Rayon thread, in which case the deadlock occurs because the thread tries to re-lock the lock it already holds.

How could lockbud help? It would be prone to false positives, but it could detect any calls to Rayon functions that occur while holding a Mutex/RwLock. I think this is similar to detecting double-locking.

Unfortunately it seems that a deadlock-avoiding mutex for Rayon is not likely to be implemented any time soon. The naive approach of yielding when locking fails does not work because there is no "task" that can be yielded to which will unblock execution – we need the current task to finish to return control to the caller (see https://github.com/rayon-rs/rayon/issues/592#issuecomment-491454549). Unlike in async code, there is no way to interrupt a Rayon task while it is running, because it is just an ordinary function. This suggests it also wouldn't be feasible to port tokio::sync::Mutex which uses wakers and other futures-specific features to work around this problem in the async context (see https://github.com/tokio-rs/tokio/blob/2400769b54fd31186b07bfbec277e846e8d64c16/tokio/src/sync/batch_semaphore.rs#L2-L17).