Open jamesbornholt opened 2 years ago
@bkragl Currently, Shuttle only checks full system deadlocks (i.e., no task/thread is runnable). It seems you're suggesting we check partial deadlocks (i.e., a cycle in the Resource Allocation Graph)? That would require identifying all the shared resources (not just Mutex and RwLock).
Yes, I understand all that. To be clear, I don't think the current check is limited in its "deadlock detecting power". In case of a partial deadlock, the execution will eventually unfold to a full deadlock, unless it panics for a different reason or is in an infinite loop (which also indicates a different issue, as Shuttle test harnesses are usually expected to terminate).
There are two separate points: the cost of deadlock detection (i.e., waiting for a partial deadlock to become a full deadlock) and easy of debugging. To simplify the debugging of a deadlock, James's suggestion with stack traces sounds good. What we could also do is to print a truncated schedule that stops right after the last step of any task in the deadlock set. So when replaying the schedule, we get the experience of detecting the deadlock immediately (and won't accidentally debug ahead of the partial deadlock). Now the only remaining question is, whether it makes sense to optimize the initial detection of the partial deadlock (in practice). I don't have any evidence, so I'm not sure. In any event, the tracking in an RAG could be limited to some resources that we identify.
From @bkragl in #66:
An intermediate step would be to print stack traces for deadlocked threads, so at least you know where to start the debugging process (e.g., which locks are involved).