awslabs / shuttle

Shuttle is a library for testing concurrent Rust code
Apache License 2.0

Provide more useful feedback for deadlocks #67

Open jamesbornholt opened 2 years ago

jamesbornholt commented 2 years ago

From @bkragl in #66:

One thing I was thinking about is whether it makes sense to have a feature in Shuttle that does precise on-the-fly deadlock detection. Right now there might be a deadlock among some tasks early on in an execution, but we need to wait until the end of the execution to detect it. It might be helpful for debugging to stop the execution right when the deadlock happens.

The nice thing about the current check is that it is completely generic. What I'm proposing would require knowledge about the synchronization primitives (e.g., a combined resource allocation graph for both Mutex and RwLock).

An intermediate step would be to print stack traces for deadlocked threads, so at least you know where to start the debugging process (e.g., which locks are involved).
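The stack-trace suggestion could look roughly like the following sketch (all names here are hypothetical, not part of Shuttle's actual executor): each task records a backtrace at the moment it blocks, and on a detected deadlock the executor dumps the recorded traces instead of just panicking.

```rust
use std::backtrace::Backtrace;
use std::collections::HashMap;

// Hypothetical sketch: record a backtrace whenever a task blocks, so that a
// detected deadlock can report where each blocked task got stuck.
struct BlockedTasks {
    // task id -> backtrace captured at the point the task blocked
    traces: HashMap<usize, Backtrace>,
}

impl BlockedTasks {
    fn new() -> Self {
        Self { traces: HashMap::new() }
    }

    fn on_block(&mut self, task_id: usize) {
        // force_capture works even when RUST_BACKTRACE is not set
        self.traces.insert(task_id, Backtrace::force_capture());
    }

    fn on_unblock(&mut self, task_id: usize) {
        self.traces.remove(&task_id);
    }

    fn report_deadlock(&self) -> String {
        let mut out = String::from("deadlock detected; blocked tasks:\n");
        for (id, trace) in &self.traces {
            out.push_str(&format!("task {id} blocked at:\n{trace}\n"));
        }
        out
    }
}
```

With this, the existing "no task is runnable" check would only change in what it prints, not in what it detects.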

jorajeev commented 2 years ago

@bkragl Currently, Shuttle only checks full system deadlocks (i.e., no task/thread is runnable). It seems you're suggesting we check partial deadlocks (i.e., a cycle in the Resource Allocation Graph)? That would require identifying all the shared resources (not just Mutex and RwLock).
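For concreteness, the cycle check being discussed could be sketched as a wait-for graph: an edge from task A to task B whenever A is blocked on a resource currently held by B (for an RwLock, one edge per holder). A cycle means the tasks on it can never make progress, even while other tasks remain runnable. This is an illustrative sketch under assumed task-id bookkeeping, not Shuttle code:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical wait-for graph for partial-deadlock detection.
#[derive(Default)]
struct WaitForGraph {
    // waiter task id -> holder task ids it is blocked on
    edges: HashMap<usize, Vec<usize>>,
}

impl WaitForGraph {
    fn add_wait(&mut self, waiter: usize, holder: usize) {
        self.edges.entry(waiter).or_default().push(holder);
    }

    // DFS-based cycle check; a cycle is a set of tasks each waiting on the next.
    fn has_cycle(&self) -> bool {
        let mut visited = HashSet::new();
        let mut on_stack = HashSet::new();
        self.edges
            .keys()
            .any(|&n| self.dfs(n, &mut visited, &mut on_stack))
    }

    fn dfs(
        &self,
        node: usize,
        visited: &mut HashSet<usize>,
        on_stack: &mut HashSet<usize>,
    ) -> bool {
        if on_stack.contains(&node) {
            return true; // back edge: found a cycle
        }
        if !visited.insert(node) {
            return false; // already fully explored
        }
        on_stack.insert(node);
        let cyclic = self
            .edges
            .get(&node)
            .map_or(false, |succs| succs.iter().any(|&s| self.dfs(s, visited, on_stack)));
        on_stack.remove(&node);
        cyclic
    }
}
```

The cost jorajeev points out is in maintaining the edges: every blocking primitive (not just Mutex and RwLock) would need to report who holds what.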

bkragl commented 2 years ago

Yes, I understand all that. To be clear, I don't think the current check is limited in its "deadlock detecting power". In case of a partial deadlock, the execution will eventually unfold to a full deadlock, unless it panics for a different reason or is in an infinite loop (which also indicates a different issue, as Shuttle test harnesses are usually expected to terminate).

There are two separate points: the cost of deadlock detection (i.e., waiting for a partial deadlock to become a full deadlock) and ease of debugging. To simplify the debugging of a deadlock, James's suggestion of stack traces sounds good. We could also print a truncated schedule that stops right after the last step of any task in the deadlock set. When replaying that schedule, we get the experience of detecting the deadlock immediately (and won't accidentally debug past the partial deadlock). The only remaining question is whether it makes sense to optimize the initial detection of the partial deadlock (in practice). I don't have any evidence, so I'm not sure. In any event, the tracking in an RAG could be limited to some resources that we identify.
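The truncated-schedule idea can be sketched in a few lines, assuming a recorded schedule is a sequence of task ids and the deadlocked set is known (both names and the representation are hypothetical):

```rust
// Hypothetical sketch: cut a recorded schedule right after the last step taken
// by any task in the deadlocked set, so a replay stops at the point where the
// partial deadlock is already established.
fn truncate_at_deadlock(schedule: &[usize], deadlocked: &[usize]) -> Vec<usize> {
    match schedule.iter().rposition(|t| deadlocked.contains(t)) {
        Some(i) => schedule[..=i].to_vec(), // keep up to and including that step
        None => schedule.to_vec(),          // deadlocked tasks never ran
    }
}
```

For example, with schedule `[0, 1, 2, 1, 0, 2]` and deadlock set `{1}`, the truncated schedule is `[0, 1, 2, 1]`: everything after task 1's last step is only other tasks unfolding the partial deadlock into a full one.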