JGCRI / cassandra

Human-earth system multi-scale model coupling framework
Other
6 stars 3 forks source link

Deadlock detection #27

Open rplzzz opened 5 years ago

rplzzz commented 5 years ago

I have an idea for detecting deadlock in a setup. What if we kept a dictionary of all of the components, with the associated values being status indicators. On initialization, each component would set its status to ready. On exit it would set it to finished. In the meantime, whenever it enters the blocking section of fetch, it would set its status to waiting, and on release it would set the status back to ready. (Needless to say, all this status updating would be handled in the component base class methods.) Then we could have a monitor thread that would periodically (every five minutes or so) wake up and check the table. If there isn't at least one ready thread, then the system has deadlocked, and we need to abort.

A more sophisticated version of this scheme would be to store the capability being waited on along with the wait status. The monitor could then use this information to build the graph of dependencies among the waiting threads. If there is a cycle in the graph, then deadlock has occurred, even if there are still ready threads, and again, we need to abort the run.

calebbraun commented 5 years ago

With the first idea, how could we avoid the possibility of having the monitor thread reporting false deadlock? It could read the status of all threads as waiting if it accessed the dictionary in the middle of another thread updating it.

Overall the second idea seems cleaner to me anyway, and not too difficult to implement.

rplzzz commented 5 years ago

@calebbraun if all the threads end up in a waiting state at the same time, then none of them is ever going to emerge (assuming that we set the status only after we have decided to wait); however, I agree with the gist of your comment. I didn't think of the graph solution until I had already written the other stuff, and I was too lazy to delete the first paragraph.