[Bug] Deadlock with rayon usage

HarukaMa commented 8 months ago

🐛 Bug Report

There is a rayon-related deadlock in snarkOS, but I'm not quite sure which situation it actually is:

1. Using rayon parallel iterators while holding a Mutex or an RwLock write lock (this case). See multiple discussions like this and this.
2. Using rayon with blocking calls (not sure if spawn_blocking applies here). Maybe see this or this.

I think it's probably the first one: in a core dump taken during a deadlock, I saw a write lock being held while the node was stuck waiting on a read lock. Here is the full backtrace of all threads. (It's a large text file, as rayon tends to generate deep stacks. The file is actually a .7z archive but had to be named .zip to upload here.) Notice that thread 69 holds the write lock on vm.process while trying to advance a block, while many other threads are trying to validate incoming unconfirmed transactions and need a read lock.
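To illustrate the first situation: one commonly described mechanism is that a thread holding a write lock hands work to the rayon pool, while the pool's workers are already blocked waiting for a read lock on the same data, so the lock holder's jobs never get to run. A minimal sketch of the usual way around this is to do the parallel work with no lock held and take the write lock only for a short commit step. This is not snarkOS code, and it has not been confirmed as the cause or the fix here. `Process`, `advance_block`, `validate`, and `expensive` are hypothetical names, and std scoped threads stand in for rayon's pool so the example has no dependencies.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

/// Hypothetical stand-in for the shared state behind `vm.process`.
pub struct Process {
    pub programs: RwLock<HashMap<u32, u64>>,
}

/// Hypothetical expensive per-item work (e.g. checking one transaction).
fn expensive(id: u32) -> u64 {
    (0..1_000u64).fold(u64::from(id), |acc, x| acc.wrapping_mul(31).wrapping_add(x))
}

/// Advance a block: run the parallel work first with no lock held,
/// then take the write lock only for the short commit step.
pub fn advance_block(process: &Process, ids: &[u32]) {
    // Phase 1: parallel computation. No lock is held here, so readers
    // (and any worker threads they occupy) are never blocked by us.
    let results: Vec<(u32, u64)> = thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(2)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|&id| (id, expensive(id))).collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Phase 2: short critical section. No parallel work is started
    // while the write guard is alive.
    let mut programs = process.programs.write().expect("lock poisoned");
    programs.extend(results);
}

/// Reader path (e.g. validating an unconfirmed transaction): it only
/// needs a read lock, and holds it just for the lookup.
pub fn validate(process: &Process, id: u32) -> bool {
    process.programs.read().expect("lock poisoned").contains_key(&id)
}
```

With this shape, a caller would build a `Process`, call `advance_block(&process, &ids)`, and concurrent `validate` calls would only ever wait for the brief commit in phase 2. The trade-off is that the results have to be buffered before being committed, which may not fit every code path.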

Steps to Reproduce

Not sure. Run the node with a large number of connections?

Expected Behavior

The node should not deadlock.

Your Environment

ljedrz commented 8 months ago

This one feels like it's going to be tricky, but I'll try to investigate it soon.

raychu86 commented 5 months ago

We did some initial passes but were unable to reproduce this. We're putting it on a lower priority, but will keep an eye out and revisit it.

ljedrz commented 3 months ago

@HarukaMa I've prepared a branch that's aimed at detecting deadlocks; could you try it out with one of your nodes under a workload that's likely to cause a stall, and then provide me with some of its latest logs?

vicsn commented 1 month ago

Experienced another validator deadlock on a low-resourced test network that was being spammed with transactions and deployments. Evidence that it was a deadlock: the validator's process would not terminate after being sent a SIGTERM.
