Open HarukaMa opened 8 months ago
This one feels like it's going to be tricky, but I'll try to investigate it soon.
We did initial passes, but were unable to reproduce this. Putting this on a lower priority, but will keep an eye out and revisit this.
@HarukaMa I've prepared a branch that's aimed at detecting deadlocks; could you try it out with one of your nodes under a workload that's likely to cause a stall, and then provide me with some of its latest logs?
Experienced another validator deadlock on a low-resourced test network which was spammed with transactions and deployments. Evidence of it being a deadlock was that the validator's process would not terminate after being sent a SIGTERM.
🐛 Bug Report
There is a rayon-related deadlock in snarkOS, but I'm not quite sure which situation it actually is:
spawn_blocking applies here). Maybe see this or this.
I think it's probably the first one, as in a core dump taken during the deadlock, I did see a write lock being acquired while the node was stuck at a read lock. Here is the full backtrace of all threads. (It's a large text file, as rayon tends to generate deep stacks. The file is actually a .7z archive but has to be named .zip to upload here.) Notice that thread 69 has the write lock to vm.process while trying to advance a block, while there are many threads trying to validate incoming unconfirmed transactions that need a read lock.
Steps to Reproduce
Not sure. Run the node with a large number of connections?
Expected Behavior
The node should not deadlock.
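To make the suspected lock interaction concrete, here is a minimal sketch using only the Rust standard library. It is not snarkOS code: `Process`, `try_validate`, and `demo` are hypothetical names, and a plain `std::sync::RwLock` stands in for whatever lock actually guards vm.process. It shows that a reader cannot get in while a "block advancement" writer holds the lock, and that it can once the writer releases it. Using `try_read` here is only a way to observe the contention without hanging, not a claim about how the node should be fixed.

```rust
use std::sync::{Arc, RwLock, TryLockError};
use std::thread;

/// Hypothetical stand-in for the state guarded by the lock (not the real `vm.process` type).
struct Process {
    height: u32,
}

/// Tries to take a read lock without blocking, as a validation task might.
/// Returns the observed height, or `None` if a writer currently holds the lock.
fn try_validate(process: &RwLock<Process>) -> Option<u32> {
    match process.try_read() {
        Ok(guard) => Some(guard.height),
        Err(TryLockError::WouldBlock) => None,
        // A poisoned lock still yields a usable guard; read through it.
        Err(TryLockError::Poisoned(e)) => Some(e.into_inner().height),
    }
}

/// Runs the scenario and returns
/// (result while the writer holds the lock, result after it is released).
fn demo() -> (Option<u32>, Option<u32>) {
    let process = Arc::new(RwLock::new(Process { height: 0 }));

    // "Block advancement": take the write lock and keep holding it.
    let mut writer = process.write().unwrap();
    writer.height += 1;

    // A "validator" on another thread asks for read access while the writer is active.
    let reader_view = Arc::clone(&process);
    let during = thread::spawn(move || try_validate(&reader_view))
        .join()
        .unwrap();

    // Once the write guard is dropped, readers can proceed.
    drop(writer);
    let after = try_validate(&process);

    (during, after)
}

fn main() {
    let (during, after) = demo();
    println!("while writer holds lock: {:?}", during);
    println!("after writer releases:   {:?}", after);
}
```

If the reader used a blocking `read()` instead and the writer in turn waited on work that only those blocked reader threads could run, neither side would make progress. That is the kind of cycle the backtrace above appears to show.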
Your Environment