morph-dev opened 3 weeks ago
These lines in portalnet/src/gossip.rs look like potential candidates for the deadlock:
```rust
let permit = match utp_controller {
    Some(ref utp_controller) => match utp_controller.get_outbound_semaphore() {
        Some(permit) => Some(permit),
        None => continue,
    },
    None => None,
};
```
try_acquire_owned() doesn't block, so that rules out that candidate.
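For reference, a minimal std-only sketch of the non-blocking try-acquire semantics (std has no semaphore type, so `Mutex::try_lock` stands in for tokio's `Semaphore::try_acquire_owned` here):

```rust
use std::sync::Mutex;

fn main() {
    let lock = Mutex::new(());

    // Hold the lock, as a concurrent uTP transfer would hold a permit.
    let guard = lock.try_lock().expect("uncontended, must succeed");

    // A second try_lock returns Err immediately instead of blocking --
    // the same semantics as try_acquire_owned, which is why this code
    // path can't be the one sitting in a deadlock.
    assert!(lock.try_lock().is_err());

    drop(guard);
    assert!(lock.try_lock().is_ok());
    println!("ok");
}
```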
Two more nodes got stuck while running my last PR (which added log messages), so I was able to conclude that the deadlock is happening at this line: portalnet/src/gossip.rs#L61
```rust
let kbuckets = kbuckets.read();
```
The documentation says:

> Note that attempts to recursively acquire a read lock on a `RwLock` when the current thread already holds one may result in a deadlock.
With that being said, it seems that either this thread already holds the read lock, or something else is stuck in a deadlock and is holding the write lock indefinitely. I'm more inclined to think it's the former, considering that the last log message is always the same (otherwise I would expect something else to get stuck as well, leading to a different message).
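To illustrate why a recursive `read()` is dangerous, here is a self-contained sketch using `std::sync::RwLock` as a stand-in (the actual lock type in trin may differ, e.g. parking_lot's). While any read guard is alive, a writer blocks; with a fair/writer-preferring lock, that queued writer in turn blocks new readers, so a second `read()` on the same thread waits on the writer, which waits on us:

```rust
use std::sync::RwLock;

fn main() {
    // Stand-in for the kbuckets table; the element type is irrelevant here.
    let kbuckets = RwLock::new(vec![0u8; 4]);

    // Hold a read guard, as the gossip code does around gossip.rs#L61.
    let read_guard = kbuckets.read().unwrap();

    // A writer cannot acquire the lock while the read guard is alive.
    // If the lock is fair, that queued writer then stalls any further
    // readers too -- including a recursive read() on this same thread,
    // which would deadlock.
    assert!(kbuckets.try_write().is_err());

    drop(read_guard);
    assert!(kbuckets.try_write().is_ok());
    println!("ok");
}
```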
While on flamingo rotation, I noticed on glados that several machines were stuck.
I checked their logs, and the very last two log messages on them were: (the <redacted> wasn't the same, but it seems irrelevant). After restarting the docker images, they keep working fine.
It seems to me that there is some deadlock happening, most likely in the same place (and most likely during gossiping), but further investigation is needed.