morph-dev opened 3 weeks ago
These lines in portalnet/src/gossip.rs look like potential candidates for the deadlock:
```rust
let permit = match utp_controller {
    Some(ref utp_controller) => match utp_controller.get_outbound_semaphore() {
        Some(permit) => Some(permit),
        None => continue,
    },
    None => None,
};
```
try_acquire_owned() doesn't block, so that rules out that candidate.
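For reference, a minimal std-only sketch of the non-blocking try-acquire semantics (std has no semaphore type, so `Mutex::try_lock` stands in for tokio's `Semaphore::try_acquire_owned` here):

```rust
use std::sync::Mutex;

fn main() {
    let lock = Mutex::new(());

    // Hold the lock, as a concurrent uTP transfer would hold a permit.
    let guard = lock.try_lock().expect("uncontended, must succeed");

    // A second try_lock returns Err immediately instead of blocking --
    // the same semantics as try_acquire_owned, which is why this code
    // path can't be the one sitting in a deadlock.
    assert!(lock.try_lock().is_err());

    drop(guard);
    assert!(lock.try_lock().is_ok());
    println!("ok");
}
```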
Two more nodes got stuck while running my last PR (which added log messages), so I was able to conclude that the deadlock is happening at this line: portalnet/src/gossip.rs#L61
```rust
let kbuckets = kbuckets.read();
```
The documentation says:

> Note that attempts to recursively acquire a read lock on a `RwLock` when the current thread already holds one may result in a deadlock.
With that being said, it seems that either this thread already holds the read lock, or something else is stuck in a deadlock and is holding the write lock indefinitely. I'm more inclined to think it's the former, considering that the last log message is always the same (otherwise I would expect something else to get stuck as well, leading to a different message).
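To illustrate why a recursive `read()` is dangerous, here is a self-contained sketch using `std::sync::RwLock` as a stand-in (the actual lock type in trin may differ, e.g. parking_lot's). While any read guard is alive, a writer blocks; with a fair/writer-preferring lock, that queued writer in turn blocks new readers, so a second `read()` on the same thread waits on the writer, which waits on us:

```rust
use std::sync::RwLock;

fn main() {
    // Stand-in for the kbuckets table; the element type is irrelevant here.
    let kbuckets = RwLock::new(vec![0u8; 4]);

    // Hold a read guard, as the gossip code does around gossip.rs#L61.
    let read_guard = kbuckets.read().unwrap();

    // A writer cannot acquire the lock while the read guard is alive.
    // If the lock is fair, that queued writer then stalls any further
    // readers too -- including a recursive read() on this same thread,
    // which would deadlock.
    assert!(kbuckets.try_write().is_err());

    drop(read_guard);
    assert!(kbuckets.try_write().is_ok());
    println!("ok");
}
```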
While on flamingo rotation, I noticed on glados that several machines were stuck.
I checked their logs, and the very last two log messages on them were: (the <redacted> wasn't the same, but it seems irrelevant). After restarting the docker images, they keep working fine.
It seems to me that there is some deadlock happening, most likely in the same place (and most likely during gossiping), but further investigation is needed.