ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0

Libp2p stop never resolves when shutting down node #6053

Open nflaig opened 1 year ago

nflaig commented 1 year ago

Describe the bug

There seems to be an issue since we updated libp2p in https://github.com/ChainSafe/lodestar/pull/6015. In some cases, the beacon node never shuts down and just hangs indefinitely. This does not happen every time, but if the node has been running for a longer period it seems to happen quite frequently.

Based on the logs it is clear that the issue is `await this.libp2p.stop();`, which just never resolves: https://github.com/ChainSafe/lodestar/blob/c50db8f7ae47198c76070487a6cfd0009d9bfd39/packages/beacon-node/src/network/core/networkCore.ts#L271

See libp2p-shutdown-issue-full.log

```
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "closing network core running in network worker"
Oct-19 11:40:56.818[network]         debug: closing network core running in network worker
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network sent goodbye to all peers"
Oct-19 11:40:57.826[network]         debug: network sent goodbye to all peers
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network peerManager closed"
Oct-19 11:40:57.838[network]         debug: network peerManager closed
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network reqResp closed"
Oct-19 11:40:57.863[network]         debug: network reqResp closed
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network lib2p closed"
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "terminating network worker"
```

This was also brought up by Sea Monkey on discord, corresponding logs sea-monkey-shutdown-issue.log

Expected behavior

`libp2p.stop` should always resolve in a timely manner.

Steps to reproduce

Run beacon node on any network

Additional context

Operating system

Linux

Lodestar version or commit hash

unstable (ab2dfdde652b0b0e4e9b2b96f8d8ee08fec6b720)

nazarhussain commented 1 year ago

@nflaig Do you know if there was any special configuration passed to the node when this behavior triggered?

nflaig commented 1 year ago

> @nflaig Do you know if there was any special configuration passed to the node when this behavior triggered?

I don't think it is related to a specific configuration; it really just happens after the node has been running for a while. I had the bn running for 15 minutes and didn't see the issue, but on two consecutive runs with an uptime of 12+ hours I did. I can't tell what the minimum runtime is or whether it happens consistently.

My command:

```sh
./lodestar beacon \
    --dataDir /home/devops/goerli/data/beacon \
    --rest \
    --rest.namespace '*' \
    --rest.address "0.0.0.0" \
    --metrics \
    --execution.urls http://localhost:8551 \
    --jwt-secret /home/devops/goerli/data/jwtsecret \
    --logLevel info \
    --network goerli \
    --checkpointSyncUrl "https://beaconstate-goerli.chainsafe.io/" \
    --subscribeAllSubnets
```
nflaig commented 1 year ago

Just to highlight the difference of this issue compared to https://github.com/ChainSafe/lodestar/issues/5642 and https://github.com/ChainSafe/lodestar/issues/5775

What happens now is that `libp2p.stop` never resolves; we basically have a hanging promise. The other two issues are related to an active handle that libp2p leaves behind after stop has already resolved. That active handle prevents the main process from shutting down, and since we switched to a worker it also prevents the worker from being terminated by Thread.terminate, which ends up as a hanging promise as well.
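The distinction drawn above can be sketched in a few lines of standalone Node.js (a hypothetical illustration, not Lodestar code): a promise that never settles blocks whoever awaits it, but only an active handle actually keeps the process alive.

```typescript
import net from "node:net";

// A promise that never settles leaves any code awaiting it stuck forever,
// but by itself it does NOT keep the Node.js event loop alive:
const hanging = new Promise<void>(() => {});
hanging.then(() => console.log("never printed"));

// An active handle (e.g. a listening socket) is what keeps the process
// running; until it is closed, the process cannot exit on its own:
const server = net.createServer().listen(0, () => {
  // closing releases the handle so the process can exit normally
  server.close();
});
```

With only `hanging` pending, the process would exit; with `server` still listening, it would not.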

Let me know if you need more details. As you can see based on the issues I spent quite some time debugging this as well.

wemeetagain commented 1 year ago

@achingbrain any ideas?

wemeetagain commented 1 year ago

Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?

nflaig commented 1 year ago

> Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?

Maybe wrap `libp2p.stop` into a `withTimeout`?

https://github.com/ChainSafe/lodestar/blob/18dc0c958189cb60fd8df7db6b1ebdf060edbd96/packages/beacon-node/src/network/peers/peerManager.ts#L664
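A minimal version of such a wrapper could look like this (a sketch, not Lodestar's actual `withTimeout` util; the helper name and error message are illustrative):

```typescript
// Hypothetical helper: race a promise against a timer so a hanging
// libp2p.stop() can no longer block shutdown indefinitely.
async function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout of ${timeoutMs} ms exceeded`)), timeoutMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // always clear the timer so it does not keep the event loop alive
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Usage sketch: give libp2p.stop() 5 seconds, then log and move on.
// await withTimeout(this.libp2p.stop(), 5_000).catch((e) => logger.warn(e));
```

The trade-off is that this hides the root cause: the process exits, but whatever resource was blocking `stop` may still be leaked.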

achingbrain commented 1 year ago

> any ideas?

Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.

nflaig commented 1 year ago

> any ideas?
>
> Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.

It is not related to active handles; the `libp2p.stop` promise just never resolves.

achingbrain commented 1 year ago

The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?

nflaig commented 1 year ago

> The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?

The issue is not that easy to reproduce; right now I only have the logs mentioned here: https://github.com/ChainSafe/lodestar/issues/6053#issue-1952057936. I will run Lodestar with `DEBUG="libp2p:*"` and see if I can catch it again.

nflaig commented 1 year ago

@achingbrain I caught some DEBUG logs (libp2p-debug.log). I didn't see any `libp2p:*` logs afterwards, but the `libp2p.stop` promise never resolved, and the process was not shutting down because of that.

You can ignore the `Error: PublishError.InsufficientPeers` logs; those seem to be caused by our REST API still accepting requests even though the network is closing.

achingbrain commented 1 year ago

It's not clear from the logs when shutdown is started (this line is not present), but it looks like it's still accepting incoming connections, so perhaps the TCP socket isn't being closed properly now.

Does it happen with @libp2p/tcp from before https://github.com/libp2p/js-libp2p/pull/2058? E.g. when pinned to 8.0.7

Or could one of these lines be throwing and the error swallowed?

nflaig commented 1 year ago

> It's not clear from the logs when shutdown is started (this line is not present)

Was capturing the logs in tmux, which seems to only keep the last 2k lines...

> Does it happen with @libp2p/tcp from before https://github.com/libp2p/js-libp2p/pull/2058? E.g. when pinned to 8.0.7

It has happened since https://github.com/ChainSafe/lodestar/pull/6015, which bumped @libp2p/tcp from 8.0.2 to 8.0.8, but that PR also includes a bunch of other libp2p dependency updates.

> Or could one of these lines be throwing and the error swallowed?

No, those are really harmless and just remove listeners. The code has not been modified there recently.

> E.g. when pinned to 8.0.7

I can try pinning @libp2p/tcp to 8.0.7, but the problem is that it is only easy to confirm that the issue still exists; I can't really confirm that it doesn't. We had issues with libp2p active handles before (https://github.com/ChainSafe/lodestar/issues/5642 and https://github.com/ChainSafe/lodestar/issues/5775), hadn't seen them for a week and regarded them as resolved, but then caught them again...

nflaig commented 1 year ago

> Does it happen with @libp2p/tcp from before https://github.com/libp2p/js-libp2p/pull/2058? E.g. when pinned to 8.0.7

Haven't seen the issue so far after downgrading @libp2p/tcp to 8.0.7. It's been only 3 days so I can't tell for sure, but there seems to be a high probability that the issue was introduced in @libp2p/tcp v8.0.8.

nflaig commented 1 year ago

@achingbrain it's been a week now and I haven't seen the issue, whereas before it would happen at least once a day on a long-running beacon node process. If https://github.com/libp2p/js-libp2p/pull/2058 was the only change included in v8.0.8, I'm pretty confident it is the cause of the problem.

achingbrain commented 1 year ago

The release notes for @libp2p/tcp@8.0.8 are here, and the only change was https://github.com/libp2p/js-libp2p/pull/2058

The change in that PR was around TCP shutdown, and since lodestar is now failing to shut down, it seems like a thread worth pulling on.

@nazarhussain @wemeetagain @maschad - can you please take a look and see if the changes in that PR have introduced this issue?

nflaig commented 6 months ago

@achingbrain is https://github.com/libp2p/js-libp2p/pull/2421 expected to fix this issue?

I can still see timeouts due to `libp2p.stop` not resolving, even though our unstable branch should have that change, as we recently updated @libp2p/tcp in https://github.com/ChainSafe/lodestar/pull/6676.

This issue is now quite easy to reproduce since we bumped our target peer count to 100, as it became much more frequent.

Was thinking maybe 5s is no longer sufficient for the timeout, but removing it (https://github.com/ChainSafe/lodestar/compare/unstable...nflaig/remove-libp2p-stop-timeout) reintroduces the issue where the process is just stuck, as the promise never resolves.
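One way to keep a timeout as a pure fallback rather than a hard limit on graceful shutdown is a last-resort exit guard (a hypothetical sketch, not Lodestar code; the function name is illustrative): arm an unref'd timer when shutdown starts, so the process force-exits only if shutdown genuinely hangs.

```typescript
// Hypothetical last-resort guard: if graceful shutdown does not finish
// within graceMs, force the process to exit instead of hanging forever.
function installShutdownGuard(graceMs: number): NodeJS.Timeout {
  const timer = setTimeout(() => {
    console.error(`Shutdown did not complete within ${graceMs} ms, forcing exit`);
    process.exit(1);
  }, graceMs);
  // unref() so the guard itself never keeps an otherwise-finished
  // process alive; it only fires if something else is still pending.
  timer.unref();
  return timer;
}
```

Because the timer is unref'd, a clean shutdown exits as soon as all other handles are released; the guard only triggers in the stuck case described above.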

philknows commented 1 week ago

Issue currently blocked until we can get js-libp2p 2.0 back in without issues.