nflaig opened 1 year ago
@nflaig Do you know if there were some special configuration passed to node when this behavior triggered?
I don't think it is related to a specific configuration, it really just happens if the node has been running for a longer while. I had the bn running for 15 minutes now and didn't see the issue, but on two consecutive runs with an uptime of 12+ hours I did. I can't tell what the minimum amount of time the bn has to run is, or whether it happens consistently.
My command:

```sh
./lodestar beacon \
--dataDir /home/devops/goerli/data/beacon \
--rest \
--rest.namespace '*' \
--rest.address "0.0.0.0" \
--metrics \
--execution.urls http://localhost:8551 \
--jwt-secret /home/devops/goerli/data/jwtsecret \
--logLevel info \
--network goerli \
--checkpointSyncUrl "https://beaconstate-goerli.chainsafe.io/" \
--subscribeAllSubnets
```
Just to highlight how this issue differs from https://github.com/ChainSafe/lodestar/issues/5642 and https://github.com/ChainSafe/lodestar/issues/5775:
What happens now is that libp2p.stop never resolves, we basically have a hanging promise. The other two issues are related to an active handle that is left behind by libp2p after stop has already resolved. The active handle causes issues shutting down the main process and, since we switched to a worker, prevents the worker from being terminated by Thread.terminate, which ends up as a hanging promise in the end as well.
Let me know if you need more details. As you can see from those issues, I have spent quite some time debugging this as well.
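To make the distinction concrete, here is a minimal TypeScript sketch (purely hypothetical code, not Lodestar's) of the two failure modes being contrasted:

```ts
// Failure mode in this issue: stop() returns a promise that never settles,
// so `await stop()` blocks shutdown forever.
function stopThatNeverResolves(): Promise<void> {
  return new Promise<void>(() => {
    // neither resolve nor reject is ever called
  });
}

// Failure mode in #5642 / #5775: stop() resolves fine, but it leaves an
// active handle (e.g. a timer or open socket) behind that keeps the
// Node.js event loop, and therefore the process or worker thread, alive.
async function stopThatLeavesAnActiveHandle(): Promise<void> {
  setInterval(() => {
    /* forgotten periodic task */
  }, 1_000);
}
```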
@achingbrain any ideas?
Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?
Maybe wrap libp2p.stop into a withTimeout?
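A minimal sketch of what such a wrapper could look like, using a plain Promise.race; Lodestar's actual withTimeout util may differ in signature and behavior:

```ts
// Hypothetical wrapper: give libp2p.stop() a bounded amount of time and
// continue shutdown (with a logged error) if it never resolves.
async function stopWithTimeout(libp2p: {stop(): Promise<void>}, timeoutMs = 5_000): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`libp2p.stop() did not resolve within ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    await Promise.race([libp2p.stop(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```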
any ideas?
Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.
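For reference, a small sketch of how why-is-node-running is typically used (per its docs the import should happen before anything else so it can instrument handle creation); the shutdown function here is hypothetical:

```ts
// Prints the stacks of whatever is keeping the event loop alive.
import whyIsNodeRunning from "why-is-node-running";

async function shutdown(stop: () => Promise<void>): Promise<void> {
  await stop();
  // If the process still doesn't exit, dump the remaining active handles.
  setTimeout(() => whyIsNodeRunning(), 2_000).unref();
}
```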
It's not related to active handles, the libp2p.stop promise just never resolves.
The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?
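If it helps with narrowing that down, a hypothetical diagnostic wrapper (not a libp2p API) could attribute a hang to a named stop step by logging when it stays pending:

```ts
// Hypothetical helper: warn if a given stop step is still pending after
// `warnAfterMs`, so the hanging component shows up by name in the logs.
async function stopWithLogging(name: string, stop: () => Promise<void>, warnAfterMs = 5_000): Promise<void> {
  const timer = setTimeout(() => {
    console.warn(`${name}.stop() still pending after ${warnAfterMs}ms`);
  }, warnAfterMs);
  try {
    await stop();
    console.debug(`${name}.stop() resolved`);
  } finally {
    clearTimeout(timer);
  }
}
```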
The issue is not that easy to reproduce, right now I only have the logs mentioned here https://github.com/ChainSafe/lodestar/issues/6053#issue-1952057936. I will run Lodestar with DEBUG="libp2p:*" and see if I can catch it again.
@achingbrain I caught some DEBUG logs (libp2p-debug.log), I didn't see any libp2p:* logs afterwards but the libp2p.stop promise never resolved, and the process was not shutting down due to that.
You can ignore the Error: PublishError.InsufficientPeers logs, those seem to be caused by our REST API still allowing requests to come in even though the network is closing.
It's not clear from the logs when shutdown is started (this line is not present), but it looks like it's still accepting incoming connections, so perhaps the TCP socket isn't being closed properly now.
Does it happen with @libp2p/tcp from before https://github.com/libp2p/js-libp2p/pull/2058? E.g. when pinned to 8.0.7
Or could one of these lines be throwing and the error swallowed?
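To illustrate that concern (purely hypothetical code, not the actual @libp2p/tcp implementation): if the code responsible for wiring up resolve() throws first and the error is caught and ignored, the awaited promise never settles:

```ts
// Hypothetical close() flow: a swallowed error before resolve() is wired up
// leaves the returned promise pending forever.
function closeSocket(socket: {close(cb: () => void): void}, removeListeners: () => void): Promise<void> {
  return new Promise<void>((resolve) => {
    try {
      removeListeners(); // if this throws...
      socket.close(() => resolve());
    } catch {
      // ...and the error is swallowed here without resolving or rejecting,
      // the caller's `await closeSocket(...)` hangs indefinitely.
    }
  });
}
```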
It's not clear from the logs when shutdown is started (this line is not present)
Was capturing the logs in tmux which seems to only capture the last 2k lines...
Does it happen with @libp2p/tcp from before https://github.com/libp2p/js-libp2p/pull/2058? E.g. when pinned to 8.0.7
It happens since https://github.com/ChainSafe/lodestar/pull/6015 which bumped @libp2p/tcp from 8.0.2 to 8.0.8, but that PR also includes a bunch of other libp2p dependency updates.
Or could one of these lines be throwing and the error swallowed?
No, those are really harmless and just remove listeners. The code has not been modified there recently.
E.g. when pinned to 8.0.7
I can try pinning @libp2p/tcp to 8.0.7, but the problem is that it is only easy to confirm that the issue still exists; I can't really confirm that it doesn't. We had issues before with libp2p active handles (https://github.com/ChainSafe/lodestar/issues/5642 and https://github.com/ChainSafe/lodestar/issues/5775), hadn't seen those for a week and regarded them as resolved, but then caught them again...
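For anyone wanting to try the same, pinning a transitive dependency can be done with a root-level resolutions entry in a Yarn workspace (npm uses overrides instead); this is just a sketch of the idea, not the exact Lodestar change:

```json
{
  "resolutions": {
    "@libp2p/tcp": "8.0.7"
  }
}
```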
Does it happen with @libp2p/tcp from before https://github.com/libp2p/js-libp2p/pull/2058? E.g. when pinned to 8.0.7
Haven't seen the issue so far after downgrading @libp2p/tcp to 8.0.7. It's been only 3 days so I can't tell for sure, but there seems to be a high probability that the issue was introduced in @libp2p/tcp v8.0.8.
@achingbrain it's been a week now and I haven't seen the issue, whereas before it would happen at least once a day on a long-running beacon node process. If https://github.com/libp2p/js-libp2p/pull/2058 was the only change included in v8.0.8, I'm pretty confident that it is the cause of the problem.
The release notes for @libp2p/tcp@8.0.8 are here and the only change was https://github.com/libp2p/js-libp2p/pull/2058
The change in that PR was around TCP shutdown and since lodestar is now failing to shut down it seems like a thread worth pulling on.
@nazarhussain @wemeetagain @maschad - can you please take a look and see if the changes in that PR have introduced this issue?
@achingbrain is https://github.com/libp2p/js-libp2p/pull/2421 expected to fix this issue?
I can still see timeouts due to libp2p.stop not resolving in time, but our unstable branch should have that change as we recently updated @libp2p/tcp in https://github.com/ChainSafe/lodestar/pull/6676.
This issue is now quite easy to reproduce since we bumped our peer count to 100, as it became much more frequent.
Was thinking maybe 5s is no longer sufficient for the timeout, but removing it (https://github.com/ChainSafe/lodestar/compare/unstable...nflaig/remove-libp2p-stop-timeout) reintroduces the issue where the process is just stuck, as the promise never resolves.
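As a sketch of that trade-off (a longer grace period, but never an indefinite hang), shutdown could reuse the hypothetical stopWithTimeout wrapper sketched earlier in this thread with more headroom and continue even if it times out:

```ts
// Sketch only: allow libp2p.stop() more than 5s, but never block shutdown forever.
async function closeNetwork(libp2p: {stop(): Promise<void>}): Promise<void> {
  try {
    await stopWithTimeout(libp2p, 30_000); // hypothetical helper from the earlier sketch
  } catch (e) {
    console.error("libp2p.stop() timed out, continuing shutdown anyway", e);
  }
}
```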
Issue currently blocked until we can get js-libp2p 2.0 back in without issues.
Describe the bug
There seems to be an issue since we updated libp2p in https://github.com/ChainSafe/lodestar/pull/6015. In some cases, the beacon node never shuts down and just hangs indefinitely. This does not happen all the time, but if the node has been running for a longer time it seems to happen quite frequently.
Based on the logs it is clear that the issue is await this.libp2p.stop(); which just never resolves, see https://github.com/ChainSafe/lodestar/blob/c50db8f7ae47198c76070487a6cfd0009d9bfd39/packages/beacon-node/src/network/core/networkCore.ts#L271. See libp2p-shutdown-issue-full.log
This was also brought up by Sea Monkey on Discord, corresponding logs: sea-monkey-shutdown-issue.log
Expected behavior
libp2p.stop should always resolve in a timely manner
Steps to reproduce
Run beacon node on any network
Additional context
Operating system
Linux
Lodestar version or commit hash
unstable (ab2dfdde652b0b0e4e9b2b96f8d8ee08fec6b720)