ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0
1.18k stars 289 forks source link

Network worker not shutting down #5775

Open nflaig opened 1 year ago

nflaig commented 1 year ago

Describe the bug

The network worker thread is not shutting down in some cases.

Force closing (kill) does not help either because worker does receive signal.

Jul-17 17:08:54.000[]                 info: Synced - slot: 6092144 - head: 0xe00e…5e94 - exec-block: valid(9361504 0x8772…) - finalized: 0x7710…765c:190377 - peers: 51
^CJul-17 17:09:01.307[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jul-17 17:09:06.176[]                 info: Synced - slot: 6092145 - head: (slot -1) 0xe00e…5e94 - exec-block: valid(9361504 0x8772…) - finalized: 0x7710…765c:190377 - peers: 50
^CJul-17 17:09:10.028[]                 info: Forcing process exit
^C^C^C^C^C./lodestar: line 7: 2357120 Killed                  node --trace-deprecation --max-old-space-size=4096 ./packages/cli/bin/lodestar.js "$@

It is pretty clear based on observed logs that the issue is the network worker.

"terminating network worker" is logged https://github.com/ChainSafe/lodestar/blob/5116493e87a9fe2102ebc7a3ff77677c90da3ddf/packages/beacon-node/src/network/core/networkCoreWorkerHandler.ts#L129

but "terminated network worker" is not https://github.com/ChainSafe/lodestar/blob/5116493e87a9fe2102ebc7a3ff77677c90da3ddf/packages/beacon-node/src/network/core/networkCoreWorkerHandler.ts#L131

I checked the log files, there is nothing that peaks out, likely just the same issue we see on the main thread with libp2p

Can likely be resolved by calling process.exit on worker explicitly or force closing it using another approach.

Expected behavior

Network worker should shut down cleanly and not hang the main process.

Steps to reproduce

Run Lodestar beacon node with --network.useWorker true flag.

Additional context

The problem seems to be with libp2p which in the end must take care of closing all connections / removing tcp listeners. There are several closed but also open issues regarding connections not being closed properly.

This comment https://github.com/libp2p/js-libp2p/issues/436#issuecomment-432624680 summarizes open tasks but there was no progress in a while.

Operating system

Linux

Lodestar version or commit hash

ec8153153c83d082cb89fca682d502e709e3f415

nflaig commented 1 year ago

Reopening until we can confirm this is fixed by upgrading libp2p, see https://github.com/ChainSafe/lodestar/issues/5642#issuecomment-1761433473.

nflaig commented 7 months ago

This is still an issue in the latest release as confirmed by seamonkey

nflaig commented 7 months ago

Might be resolved in the next node release, see fix https://github.com/nodejs/node/pull/51526 and another related issue https://github.com/tinylibs/tinypool/issues/54

wemeetagain commented 4 weeks ago

@nflaig I haven't heard of this happening lately, stale?

nflaig commented 4 weeks ago

This is kinda related to https://github.com/ChainSafe/lodestar/issues/6053 as well, I haven't seen this myself in a while which is good because hanging work was a pain to deal with.

Should be observed more closely and can probably close it

we also wanna consider removing the timeout if we don't see it anymore https://github.com/ChainSafe/lodestar/blob/cf722195b5ebabeb703eb973b2a9b6c70bc32d71/packages/beacon-node/src/network/core/networkCore.ts#L269-L272