ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0

Lodestar struggles to get max peers for some users #5377

Open nflaig opened 1 year ago

nflaig commented 1 year ago

Problem

Some users report that their Lodestar beacon node (BN) takes up to 40 minutes to reach max peers (50).

Logs

Discord

nflaig commented 1 year ago

It looks like the problem here is that the node is not able to range sync.

There are a lot of beacon_blocks_by_range errors.

Dial timeouts

Apr-03 08:42:45.452[network]       verbose: Req  error method=beacon_blocks_by_range, encoding=ssz_snappy, client=Unknown, peer=16...tknFJJ, requestId=670 code=REQUEST_ERROR_DIAL_TIMEOUT
Error: REQUEST_ERROR_DIAL_TIMEOUT
    at file:///usr/app/packages/reqresp/src/request/index.ts:116:15
    at sendRequest (file:///usr/app/packages/reqresp/src/request/index.ts:104:20)
    at ReqRespBeaconNode.sendRequest (file:///usr/app/packages/reqresp/src/ReqResp.ts:152:7)
    at collectSequentialBlocksInRange (file:///usr/app/packages/beacon-node/src/network/reqresp/utils/collectSequentialBlocksInRange.ts:14:20)
    at beaconBlocksMaybeBlobsByRange (file:///usr/app/packages/beacon-node/src/network/reqresp/beaconBlocksMaybeBlobsByRange.ts:36:20)
    at wrapError (file:///usr/app/packages/beacon-node/src/util/wrapError.ts:18:32)
    at SyncChain.sendBatch (file:///usr/app/packages/beacon-node/src/sync/range/chain.ts:400:19)

Time to first byte timeouts

Apr-03 08:39:30.822[network]       verbose: Req  error method=beacon_blocks_by_range, encoding=ssz_snappy, client=Lighthouse, peer=16...ZDf5pb, requestId=55 code=REQUEST_ERROR_TTFB_TIMEOUT
Error: REQUEST_ERROR_TTFB_TIMEOUT
    at getError (file:///usr/app/packages/reqresp/src/request/index.ts:176:29)
    at EventTarget.abortHandler (file:///usr/app/packages/reqresp/src/utils/abortableSource.ts:26:48)
    at EventTarget.[nodejs.internal.kHybridDispatch] (node:internal/event_target:735:20)
    at EventTarget.dispatchEvent (node:internal/event_target:677:26)
    at abortSignal (node:internal/abort_controller:308:10)
    at AbortController.abort (node:internal/abort_controller:338:5)
    at Timeout.<anonymous> (file:///usr/app/packages/reqresp/src/request/index.ts:162:64)
    at listOnTimeout (node:internal/timers:569:17)
    at processTimers (node:internal/timers:512:7)

Timeout between <response_chunk> exceeded

Apr-03 08:39:35.002[sync]          verbose: Batch download error id=Finalized, startEpoch=191862, status=Downloading method=beacon_blocks_by_range, encoding=ssz_snappy, peer=16Uiu2HAmVqjEaG7SRVEe7hBmLWeyDaUoN1bSXaYppEJ3D1JeNcAH, code=REQUEST_ERROR_RESP_TIMEOUT
Error: REQUEST_ERROR_RESP_TIMEOUT
    at sendRequest (file:///usr/app/packages/reqresp/src/request/index.ts:219:13)
    at ReqRespBeaconNode.sendRequest (file:///usr/app/packages/reqresp/src/ReqResp.ts:152:7)
    at collectSequentialBlocksInRange (file:///usr/app/packages/beacon-node/src/network/reqresp/utils/collectSequentialBlocksInRange.ts:14:20)
    at beaconBlocksMaybeBlobsByRange (file:///usr/app/packages/beacon-node/src/network/reqresp/beaconBlocksMaybeBlobsByRange.ts:36:20)
    at wrapError (file:///usr/app/packages/beacon-node/src/util/wrapError.ts:18:32)
    at SyncChain.sendBatch (file:///usr/app/packages/beacon-node/src/sync/range/chain.ts:400:19)
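
The three timeout codes in these logs correspond to three stages of a request: opening the stream (DIAL), waiting for the first response chunk (TTFB), and waiting between subsequent chunks (RESP). The following is a minimal sketch of that staging, not Lodestar's actual reqresp code. `withDeadline`, `Transport`, `Timeouts` and `requestWithStagedTimeouts` are hypothetical names, and the timeout values are left as parameters rather than assumed.

```typescript
type TimeoutCode =
  | "REQUEST_ERROR_DIAL_TIMEOUT"
  | "REQUEST_ERROR_TTFB_TIMEOUT"
  | "REQUEST_ERROR_RESP_TIMEOUT";

// Rejects with `code` if `task` has not settled within `ms` milliseconds.
function withDeadline<T>(task: Promise<T>, ms: number, code: TimeoutCode): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(code)), ms);
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Hypothetical transport: `dial` opens a stream, `chunks` reads response chunks from it.
interface Transport<S> {
  dial(): Promise<S>;
  chunks(stream: S): AsyncIterator<Uint8Array>;
}

interface Timeouts {
  dialMs: number;
  ttfbMs: number;
  respMs: number;
}

async function requestWithStagedTimeouts<S>(
  transport: Transport<S>,
  timeouts: Timeouts
): Promise<Uint8Array[]> {
  // Stage 1: opening the stream must finish within dialMs.
  const stream = await withDeadline(transport.dial(), timeouts.dialMs, "REQUEST_ERROR_DIAL_TIMEOUT");

  const iterator = transport.chunks(stream);
  const received: Uint8Array[] = [];
  let first = true;
  while (true) {
    // Stage 2: first chunk within ttfbMs. Stage 3: every later chunk within respMs.
    const next = await withDeadline(
      iterator.next(),
      first ? timeouts.ttfbMs : timeouts.respMs,
      first ? "REQUEST_ERROR_TTFB_TIMEOUT" : "REQUEST_ERROR_RESP_TIMEOUT"
    );
    if (next.done) break;
    received.push(next.value);
    first = false;
  }
  return received;
}
```

The sketch only rejects the caller on timeout. A real implementation would also need to abort the underlying stream.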

Lodestar sends beacon_blocks_by_range requests to nodes where the connection is already being closed

Apr-03 08:42:19.636[network]       verbose: Req  error method=beacon_blocks_by_range, encoding=ssz_snappy, client=Teku, peer=16...fTwKhq, requestId=659 code=REQUEST_ERROR_DIAL_ERROR, error=the connection is being closed
Error: the connection is being closed
    at ConnectionImpl.newStream (file:///usr/app/node_modules/libp2p/src/connection/index.ts:110:21)
    at Libp2pNode.dialProtocol (file:///usr/app/node_modules/libp2p/src/libp2p.ts:374:29)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at runNextTicks (node:internal/process/task_queues:64:3)
    at listOnTimeout (node:internal/timers:538:9)
    at processTimers (node:internal/timers:512:7)
    at file:///usr/app/packages/reqresp/src/request/index.ts:107:22
    at withTimeout (file:///usr/app/packages/utils/src/timeout.ts:19:12)
    at sendRequest (file:///usr/app/packages/reqresp/src/request/index.ts:104:20)
    at ReqRespBeaconNode.sendRequest (file:///usr/app/packages/reqresp/src/ReqResp.ts:152:7)
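
One possible guard against this is to filter out peers whose connection is no longer open before picking a peer for a request. This is only a sketch and not how Lodestar currently selects peers. `PeerConnection`, `ConnectionStatus` and `peersWithOpenConnection` are hypothetical, and the status values are assumed to mirror the open/closing/closed states implied by the error message above.

```typescript
// Hypothetical connection model; field names are assumptions for illustration.
type ConnectionStatus = "OPEN" | "CLOSING" | "CLOSED";

interface PeerConnection {
  peerId: string;
  status: ConnectionStatus;
}

// Returns the ids of peers that have at least one fully open connection.
// Peers whose only connections are closing or closed are skipped.
function peersWithOpenConnection(connections: PeerConnection[]): string[] {
  const open = new Set<string>();
  for (const conn of connections) {
    if (conn.status === "OPEN") open.add(conn.peerId);
  }
  return Array.from(open);
}
```

A check like this cannot remove the error entirely, since a connection can still start closing between the check and the dial.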

The issue does not seem to be isolated to a specific client

Summary of the beacon_blocks_by_range error logs per client:
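
A tally like this can be derived from the verbose logs with a small script. The following is a minimal sketch, not the script used for this issue. It assumes the `Req  error` line layout shown in the log excerpts above (`method=`, `client=`, `code=` fields). `field` and `tallyReqErrors` are hypothetical helper names.

```typescript
// Maps client name -> error code -> number of occurrences.
type ErrorTally = Map<string, Map<string, number>>;

// Extracts the value of `name=value` from a log line; values end at a comma or whitespace.
function field(line: string, name: string): string | undefined {
  const match = new RegExp(`\\b${name}=([^,\\s]+)`).exec(line);
  return match ? match[1] : undefined;
}

// Counts "Req error" log lines for one reqresp method, grouped by client and error code.
function tallyReqErrors(logText: string, method = "beacon_blocks_by_range"): ErrorTally {
  const tally: ErrorTally = new Map();
  for (const line of logText.split("\n")) {
    // Skip stack trace lines and log lines for other methods.
    if (!/Req\s+error/.test(line) || field(line, "method") !== method) continue;
    const client = field(line, "client") ?? "Unknown";
    const code = field(line, "code") ?? "NO_CODE";
    const perClient = tally.get(client) ?? new Map<string, number>();
    perClient.set(code, (perClient.get(code) ?? 0) + 1);
    tally.set(client, perClient);
  }
  return tally;
}
```

Passing the contents of a log file to `tallyReqErrors` returns one inner map per client, which can then be printed as a table.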

Disconnect reasons:

Disconnect reason is predominantly "Client has too many peers"

Why are there so many timeouts?