0xPolygonHermez / cdk-erigon

Ethereum implementation on the efficiency frontier
GNU Lesser General Public License v3.0

[RPC] stuck during syncing #1485

Open hexoscott opened 3 days ago

hexoscott commented 3 days ago

During syncing, the batches stage gets stuck and makes no further progress. The pattern appears in the image below:

[Image: telegram-cloud-photo-size-5-6188431868307815870-y]

It looks as though the stream client is missing some cleanup code, either at the start of a new stage or at the end, before the stage is completed.

zjg555543 commented 2 days ago

Detail logs: cdk-error.log

The if err := r.connectDatastream(); err != nil check has been deleted, so do we need to retry when we get a disconnect?

https://github.com/0xPolygonHermez/cdk-erigon/pull/1297/files#diff-c43323397404731c60022d8f8de469c44812cfa36b82ec7fd99dc429e1b5803eL40

Vui-Chee commented 2 days ago

Just to add another thought to this discussion: in the implementation of StreamClient, I noticed that whenever you read from the TCP connection (readBuffer), you set the read deadline (SetReadDeadline) for the connection. The issues we have been seeing show the error (i/o timeout).

Before you start reading full blocks, net.Dial succeeds, meaning a TCP connection has already been established and the data stream client has been constructed. Is it possible that the stream client prematurely shuts down its TCP connection during the read process, so that no further entries are written to entryChan (causing the batch processing loop to stall)?

hexoscott commented 2 days ago

Hi @Vui-Chee - this has been a long journey on the stream client. Certain calls to the datastream host will terminate the connection unexpectedly, and we get occasional drops due to inactivity etc. We're continuing to investigate.

ToniRamirezM commented 1 day ago

Regarding the inactivity, check you have set a value for InactivityTimeout and InactivityCheckInterval

https://github.com/0xPolygon/zkevm-data-streamer/blob/main/datastreamer/config.go#L18

giskook commented 1 day ago

> Regarding the inactivity, check you have set a value for InactivityTimeout and InactivityCheckInterval

Our data-streamer is on v0.2.3-RC4, so we do not have that configuration yet.

hexoscott commented 1 day ago

There is a fix inbound for this, just going through CI now

hexoscott commented 1 day ago

Ref: #1492

giskook commented 1 day ago

> Ref: #1492

Cool, many thanks, I will try this fix.