0xPolygonHermez / cdk-erigon

Ethereum implementation on the efficiency frontier
GNU Lesser General Public License v3.0

v2.60.0-beta5 has client side datastream issues #1423

Closed. praetoriansentry closed this issue 1 week ago

praetoriansentry commented 2 weeks ago

I'm having connectivity issues between the RPC and the Sequencer over the datastream port using v2.60.0-beta5.


It appears that the RPC is continually resetting its connection to the datastream. The RPC logs show:

[INFO] [11-08|00:47:28.252] [3/15 Batches] Starting batches stage
[WARN] [11-08|00:47:29.252] GetHeader: readBuffer: socket error: io.ReadFull: read tcp 172.16.0.47:42912->172.16.0.5:6900: i/o timeout
[WARN] [11-08|00:47:30.253] GetHeader: readBuffer: socket error: io.ReadFull: read tcp 172.16.0.47:42918->172.16.0.5:6900: i/o timeout
[WARN] [11-08|00:47:31.254] GetHeader: readBuffer: socket error: io.ReadFull: read tcp 172.16.0.47:42926->172.16.0.5:6900: i/o timeout
[WARN] [11-08|00:47:32.254] GetHeader: readBuffer: socket error: io.ReadFull: read tcp 172.16.0.47:42940->172.16.0.5:6900: i/o timeout
[WARN] [11-08|00:47:33.255] GetHeader: readBuffer: socket error: io.ReadFull: read tcp 172.16.0.47:42942->172.16.0.5:6900: i/o timeout
[WARN] [11-08|00:47:34.257] GetHeader: readBuffer: socket error: io.ReadFull: read tcp 172.16.0.47:42944->172.16.0.5:6900: i/o timeout
[WARN] [11-08|00:47:34.257] [3/15 Batches] Failed to get latest l2 block from datastream: failed to get the L2 block within 5 attempts
[INFO] [11-08|00:47:34.257] [3/15 Batches] Finished Batches stage
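
For context on what this error means at the socket level: in Go, io.ReadFull on a net.Conn whose read deadline has already expired fails immediately with exactly this kind of i/o timeout. Below is a minimal standalone sketch, not cdk-erigon code; tying the behaviour to the datastreamer timeout flag is an assumption based on the mitigation discussed further down.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

func main() {
	// A throwaway server that is slow to respond.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		time.Sleep(time.Second)
		c.Write([]byte("data"))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Analogous to a zero read timeout: the deadline is already in the
	// past, so every read fails before any data can arrive.
	conn.SetReadDeadline(time.Now())

	buf := make([]byte, 4)
	if _, err := io.ReadFull(conn, buf); err != nil {
		fmt.Println(err) // read tcp 127.0.0.1:...->127.0.0.1:...: i/o timeout
	}
}
```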
cffls commented 2 weeks ago

The issue can be mitigated by removing the zkevm.l2-datastreamer-timeout flag from the config or by setting it to a non-zero value, e.g. 1s.
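
For illustration, here is roughly how that second mitigation might look in a YAML config file. This is a sketch: the key is assumed to mirror the CLI flag name, and the value is just the 1s example above, not a tuned recommendation.

```yaml
# assumed config key, mirroring the CLI flag name
zkevm.l2-datastreamer-timeout: "1s"
```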

However, even with that change, the RPC node catches up very slowly when it is behind. It seems the node was stuck getting the highest block from the datastream.


cffls commented 2 weeks ago

https://github.com/0xPolygonHermez/cdk-erigon/pull/1424 addresses the slow sync issue.

cffls commented 2 weeks ago

The slow sync issue seems to only happen in Normalcy mode, where the verified batch is always 0. As a result, if we have downloaded batches beyond the verified batch, we don't execute them all immediately, only the next one (see short circuit log here).
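
To make the short circuit concrete, here is a hypothetical sketch of the target-batch selection. The names are invented for illustration and are not cdk-erigon's actual identifiers.

```go
package main

import "fmt"

// executionTarget picks how far execution may advance given what has been
// downloaded, executed, and verified on the L1.
func executionTarget(downloaded, executed, verified uint64) uint64 {
	if executed < verified {
		// Everything up to the L1-verified batch can be executed in bulk
		// and checked with a single state-root comparison.
		return verified
	}
	if downloaded > executed {
		// Beyond the verified point, advance one batch at a time.
		return executed + 1
	}
	return executed
}

func main() {
	// In Normalcy mode the verified batch is always 0, so even with many
	// batches downloaded the node only steps forward one batch per loop.
	fmt.Println(executionTarget(100, 10, 0)) // prints 11, not 100
}
```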

@hexoscott / @V-Staykov do you see any reason why we wouldn't want to execute all the downloaded batches immediately, considering the RPC can already detect a reorg and unwind automatically?

hexoscott commented 2 weeks ago

Hey @cffls. To answer the question on the short circuit: this is a choice made to sanity check what the network has downloaded. For example, when we boot up an RPC node we check the L1 for the latest verified batch; we can then execute all of those blocks and do a single state root check, and if it matches the verification we're good. Beyond that point though we're in the wild west, so we only process one batch at a time and verify the state root from the datastream, which of course slows syncing down.
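
A hedged sketch of that two-phase model, with interfaces invented purely for illustration (this is not the actual cdk-erigon code path):

```go
package main

// Node is an invented interface standing in for the syncing node.
type Node interface {
	LatestVerifiedBatchFromL1() uint64
	HighestDownloadedBatch() uint64
	ExecuteBatches(from, to uint64) ([32]byte, error)
	L1VerificationRoot(batch uint64) [32]byte
	DatastreamRoot(batch uint64) [32]byte
}

// syncBatches bulk-executes everything covered by an L1 verification, then
// advances one batch at a time, checking each root against the datastream.
func syncBatches(n Node) {
	verified := n.LatestVerifiedBatchFromL1()

	// Phase 1: execute all L1-verified batches, then do a single
	// state-root check against the verification.
	root, err := n.ExecuteBatches(1, verified)
	if err != nil || root != n.L1VerificationRoot(verified) {
		panic("state root mismatch against L1 verification")
	}

	// Phase 2: "the wild west" - check each batch against the datastream,
	// panicking rather than letting the RPC serve incorrect data.
	for b := verified + 1; b <= n.HighestDownloadedBatch(); b++ {
		root, err := n.ExecuteBatches(b, b)
		if err != nil || root != n.DatastreamRoot(b) {
			panic("state root mismatch against datastream")
		}
	}
}

func main() {}
```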

In the case of pessimistic proofs (not sure if this is Normalcy or not; I lose track of the names for everything as they have a habit of changing) this doesn't make sense, because there aren't really any batches or verifications to work from. In that case it makes sense for the short circuit code to just say "go ahead, all is fine" and let the node sync and execute everything.

hexoscott commented 2 weeks ago

The flags mentioned above only affect the sequencer's DS host, not the client. Some calls the client makes force a disconnect from the server for some reason (or at least that's the behaviour we've seen), so from the looks of things we'll need to work around that.

cffls commented 2 weeks ago

Thanks @hexoscott !

A follow up question regarding this:

"Beyond that point though we're in the wild west, so we only process one batch at a time and verify the state root from the datastream, which of course slows syncing down."

The rollback will be automatic if a block hash mismatches. Why not just always sync to the latest downloaded batch for both FEP and PP?

hexoscott commented 2 weeks ago

It's about trusting the DS, really: we receive the data from an effectively unknown source, and without verification on the L1 we check batch by batch that the state root matches the expected one. If it doesn't, we panic rather than have the RPC serve incorrect data.

hexoscott commented 2 weeks ago

I think it makes sense to have a mode of operation for PP that tells the node not to care about this process, and to keep that separate from normal zkEVM/CDK networks.

Sharonbc01 commented 2 weeks ago

@cffls is working on a PR for PP to add to Beta 6. @hexoscott will close this issue after the beta6 release and validation, Monday AM European time.

hexoscott commented 1 week ago

The code is available from beta6 onwards.