0xPolygonHermez / cdk-erigon

Ethereum implementation on the efficiency frontier
GNU Lesser General Public License v3.0

RPC/Sequencer stuck under load #1461

Closed xavier-romero closed 3 days ago

xavier-romero commented 1 week ago

A partner reported that the sequencer stops processing transactions under high load, and they traced the problem to the db.read.concurrency setting: they report that increasing that value makes the issue go away. I did not reproduce their exact case, but I hit a very similar scenario. I set db.read.concurrency to 5 so that "high load" is easy to reach, then sent simple EOA transfers to the RPC; it gets stuck after a few transactions.

Even after stopping the transactions for hours, with no activity at all, the sequencer remains "stuck".

Sharonbc01 commented 1 week ago

@xavier-romero can you confirm which fork this bug was observed on, please?

Sharonbc01 commented 1 week ago

@mandrigin will look into this, but it is not seen as a showstopper.

praetoriansentry commented 1 week ago

TODO: we should collect a pprof/goroutine dump when the instance is stuck.

hexoscott commented 1 week ago

Deja vu - we fixed this in cdk-erigon-lib but as part of the upstream merge of 2.60 the updates to MDBX weren't ported over.

Sharonbc01 commented 1 week ago

The fix is in beta 10 and needs to be validated.

giskook commented 6 days ago

With this fix (https://github.com/0xPolygonHermez/cdk-erigon/pull/1472), the X Layer erigon RPC is still stuck.

giskook commented 5 days ago

With https://github.com/0xPolygonHermez/cdk-erigon/releases/tag/v2.60.0-beta10 and the following configuration, the RPC gets stuck.

datadir: /data/erigon-data/xlayer-mainnet
chain: xlayer-mainnet
http: true
private.api.addr: localhost:18091
zkevm.l2-chain-id: 196
zkevm.l2-sequencer-rpc-url: https://rpc.xlayer.tech
zkevm.l2-datastreamer-url: stream.xlayer.tech:8800
zkevm.l1-chain-id: 1
zkevm.l1-rpc-url: https://rpc.ankr.com/eth/{replace to your eth rpc}

zkevm.address-sequencer: "0xAF9d27ffe4d51eD54AC8eEc78f2785D7E11E5ab1"
zkevm.address-zkevm: "0x2B0ee28D4D51bC9aDde5E58E295873F61F4a0507"
zkevm.address-rollup: "0x5132A183E9F3CB7C848b0AAC5Ae0c4f0491B7aB2"
zkevm.address-ger-manager: "0x580bda1e7A0CFAe92Fa7F6c20A3794F169CE3CFb"

zkevm.l1-rollup-id: 3
zkevm.l1-first-block: 19218658
zkevm.l1-block-range: 2000
zkevm.l1-query-delay: 1000
zkevm.datastream-version: 3

http.api: [eth, debug, net, trace, web3, erigon, zkevm]
http.addr: 0.0.0.0
http.port: 28544

hexoscott commented 4 days ago

Is this a problem with syncing and holding the network tip @giskook ? I see you mentioned eth_getLogs crashing which is a different issue altogether.

giskook commented 4 days ago

> Is this a problem with syncing and holding the network tip @giskook ? I see you mentioned eth_getLogs crashing which is a different issue altogether.

Maybe it's a different issue, let's figure out the stuck one first.

giskook commented 4 days ago

curl "http://127.0.0.1:47050/debug/pprof/goroutine?debug=1" > goroutines.log
Attached: goroutines.log

curl "http://127.0.0.1:47050/debug/pprof/profile?seconds=60" > pprof.bin
Attached: pprof.bin.log

Sharonbc01 commented 4 days ago

Igor noted this is specific to X Layer.

Sharonbc01 commented 3 days ago

@hexoscott will close this issue, as the RPC/Sequencer stuck problem is resolved. Scott will open a new issue for the OKX-specific problem.

hexoscott commented 3 days ago

Closing this down as the deadlock problem in the sequencer is now fixed. I have opened #1485 to tackle the RPC syncing issue separately.