0xPolygonHermez / zkevm-node

Go implementation of a node that operates the Polygon zkEVM Network
Other
524 stars 675 forks source link

Potential seq halt risk #3630

Open giskook opened 3 months ago

giskook commented 3 months ago

System information

zkEVM Node version: v0.6.7 OS & Version: Linux Commit hash : (-) Network: Mainnet/Testnet

Expected behaviour

Actual behaviour

X Layer(node 0.6.7) under load test will halt sequencer

Steps to reproduce the behaviour

Under load test, when create a new l2block, because the prev lock is not proccessed, so the new l2block may halt the sequencer.

Backtrace

node's log (X Layer's, but the logical is the same)

/src/log/log.go:142 github.com/0xPolygonHermez/zkevm-node/log.appendStackTraceMaybeArgs()
/src/log/log.go:251 github.com/0xPolygonHermez/zkevm-node/log.Errorf()
/src/sequencer/finalizer.go:957 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Halt()
/src/sequencer/l2block.go:646 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).openNewWIPL2Block()
/src/sequencer/l2block.go:529 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeWIPL2Block()
/src/sequencer/finalizer.go:399 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeBatches()
/src/sequencer/finalizer.go:186 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Start()
)","pid":1,"version":"v0.3.9-5-gbcd78f1b","stacktrace":"github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Halt
    /src/sequencer/finalizer.go:957
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).openNewWIPL2Block
    /src/sequencer/l2block.go:646
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeWIPL2Block
    /src/sequencer/l2block.go:529
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeBatches
    /src/sequencer/finalizer.go:399
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Start
    /src/sequencer/finalizer.go:186"

or

failed to execute new wip L2 block [274], error: key not found in the database 
/src/log/log.go:142 github.com/0xPolygonHermez/zkevm-node/log.appendStackTraceMaybeArgs()
/src/log/log.go:251 github.com/0xPolygonHermez/zkevm-node/log.Errorf()
/src/sequencer/finalizer.go:957 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Halt()
/src/sequencer/l2block.go:646 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).openNewWIPL2Block()
/src/sequencer/batch.go:206 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeWIPBatch()
/src/sequencer/finalizer.go:470 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeBatches()
/src/sequencer/finalizer.go:186 github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Start()
)","pid":1,"version":"v0.3.9-1-g01328c71","stacktrace":"github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Halt
    /src/sequencer/finalizer.go:957
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).openNewWIPL2Block
    /src/sequencer/l2block.go:646
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeWIPBatch
    /src/sequencer/batch.go:206
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).finalizeBatches
    /src/sequencer/finalizer.go:470
github.com/0xPolygonHermez/zkevm-node/sequencer.(*finalizer).Start
    /src/sequencer/finalizer.go:186"

executor's log

20240513_044417_037003 2999306 df7f640 ExecutorServiceImpl::ProcessBatchV2() got sequencerAddr=d6dda5aa7749142b7fda3fe4662c9f346101b8a6 batchL2DataLength=9 batchL2Data=0x0b0000000300000000...0b0000000300000000 oldStateRoot=3f39566612c98487a4cb091740077f56aff652512fe5401bf7c2d25b4c925fdd oldAccInputHash=0 oldBatchNum=573763 chainId=195 forkId=9 generated globalExitRoot=0 timestampLimit=1715575457 from=0x bUpdateMerkleTree=0 bNoCounters=0 bGetKeys=0 bSkipVerifyL1InfoRoot=1 bSkipFirstChangeL2Block=0 bSkipWriteBlockInfoRoot=1 traceConfig=bEnabled=0,bDisableStorage=0,bDisableStack=0,bEnableMemory=0,bEnableReturnData=0,txHashToGenerateFullTrace= UUID=f83e4c2b-e84b-42b7-9f62-00dfcc803642 gasLimit=0 l1InfoTreeData.size=1=120975, stateOverride.size=0

20240513_044419_537014 2999306 df7f640 zkWarning: Database::read() failed calling readRemote() with error=ZKR_DB_KEY_NOT_FOUND; will retry after 100000us key=59f00c7c75346d1b11c70e26613ad99ffbe59f486e63bdcd6b4e25aaac3d7a93 i=9

20240513_044419_662293 2999306 df7f640 zkError: MainExecutor::logError() proverRequest.result=ZKR_DB_KEY_NOT_FOUND step=0 eval=153 zkPC=11880 rom.line={ inFREE=1 inFREE0=0 freeInTag={} jmpAddr=13163 call=1 useJmpAddr=1 setSR=1 sWR=1 fileName=process-change-l2-block.zkasm line=36 lineStr=$=> SR:SSTORE, CALL(initBlockInfoTree)} uuid=f83e4c2b-e84b-42b7-9f62-00dfcc803642

When submitting logs: please submit them as text and not screenshots.

giskook commented 3 months ago

fixed pr: https://github.com/0xPolygonHermez/zkevm-node/pull/3629

giskook commented 3 months ago

It seems increase sequencer's executor's dbMTCacheSize can cover this situation.

giskook commented 3 months ago

Only happend when sequencer set SequentialProcessL2Block to false

giskook commented 3 months ago

I think from this pr: https://github.com/0xPolygonHermez/zkevm-node/pull/3604 sequencer may halt under loadtest situation, root cause: wip batch (tx-by-tx process) will build his state on executor's dbMTCache, but dbMTCache is a LRU, so after StateRootSyncInterval time running, old state may dropped by executor. Then when sequencer open a new L2block, it may fail because of missing key.

Way to solve:

X Layer testnet may halt following below configuration: sequencer

SequentialProcessL2Block = false
Sequencer.Finalizer.StateRootSyncInterval = "3600s"

executor

    "dbMTCacheSize": 1024,
agnusmor commented 3 months ago

Hi, this is a known issue with the associative cache in executor/prover, to fix it you need to disable it (useAssociativeCache = false) in executor configuration. The prover team is working to fix the issue.

giskook commented 3 months ago

useAssociativeCache

Our executor do not set useAssociativeCache, after check the prover's code, i think it's default to false.

giskook commented 3 months ago

Executor‘s configure

{
    "runExecutorServer": true,
    "runExecutorClient": false,
    "runExecutorClientMultithread": false,

    "runHashDBServer": true,
    "runHashDBTest": false,

    "runAggregatorServer": false,
    "runAggregatorClient": false,

    "runFileGenBatchProof": false,
    "runFileGenAggregatedProof": false,
    "runFileGenFinalProof": false,
    "runFileProcessBatch": false,
    "runFileProcessBatchMultithread": false,
    "runFileExecutor": false,

    "runKeccakScriptGenerator": false,
    "runKeccakTest": false,
    "runStorageSMTest": false,
    "runBinarySMTest": false,
    "runMemAlignSMTest": false,
    "runSHA256Test": false,
    "runBlakeTest": false,

    "executeInParallel": true,
    "useMainExecGenerated": true,
    "useProcessBatchCache": false,

    "saveRequestToFile": false,
    "saveInputToFile": false,
    "saveDbReadsToFile": false,
    "saveDbReadsToFileOnChange": false,
    "saveOutputToFile": false,
    "saveProofToFile": false,
    "saveResponseToFile": false,
    "saveFilesInSubfolders": false,

    "loadDBToMemCache": false,
    "loadDBToMemCacheInParallel": false,
    "loadDBToMemTimeout": 30000000,
    "dbMTCacheSize": 1024,
    "dbProgramCacheSize": 1024,

    "opcodeTracer": false,
    "logRemoteDbReads": false,
    "logExecutorServerInput": false,
    "logExecutorServerInputGasThreshold": 1048576,
    "logExecutorServerResponses": false,
    "logExecutorServerTxs": false,

    "executorServerPort": 50071,
    "executorROMLineTraces": false,
    "executorTimeStatistics": false,
    "executorClientPort": 50071,
    "executorClientHost": "127.0.0.1",
    "executorClientLoops": 1,

    "hashDBServerPort": 50061,
    "hashDBURL": "local",

    "aggregatorServerPort": 50081,
    "aggregatorClientPort": 50081,
    "aggregatorClientHost": "127.0.0.1",
    "aggregatorClientMockTimeout": 10000000,
    "aggregatorClientWatchdogTimeout": 60000000,

    "mapConstPolsFile": false,
    "mapConstantsTreeFile": false,

    "inputFile": "testvectors/performance/121_tx_input.json",
    "inputFile2": "testvectors/aggregatedProof/recursive1.zkin.proof_1.json",

    "outputPath": "runtime/output",
    "configPath": "config",

    "zkevmCmPols_disabled": "runtime/zkevm.commit",
    "zkevmCmPolsAfterExecutor_disabled": "runtime/zkevm.commit",
    "c12aCmPols": "runtime/c12a.commit",
    "recursive1CmPols_disabled": "runtime/recursive1.commit",
    "recursive2CmPols_disabled": "runtime/recursive2.commit",
    "recursivefCmPols_disabled": "runtime/recursivef.commit",
    "finalCmPols_disabled": "runtime/final.commit",

    "publicsOutput": "public.json",
    "proofFile": "proof.json",

    "databaseURL": "postgresql://xgon_prover_user:xxxxx@xgon-state-pap.cluster-cpvlnksujjmh.ap-east-1.rds.amazonaws.com:5432/xgon_prover_db",
    "dbNodesTableName": "state.nodes",
    "dbProgramTableName": "state.program",
    "dbMultiWrite": true,
    "dbFlushInParallel": false,
    "dbConnectionsPool": true,
    "dbNumberOfPoolConnections": 506,
    "dbMetrics": false,
    "dbClearCache": false,
    "dbGetTree": true,
    "dbReadRetryDelay": 100000,
    "cleanerPollingPeriod": 600,
    "requestsPersistence": 3600,
    "maxExecutorThreads": 100,
    "maxProverThreads": 100,
    "maxHashDBThreads": 200
}