ethereum-optimism / op-geth

GNU Lesser General Public License v3.0
255 stars 654 forks source link

After pruning, op-geth prompts an error log "missing trie node" #220

Closed welkin22 closed 5 months ago

welkin22 commented 5 months ago

Hi, team, I am the developer of opBNB. I am trying to execute the pruning function of op-geth, because as the data increases, our node's disk usage is getting larger and larger. I noticed that op-geth retains relevant code about pruning from geth, so I thought it should be able to work directly. However, when I tried to execute the pruning command and then restart the node, the node reported an error: "State not available, ignoring new payload", and the block height was stuck. I use optimism in the latest code on the optimism branch, start devnet locally, and try to reproduce this issue. I found that the problem can be reproduced, but the error message is different. op-geth will directly report an error: "missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"

System information

Geth version: optimism branch(commit hash: ec72ec6a285e85761a21d6700af8c52513fed9eb) CL client & version: optimism b4c313da5c93e8c5109954ef3b2ec4f67a913579 OS & Version: macOS Commit hash : ec72ec6a285e85761a21d6700af8c52513fed9eb

Expected behaviour

After executing the pruning command, the node is able to run normally.

Actual behaviour

"missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found" is reported in the op-geth log, and the block height is stuck.

Steps to reproduce the behaviour

  1. Shut down the nodes that need to be pruned.
  2. Execute the pruning command: geth snapshot prune-state --datadir {the data dir of your node}.
  3. See the log of successful pruning: "State pruning successful".
  4. Start the nodes and notice that op-geth will roll back the unsafe block height by 127 blocks and continue to produce blocks based on this.
  5. After the unsafe height produces 1 block, the second block will be stuck and continuously print the error log: "missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found".

Backtrace

l2-1  | + VERBOSITY=3
l2-1  | + GETH_DATA_DIR=/db
l2-1  | + GETH_CHAINDATA_DIR=/db/geth/chaindata
l2-1  | + GENESIS_FILE_PATH=/genesis.json
l2-1  | + cat /genesis.json
l2-1  | + jq -r .config.chainId
l2-1  | + CHAIN_ID=901
l2-1  | + RPC_PORT=8545
l2-1  | + WS_PORT=8546
l2-1  | + '[' '!' -d /db/geth/chaindata ]
l2-1  | + echo '/db/geth/chaindata exists.'
l2-1  | /db/geth/chaindata exists.
l2-1  | + exec geth '--datadir=/db' '--verbosity=3' --http '--http.corsdomain=*' '--http.vhosts=*' '--http.addr=0.0.0.0' '--http.port=8545' '--http.api=web3,debug,eth,txpool,net,engine' --ws '--ws.addr=0.0.0.0' '--ws.port=8546' '--ws.origins=*' '--ws.api=debug,eth,txpool,net,engine' '--syncmode=full' --nodiscover '--maxpeers=0' '--networkid=901' --rpc.allow-unprotected-txs '--authrpc.addr=0.0.0.0' '--authrpc.port=8551' '--authrpc.vhosts=*' '--authrpc.jwtsecret=/config/jwt-secret.txt' '--gcmode=archive' --metrics '--metrics.addr=0.0.0.0' '--metrics.port=6060' '--authrpc.jwtsecret=/config/test-jwt-secret.txt'
l2-1  | INFO [01-23|09:26:34.108] Config environment variable found        envvar=GETH_MINER_RECOMMIT
l2-1  | INFO [01-23|09:26:34.113] Enabling metrics collection 
l2-1  | INFO [01-23|09:26:34.113] Enabling stand-alone metrics HTTP endpoint address=0.0.0.0:6060
l2-1  | INFO [01-23|09:26:34.113] Starting metrics server                  addr=http://0.0.0.0:6060/debug/metrics
l2-1  | INFO [01-23|09:26:34.114] Maximum peer count                       ETH=0 LES=0 total=0
l2-1  | INFO [01-23|09:26:34.114] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
l2-1  | INFO [01-23|09:26:34.116] Enabling recording of key preimages since archive mode is used 
l2-1  | WARN [01-23|09:26:34.116] Disabled transaction unindexing for archive node 
l2-1  | INFO [01-23|09:26:34.116] Set global gas cap                       cap=50,000,000
l2-1  | INFO [01-23|09:26:34.116] Initializing the KZG library             backend=gokzg
l2-1  | INFO [01-23|09:26:34.139] Allocated trie memory caches             clean=307.00MiB dirty=0.00B
l2-1  | INFO [01-23|09:26:34.139] Using pebble as the backing database 
l2-1  | INFO [01-23|09:26:34.139] Allocated cache and file handles         database=/db/geth/chaindata cache=512.00MiB handles=524,288
l2-1  | INFO [01-23|09:26:34.149] Opened ancient database                  database=/db/geth/chaindata/ancient/chain readonly=false
l2-1  | INFO [01-23|09:26:34.150] State scheme set to already existing     scheme=hash
l2-1  | INFO [01-23|09:26:34.159]  
l2-1  | INFO [01-23|09:26:34.159] --------------------------------------------------------------------------------------------------------------------------------------------------------- 
l2-1  | INFO [01-23|09:26:34.159] Chain ID:  901 (unknown) 
l2-1  | INFO [01-23|09:26:34.159] Consensus: Optimism 
l2-1  | INFO [01-23|09:26:34.159]  
l2-1  | INFO [01-23|09:26:34.159] Pre-Merge hard forks (block based): 
l2-1  | INFO [01-23|09:26:34.159]  - Homestead:                   #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/homestead.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Tangerine Whistle (EIP 150): #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/tangerine-whistle.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Spurious Dragon/1 (EIP 155): #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/spurious-dragon.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Spurious Dragon/2 (EIP 158): #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/spurious-dragon.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Byzantium:                   #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/byzantium.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Constantinople:              #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/constantinople.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Petersburg:                  #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/petersburg.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Istanbul:                    #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/istanbul.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Muir Glacier:                #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/muir-glacier.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Berlin:                      #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/berlin.md) 
l2-1  | INFO [01-23|09:26:34.159]  - London:                      #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/london.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Arrow Glacier:               #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/arrow-glacier.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Gray Glacier:                #0        (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/gray-glacier.md) 
l2-1  | INFO [01-23|09:26:34.159]  
l2-1  | INFO [01-23|09:26:34.159] Merge configured: 
l2-1  | INFO [01-23|09:26:34.159]  - Hard-fork specification:    https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/paris.md 
l2-1  | INFO [01-23|09:26:34.159]  - Network known to be merged: true 
l2-1  | INFO [01-23|09:26:34.159]  - Total terminal difficulty:  0 
l2-1  | INFO [01-23|09:26:34.159]  - Merge netsplit block:       #0        
l2-1  | INFO [01-23|09:26:34.159]  
l2-1  | INFO [01-23|09:26:34.159] Post-Merge hard forks (timestamp based): 
l2-1  | INFO [01-23|09:26:34.159]  - Shanghai:                    @0          (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/shanghai.md) 
l2-1  | INFO [01-23|09:26:34.159]  - Regolith:                    @0          
l2-1  | INFO [01-23|09:26:34.159]  - Canyon:                      @0          
l2-1  | INFO [01-23|09:26:34.159]  
l2-1  | INFO [01-23|09:26:34.159] --------------------------------------------------------------------------------------------------------------------------------------------------------- 
l2-1  | INFO [01-23|09:26:34.159]  
l2-1  | INFO [01-23|09:26:34.160] Loaded most recent local block           number=1674 hash=ef8c6c..047f39 td=0 age=1m31s
l2-1  | INFO [01-23|09:26:34.160] Loaded most recent local finalized block number=0    hash=25888b..7cc416 td=0 age=57m19s
l2-1  | WARN [01-23|09:26:34.160] Head state missing, repairing            number=1674 hash=ef8c6c..047f39 snaproot=9f7e0e..950757
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local header          number=1674 hash=ef8c6c..047f39 td=0 age=1m31s
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local block           number=1547 hash=ea27c1..e1ea55 td=0 age=5m45s
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local snap block      number=1674 hash=ef8c6c..047f39 td=0 age=1m31s
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local finalized block number=0    hash=25888b..7cc416 td=0 age=57m19s
l2-1  | WARN [01-23|09:26:34.165] Enabling snapshot recovery               chainhead=1547 diskbase=1547
l2-1  | INFO [01-23|09:26:34.165] Initialising Ethereum protocol           network=901 dbversion=8
l2-1  | INFO [01-23|09:26:34.165] Initialized transaction indexer          limit=0
l2-1  | INFO [01-23|09:26:34.165] Loaded local transaction journal         transactions=0 dropped=0
l2-1  | INFO [01-23|09:26:34.165] Regenerated local transaction journal    transactions=0 accounts=0
l2-1  | INFO [01-23|09:26:34.166] Chain post-merge, sync via beacon client 
l2-1  | INFO [01-23|09:26:34.166] Unprotected transactions allowed 
l2-1  | INFO [01-23|09:26:34.166] Gasprice oracle is ignoring threshold set threshold=2
l2-1  | WARN [01-23|09:26:34.166] Engine API enabled                       protocol=eth
l2-1  | INFO [01-23|09:26:34.166] Starting peer-to-peer node               instance=Geth/v0.1.0-unstable-ec72ec6a-20240122/linux-arm64/go1.21.6
l2-1  | INFO [01-23|09:26:34.173] IPC endpoint opened                      url=/db/geth.ipc
l2-1  | INFO [01-23|09:26:34.173] New local node record                    seq=1,705,998,567,560 id=d19a3e268ba2155d ip=127.0.0.1 udp=0 tcp=30303
l2-1  | INFO [01-23|09:26:34.173] Started P2P networking                   self="enode://3dfcd24f82706c6138b6565c1039ac25d4f73143b2bf7fa68b59f974957970d0c80c3f156823292bb2ac78b55e3511e21828ad38186f9c53986242f902a07f58@127.0.0.1:30303?discport=0"
l2-1  | INFO [01-23|09:26:34.173] Loaded JWT secret file                   path=/config/test-jwt-secret.txt crc32=0x890e18e0
l2-1  | INFO [01-23|09:26:34.173] HTTP server started                      endpoint=[::]:8545 auth=false prefix= cors=* vhosts=*
l2-1  | INFO [01-23|09:26:34.174] WebSocket enabled                        url=ws://[::]:8546
l2-1  | INFO [01-23|09:26:34.174] WebSocket enabled                        url=ws://[::]:8551
l2-1  | INFO [01-23|09:26:34.174] HTTP server started                      endpoint=[::]:8551 auth=true  prefix= cors=localhost vhosts=*
l2-1  | INFO [01-23|09:26:36.627] Starting work on payload                 id=0xf1a827e46744b1a9
l2-1  | INFO [01-23|09:26:36.628] Updated payload                          id=0xf1a827e46744b1a9 number=1548 hash=902bfe..25e188 txs=1 withdrawals=0 gas=47617 fees=0 root=e6990f..b2d60c elapsed=1.324ms
l2-1  | INFO [01-23|09:26:36.628] Stopping work on payload                 id=0xf1a827e46744b1a9 reason=delivery elapsed=1
l2-1  | WARN [01-23|09:26:36.629] Ignoring already known beacon payload    number=1548 hash=902bfe..25e188 age=5m45s
l2-1  | INFO [01-23|09:26:36.650] Starting work on payload                 id=0x1ba3ea315e748ab3
l2-1  | ERROR[01-23|09:26:36.650] Failed to create sealing context         err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | WARN [01-23|09:26:36.650] Error building payload update            id=0x1ba3ea315e748ab3 err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | ERROR[01-23|09:26:36.751] Failed to create sealing context         err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | WARN [01-23|09:26:36.751] Error building payload update            id=0x1ba3ea315e748ab3 err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | ERROR[01-23|09:26:36.751] Error building any payload               id=0x1ba3ea315e748ab3 err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | WARN [01-23|09:26:36.751] Served engine_getPayloadV3               conn=192.168.167.4:35002 reqid=1597 duration=100.529029ms err="Unknown payload"
l2-1  | WARN [01-23|09:26:36.751] Served engine_getPayloadV3               conn=192.168.167.4:35002 reqid=1598 duration="15.667µs"   err="Unknown payload"
l2-1  | INFO [01-23|09:26:36.751] Stopping work on payload                 id=0x1ba3ea315e748ab3 reason=delivery elapsed=101

I have identified the cause of the problem based on code analysis. When pruning, in order to prevent the impact of chain reorganization, the target block height for pruning is set to the bottom-most layer of difflayer by default. This conclusion can also be obtained from the pruning log:

INFO [01-23|09:25:44.052] Selecting bottom-most difflayer as the pruning target root=9f7e0e..950757 height=1547

Therefore, after pruning, only the state data of block 1547 is retained, and the state data of other blocks is deleted. After restarting op-geth, it found that the latest block height 1674 is missing state data, and will start to roll back the chain head until it rolls back to 1547. This can be seen from the following log:

l2-1  | WARN [01-23|09:26:34.160] Head state missing, repairing            number=1674 hash=ef8c6c..047f39 snaproot=9f7e0e..950757
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local header          number=1674 hash=ef8c6c..047f39 td=0 age=1m31s
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local block           number=1547 hash=ea27c1..e1ea55 td=0 age=5m45s
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local snap block      number=1674 hash=ef8c6c..047f39 td=0 age=1m31s
l2-1  | INFO [01-23|09:26:34.164] Loaded most recent local finalized block number=0    hash=25888b..7cc416 td=0 age=57m19s
l2-1  | WARN [01-23|09:26:34.165] Enabling snapshot recovery               chainhead=1547 diskbase=1547

Next, we can see 1548 blocks of high processing logs:

l2-1  | INFO [01-23|09:26:36.627] Starting work on payload                 id=0xf1a827e46744b1a9
l2-1  | INFO [01-23|09:26:36.628] Updated payload                          id=0xf1a827e46744b1a9 number=1548 hash=902bfe..25e188 txs=1 withdrawals=0 gas=47617 fees=0 root=e6990f..b2d60c elapsed=1.324ms
l2-1  | INFO [01-23|09:26:36.628] Stopping work on payload                 id=0xf1a827e46744b1a9 reason=delivery elapsed=1
l2-1  | WARN [01-23|09:26:36.629] Ignoring already known beacon payload    number=1548 hash=902bfe..25e188 age=5m45s

Although the status data of 1548 has been deleted, its header, body, and other data are still present. Therefore, the newPayload method will ignore the 1548 block and print the following log:

l2-1  | WARN [01-23|09:26:36.629] Ignoring already known beacon payload    number=1548 hash=902bfe..25e188 age=5m45s

The Code at: https://github.com/ethereum-optimism/op-geth/blob/ec72ec6a285e85761a21d6700af8c52513fed9eb/eth/catalyst/api.go#L540 Therefore, when executing block 1549, the forkchoiceUpdated method threw an error:

l2-1  | INFO [01-23|09:26:36.650] Starting work on payload                 id=0x1ba3ea315e748ab3
l2-1  | ERROR[01-23|09:26:36.650] Failed to create sealing context         err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | WARN [01-23|09:26:36.650] Error building payload update            id=0x1ba3ea315e748ab3 err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | ERROR[01-23|09:26:36.751] Failed to create sealing context         err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | WARN [01-23|09:26:36.751] Error building payload update            id=0x1ba3ea315e748ab3 err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"
l2-1  | ERROR[01-23|09:26:36.751] Error building any payload               id=0x1ba3ea315e748ab3 err="missing trie node e6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c (path ) state 0xe6990f13932a9ac0e593d193129c3d792f2181cbb8c63f18ffa32d1d3eb2d60c is not available, not found"

This is because in api.eth.Miner().BuildPayload(args), the logic for creating the context environment for building block height is called, and the specific code is at: https://github.com/ethereum-optimism/op-geth/blob/ec72ec6a285e85761a21d6700af8c52513fed9eb/miner/worker.go#L743 If the parent block is missing the state, then the context cannot be built and an error will be returned. The reason my version reports "State not available, ignoring new payload" is because of this condition: if historicalBackend, ok := w.eth.(BackendWithHistoricalState); ok {, which is equal to true in my case. Although in my version, historicalBackend.StateAtBlock will try to re-execute the missing parent block to obtain the stateDB structure, there is a bug here: https://github.com/ethereum-optimism/op-geth/blob/ec72ec6a285e85761a21d6700af8c52513fed9eb/miner/worker.go#L749 The Copy method on this line does not perform a deep copy of state.Database, so the next line's release() will delete the reconstructed state data from the dirties cache, causing an error in the execution of block 1549. This error will not cause the block to fail, but it will result in a different state hash and block hash after the block is executed. Eventually, when the code reaches https://github.com/ethereum-optimism/op-geth/blob/ec72ec6a285e85761a21d6700af8c52513fed9eb/eth/catalyst/api.go#L586, an error will be thrown.

My questions are:

  1. Do you have any plans for pruning support? Do you have any suggestions for fixing the issues I encountered during testing?
  2. Could you help me confirm if the fail reasons I have identified are correct?