OffchainLabs / nitro

Nitro goes vroom and fixes everything

debug calls leading to ledger corruption #2116

Open avinashbo opened 6 months ago

avinashbo commented 6 months ago

Describe the bug

Debug calls are corrupting the node's ledger and halting the sync. The only recovery option is to restore the node from a recent backup, only to find it corrupted again after a few hours of serving RPC calls.

To Reproduce

We are still trying to learn how to reproduce the issue consistently, but debug_trace* calls always precede the ledger corruption.

Expected behavior

The node syncs fine without any problems even while serving debug calls.

Screenshots

Some logs

WARN [01-22|19:20:10.802] Served debug_traceBlockByNumber          conn=10.100.12.31:36500 reqid=36      duration="23.565µs"   err="block #173110800 not found"
WARN [01-22|19:20:10.802] Served debug_traceBlockByNumber          conn=10.100.12.31:36500 reqid=38      duration="25.21µs"    err="block #173110801 not found"
WARN [01-22|19:20:10.805] feedOneMsg failed to send message to execEngine err="commit aborted due to earlier error: missing trie node fbc6038e840526c46ed6dfe7cda65badd82880a02849e9aa34cf2d5caebf21c1 (owner ba0a0230d172632e2e1f3536a62be4efd9a43e2b1e1cb79a137ce41e9835439f) (path ) <nil>" pos=150902968
WARN [01-22|19:20:10.870] Trie prefetcher failed opening trie      root=fbc603..bf21c1 err="missing trie node fbc6038e840526c46ed6dfe7cda65badd82880a02849e9aa34cf2d5caebf21c1 (owner ba0a0230d172632e2e1f3536a62be4efd9a43e2b1e1cb79a137ce41e9835439f) (path ) <nil>"
WARN [01-22|19:20:10.983] Trie prefetcher failed opening trie      root=fbc603..bf21c1 err="missing trie node fbc6038e840526c46ed6dfe7cda65badd82880a02849e9aa34cf2d5caebf21c1 (owner ba0a0230d172632e2e1f3536a62be4efd9a43e2b1e1cb79a137ce41e9835439f) (path ) <nil>"
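To help narrow down a reproduction, one option is to scan the node log for the first "missing trie node" error and list the debug_trace* calls served just before it. The following is a minimal sketch that assumes the log line layout shown above (level, bracketed timestamp, message); the function name, the window size, and the log file path are hypothetical.

```python
import re
from collections import deque

# Matches lines shaped like: WARN [01-22|19:20:10.802] Served debug_traceBlockByNumber ...
LOG_RE = re.compile(r"^(?P<level>[A-Z]+)\s+\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def debug_calls_before_corruption(lines, window=50):
    """Return (timestamp of first 'missing trie node' line, recent debug_trace* calls).

    The second element is a list of (timestamp, method) tuples for the last
    `window` debug_trace* calls served before that error. If no such error is
    found, the timestamp is None.
    """
    recent = deque(maxlen=window)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed layout
        msg = m.group("msg")
        if "missing trie node" in msg:
            return m.group("ts"), list(recent)
        if msg.startswith("Served debug_trace"):
            method = msg.split()[1]
            recent.append((m.group("ts"), method))
    return None, list(recent)


if __name__ == "__main__":
    # Hypothetical log location; adjust to wherever the node writes its log.
    with open("nitro.log", encoding="utf-8", errors="replace") as fh:
        ts, calls = debug_calls_before_corruption(fh)
    print("first missing trie node at:", ts)
    for call_ts, method in calls:
        print(call_ts, method)
```

This only shows which calls came before the error in the log; it does not prove they caused it.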

Additional context

avinashbo commented 5 months ago

We tried adding the flag --init.recreate-missing-state-from=1, available in v2.2.4, and scanned the entire ledger. The process ended up recreating 0 missing tries.

We also tried syncing a new node from scratch, as suggested by a few people on Discord. After a month of syncing on locally attached eNVME disks, the node is still nowhere close to the tip.

ymonye commented 4 months ago

Are you sending debug_trace* calls through the RPC or the IPC connection? We have been running a Nitro archive node since launch and have consistently made debug_traceTransaction calls without issue. Our calls, however, go through IPC, because the RPC endpoints of all Go-Ethereum-forked clients can barely handle debug_trace* load.

--ipc.path="/home/user/.arbitrum/arbitrum.ipc"
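For anyone who wants to try the IPC route, here is a minimal sketch of sending a debug_traceTransaction request over the Unix socket configured above. It assumes the node exposes standard Go-Ethereum style JSON-RPC on that socket and that the debug namespace is enabled for IPC; the helper names and the transaction hash are placeholders, not part of Nitro.

```python
import json
import socket


def build_trace_request(tx_hash, req_id=1):
    """Build a JSON-RPC 2.0 request body for debug_traceTransaction."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "debug_traceTransaction",
        "params": [tx_hash],
    }


def call_over_ipc(ipc_path, request, timeout=60.0):
    """Send one JSON-RPC request over a Unix domain socket and return the reply.

    Reads until the accumulated bytes parse as one JSON document. Re-parsing
    on every chunk is simple but slow for very large traces.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(ipc_path)
        sock.sendall(json.dumps(request).encode("utf-8"))
        buf = b""
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                raise ConnectionError("IPC socket closed before a full response arrived")
            buf += chunk
            try:
                return json.loads(buf)
            except ValueError:
                continue  # response incomplete so far; keep reading


if __name__ == "__main__":
    # Path taken from the --ipc.path flag above; the hash is a placeholder.
    ipc_path = "/home/user/.arbitrum/arbitrum.ipc"
    placeholder_tx = "0x" + "00" * 32
    reply = call_over_ipc(ipc_path, build_trace_request(placeholder_tx))
    print(json.dumps(reply, indent=2)[:2000])
```

Using the raw socket keeps the example free of third-party dependencies; a client library with IPC support would work as well.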