erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0
3.1k stars 1.09k forks source link

astrid: stage integration fix for canonical hashes on unwinds #11887

Closed taratorio closed 1 week ago

taratorio commented 1 week ago

Previously I tried to re-use fixCanonicalChain in the astrid stage for handling fork choice update however after more testing I realised that was wrong since it causes the below issue upon unwinds:

INFO[09-05|08:27:38.289] [4/6 Execution] Unwind Execution         from=11588734 to=11588733
EROR[09-05|08:27:38.289] Staged Sync                              err="[4/6 Execution] domains.GetDiffset(11588734, 0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb): not found"

That is due the fact that fixCanonicalChain updates the canonical hash as it traverses the header chain backwards. In the context of update fork choice, this is something that should be done only after the Unwind has succeeded, otherwise we will get the above error when unwinding execution.

This also causes the 2nd error below which is a result of the Execution Unwind failing and rolling back the RwTx of the stage loop (insert block changes get lost). This situation will be further stabilised when working on https://github.com/erigontech/erigon/issues/11533

This PR fixes the first problem by creating a new function connectTip which traverses the header chain backwards, collects new nodes and bad nodes but does not update the canonical hash while doing so. Instead the canonical hashes get updated in updateForkChoiceForward after the unwind has been successfully executed.

Full Logs

INFO[09-05|08:27:36.210] [2/6 PolygonSync] forward                progress=11588734
DBUG[09-05|08:27:38.222] [bridge] processing new blocks           from=11588729 to=11588729 lastProcessedBlockNum=11588720 lastProcessedBlockTime=1725521200 lastProcessedEventID=2702
DBUG[09-05|08:27:38.222] [sync] inserted blocks                   len=1 duration=1.882458ms
DBUG[09-05|08:27:38.286] [bridge] processing new blocks           from=11588734 to=11588734 lastProcessedBlockNum=11588720 lastProcessedBlockTime=1725521200 lastProcessedEventID=2702
DBUG[09-05|08:27:38.287] [sync] inserted blocks                   len=1 duration=945.75µs
DBUG[09-05|08:27:38.287] [bor.heimdall] synchronizing spans...    blockNum=11588734
DBUG[09-05|08:27:38.287] [bridge] synchronizing events...         blockNum=11588734 lastProcessedBlockNum=11588720
INFO[09-05|08:27:38.287] [2/6 PolygonSync] update fork choice     block=11588734 age=1s hash=0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb
INFO[09-05|08:27:38.287] [2/6 PolygonSync] new fork - unwinding and caching fork choice unwindNumber=11588733 badHash=0x3e1f67072996aec05806d298de3bb281bdcf23566da0dc254c4670ac385768d4 cachedTipNumber=115
88734 cachedTipHash=0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb cachedNewNodes=1
DBUG[09-05|08:27:38.289] UnwindTo                                 block=11588733 block_hash=0x3e1f67072996aec05806d298de3bb281bdcf23566da0dc254c4670ac385768d4 err=nil stack="[sync.go:171 stage_polygon_syn
c.go:1425 stage_polygon_sync.go:1379 stage_polygon_sync.go:1538 stage_polygon_sync.go:494 stage_polygon_sync.go:175 default_stages.go:479 sync.go:531 sync.go:410 stageloop.go:249 stageloop.go:101 asm_arm6
4.s:1222]"
DBUG[09-05|08:27:38.289] [2/6 PolygonSync] DONE                   in=2.078894042s
INFO[09-05|08:27:38.289] [4/6 Execution] Unwind Execution         from=11588734 to=11588733
EROR[09-05|08:27:38.289] Staged Sync                              err="[4/6 Execution] domains.GetDiffset(11588734, 0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb): not found"
INFO[09-05|08:27:38.792] [4/6 Execution] Unwind Execution         from=11588734 to=11588733
INFO[09-05|08:27:38.792] aggregator unwind                        step=24 txUnwindTo=38128876 stepsRangeInDB="accounts:0.4, storage:0.4, code:0.4, commitment:0.0, logaddrs: 0.4, logtopics: 0.4, tracesfrom: 0.4, tracesto: 0.4"
DBUG[09-05|08:27:38.818] [1/6 OtterSync] DONE                     in=2.958µs
INFO[09-05|08:27:38.818] [2/6 PolygonSync] forward                progress=11588733
INFO[09-05|08:27:38.818] [2/6 PolygonSync] new fork - processing cached fork choice after unwind cachedTipNumber=11588734 cachedTipHash=0x04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb cachedNewNodes=1
DBUG[09-05|08:27:39.083] [bor.heimdall] block producers tracker observing span id=1812
DBUG[09-05|08:27:43.532] Error while executing stage              err="[2/6 PolygonSync] stopped: parent's total difficulty not found with hash 04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb and height 11588734: <nil>"
EROR[09-05|08:27:43.532] [2/6 PolygonSync] stopping node          err="parent's total difficulty not found with hash 04f1528479c5efae05ac05e38e1402c4e59155049ff44ce6bf5302acb2c25fdb and height 11588734: <nil>"
DBUG[09-05|08:27:43.534] rpcdaemon: the subscription to pending blocks channel was closed 
INFO[09-05|08:27:43.534] Exiting... 
INFO[09-05|08:27:43.535] HTTP endpoint closed                     url=127.0.0.1:8545
INFO[09-05|08:27:43.535] RPC server shutting down