ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0
47.51k stars 20.11k forks source link

Frequent desynchronizations - Failed to clean stale beacon headers #27204

Open PanosChtz opened 1 year ago

PanosChtz commented 1 year ago

System information

Geth version: 1.11.5 CL client & version: nimbus 23.3.2 OS & Version: Ubuntu 20.04 LTS

Expected behaviour

Geth remaining in sync.

Actual behaviour

Desynchronizing and missing attestations very frequently, counted 13 times over the last 48 hours

Steps to reproduce the behaviour

N/Α

Backtrace

Some recent logs when this desync happened:

Apr 26 12:51:40 validator geth[1732613]: INFO [04-26|12:51:40.374] Imported new potential chain segment     number=17,130,514 hash=a61584..9bbf0b blocks=1 txs=264  mgas=22.084  elapsed=488.564ms    mgasps=45.201  dirty=510.78MiB
Apr 26 12:51:47 validator geth[1732613]: INFO [04-26|12:51:47.082] Chain head was updated                   number=17,130,514 hash=a61584..9bbf0b root=7f17cf..f52061 elapsed=28.596779ms
Apr 26 12:52:49 validator geth[1732613]: WARN [04-26|12:52:49.548] Ignoring payload with missing parent     number=17,130,520 hash=96fe98..1b2575 parent=1c7ddb..15cd97
Apr 26 12:52:50 validator geth[1732613]: INFO [04-26|12:52:50.319] Forkchoice requested sync to new head    number=17,130,520 hash=96fe98..1b2575 finalized=unknown
Apr 26 12:52:50 validator geth[1732613]: WARN [04-26|12:52:50.321] Beacon chain gapped                      head=17,130,471 newHead=17,130,520
Apr 26 12:52:50 validator geth[1732613]: INFO [04-26|12:52:50.573] Syncing beacon headers                   downloaded=320 left=0 eta=0s
Apr 26 12:52:50 validator geth[1732613]: ERROR[04-26|12:52:50.578] Failed to clean stale beacon headers     err="filled header below beacon header tail: 17130471 < 17130515"
Apr 26 12:52:50 validator geth[1732613]: WARN [04-26|12:52:50.607] Retrieved pivot header from local        number=17,130,456 hash=c7327e..dec849 latest=17,130,520 oldest=17,130,515
Apr 26 12:52:54 validator geth[1732613]: INFO [04-26|12:52:54.650] Imported new chain segment               number=17,130,520 hash=96fe98..1b2575 blocks=6 txs=885  mgas=74.090  elapsed=3.575s       mgasps=20.719  dirty=510.28MiB
...
Apr 26 17:57:37 validator geth[1732613]: WARN [04-26|17:57:37.863] Ignoring payload with missing parent     number=17,132,032 hash=b04aa1..b5cc33 parent=dfb7ec..0261ed
Apr 26 17:57:39 validator geth[1732613]: INFO [04-26|17:57:39.179] Forkchoice requested sync to new head    number=17,132,032 hash=b04aa1..b5cc33 finalized=unknown
Apr 26 17:57:39 validator geth[1732613]: WARN [04-26|17:57:39.182] Beacon chain gapped                      head=17,131,918 newHead=17,132,032
Apr 26 17:57:40 validator geth[1732613]: INFO [04-26|17:57:40.626] Syncing beacon headers                   downloaded=339 left=0 eta=0s
Apr 26 17:57:40 validator geth[1732613]: ERROR[04-26|17:57:40.633] Failed to clean stale beacon headers     err="filled header below beacon header tail: 17131918 < 17132024"
Apr 26 17:57:41 validator geth[1732613]: WARN [04-26|17:57:41.012] Retrieved pivot header from local        number=17,131,968 hash=94a8f5..d750eb latest=17,132,032 oldest=17,132,024
Apr 26 17:57:45 validator geth[1732613]: INFO [04-26|17:57:45.233] Chain reorg detected                     number=17,132,022 hash=5dbf2d..f3ff48 drop=1 dropfrom=38e33a..324257 add=2 addfrom=7d90b5..f816fb
Apr 26 17:57:49 validator geth[1732613]: WARN [04-26|17:57:49.617] Peer delivering stale transactions       peer=74f473c67422999be06eaedf8f327f9ebe0b66d3e7241ac47d67f65e363bf9e7 rejected=128
...
Apr 27 11:41:21 validator geth[1732613]: INFO [04-27|11:41:21.147] Imported new chain segment               number=17,137,276 hash=71b671..41061a blocks=4 txs=578  mgas=62.225  elapsed=5.150s       mgasps=12.082  dirty=511.06MiB
Apr 27 11:41:23 validator geth[1732613]: INFO [04-27|11:41:23.480] Forkchoice requested sync to new head    number=17,137,279 hash=7a119b..b15ec0 finalized=unknown
Apr 27 11:41:23 validator geth[1732613]: WARN [04-27|11:41:23.522] Beacon chain gapped                      head=17,137,277 newHead=17,137,279
Apr 27 11:41:23 validator geth[1732613]: ERROR[04-27|11:41:23.736] Cleaning spurious beacon sync leftovers  stale_head_1=17,137,277 stale_tail_1=17,137,276 stale_next_1=07b2bd..52529d
Apr 27 11:41:23 validator geth[1732613]: INFO [04-27|11:41:23.738] Syncing beacon headers                   downloaded=388 left=0 eta=0s
Apr 27 11:41:23 validator geth[1732613]: ERROR[04-27|11:41:23.740] Failed to clean stale beacon headers     err="filled header below beacon header tail: 17137276 < 17137277"
Apr 27 11:41:23 validator geth[1732613]: WARN [04-27|11:41:23.868] Retrieved pivot header from local        number=17,137,215 hash=8f4ae1..3415d9 latest=17,137,279 oldest=17,137,277
Apr 27 11:41:27 validator geth[1732613]: WARN [04-27|11:41:27.399] Ignoring already known beacon payload    number=17,137,278 hash=ff09b8..2a3b38 age=28s
Apr 27 11:41:28 validator geth[1732613]: INFO [04-27|11:41:28.845] Forkchoice requested sync to new head    number=17,137,280 hash=453a08..82fb0c finalized=unknown
...
May 01 15:45:39 validator geth[1732613]: INFO [05-01|15:45:39.001] Forkchoice requested sync to new head    number=17,166,952 hash=71c82e..10dbb1 finalized=unknown
May 01 15:45:39 validator geth[1732613]: WARN [05-01|15:45:39.007] Beacon chain gapped                      head=17,166,936 newHead=17,166,952
May 01 15:45:39 validator geth[1732613]: INFO [05-01|15:45:39.487] Syncing beacon headers                   downloaded=575 left=0 eta=0s
May 01 15:45:39 validator geth[1732613]: ERROR[05-01|15:45:39.496] Failed to clean stale beacon headers     err="filled header below beacon header tail: 17166936 < 17166946"
May 01 15:45:39 validator geth[1732613]: WARN [05-01|15:45:39.549] Retrieved pivot header from local        number=17,166,888 hash=827ea6..7fda1f latest=17,166,952 oldest=17,166,946
May 01 15:45:49 validator geth[1732613]: INFO [05-01|15:45:49.468] Forkchoice requested sync to new head    number=17,166,953 hash=c929ad..97e3c9 finalized=unknown
PanosChtz commented 1 year ago

A log from nimbus:

May 01 15:44:11 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:11.001+00:00 Slot start                                 topics="beacnde" slot=6344319 epoch=198259 sync=synced peers=30 head=66e786ed:6344318 finalized=198257:9bd7d165 delay=65us165ns
May 01 15:44:23 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:23.392+00:00 State replayed                             topics="chaindag" blocks=0 slots=1 current=a1d12164:6344288@6344320 ancestor=169a2994:6344319 target=169a2994:6344319@6344320 ancestorStateRoot=f6301e52 targetStateRoot=603441af found=true assignDur=5s487ms702us197ns replayDur=2s225ms51us868ns
May 01 15:44:24 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:24.600+00:00 Missed multiple heartbeats                 topics="libp2p gossipsub" heartbeat=GossipSub delay=8s858ms750us978ns hinterval=700ms
May 01 15:44:24 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:24.802+00:00 Slot end                                   topics="beacnde" slot=6344319 nextActionWait=34s208ms95us148ns nextAttestationSlot=6344323 nextProposalSlot=-1 syncCommitteeDuties=none head=169a2994:6344319
May 01 15:44:25 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:25.071+00:00 Slot start                                 topics="beacnde" slot=6344320 epoch=198260 sync=synced peers=30 head=169a2994:6344319 finalized=198257:9bd7d165 delay=2s65ms617us214ns
May 01 15:44:26 validator nimbus_beacon_node[1234080]: WRN 2023-05-01 15:44:26.023+00:00 Failed to exchange transition configuration topics="elmon" url=http://127.0.0.1:8551 err=Timeout
May 01 15:44:46 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:45.943+00:00 State replayed                             topics="chaindag" blocks=0 slots=1 current=169a2994:6344319 ancestor=169a2994:6344319 target=169a2994:6344319@6344320 ancestorStateRoot=f6301e52 targetStateRoot=603441af found=true assignDur=34us278ns replayDur=19s122ms667us996ns
May 01 15:44:48 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:48.069+00:00 State replayed                             topics="chaindag" blocks=0 slots=0 current=169a2994:6344319 ancestor=c5057235:6344320 target=c5057235:6344320 ancestorStateRoot=474f4bb2 targetStateRoot=474f4bb2 found=true assignDur=706ms8us577ns replayDur=19ms742us223ns
May 01 15:44:49 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:49.177+00:00 Missed multiple heartbeats                 topics="libp2p gossipsub" heartbeat=GossipSub delay=21s77ms216us911ns hinterval=700ms
May 01 15:44:49 validator nimbus_beacon_node[1234080]: WRN 2023-05-01 15:44:49.251+00:00 Discovered new external address but ENR auto update is off topics="eth p2p discv5" majority=some(100.36.126.155:9000) previous=none(Address)
May 01 15:44:50 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:49.999+00:00 Slot end                                   topics="beacnde" slot=6344320 nextActionWait=9s4ms625us264ns nextAttestationSlot=6344323 nextProposalSlot=-1 syncCommitteeDuties=none head=c5057235:6344320
May 01 15:44:50 validator nimbus_beacon_node[1234080]: NOT 2023-05-01 15:44:50.192+00:00 Missed expected slot start, catching up    topics="beacnde" delay=15s192ms947us99ns curSlot=6344320 nextSlot=6344320
May 01 15:44:50 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:50.193+00:00 Slot start                                 topics="beacnde" slot=6344322 epoch=198260 sync=synced/opt peers=30 head=c5057235:6344320 finalized=198258:777a66f0 delay=15s192ms947us99ns
May 01 15:44:50 validator nimbus_beacon_node[1234080]: INF 2023-05-01 15:44:50.227+00:00 Execution client not in sync; skipping validator duties for now topics="beacval" slot=6344322 headSlot=6344320
czepluch commented 1 year ago

Did you fix this? I'm experiencing the same issue currently. But very frequently. For me it probably happens every 20 minutes that I get out of sync like this and then geth catches up for a couple of epochs and then it happens again.

czepluch commented 1 year ago

So it seems that everything works for around 15 minutes at a time and then geth starts falling behind. It looks like at some point it takes 10+ seconds to import the next block and then this behavior persists for around 10 minutes or so until geth again catches up.

This is what the logs look like when geth starts falling behind:

INFO [07-10|19:40:00.674] Imported new potential chain segment     number=17,665,394 hash=214583..b18900 blocks=1 txs=167  mgas=12.536 elapsed=401.459ms    mgasps=31.226  dirty=483.96MiB
INFO [07-10|19:40:00.701] Chain head was updated                   number=17,665,394 hash=214583..b18900 root=a0866f..550c3d elapsed=7.428631ms
INFO [07-10|19:40:12.151] Imported new potential chain segment     number=17,665,395 hash=9ec4d7..c2fe0f blocks=1 txs=184  mgas=18.248 elapsed=274.441ms    mgasps=66.492  dirty=484.48MiB
INFO [07-10|19:40:12.175] Chain head was updated                   number=17,665,395 hash=9ec4d7..c2fe0f root=b7ba48..8a0030 elapsed=8.338737ms
INFO [07-10|19:40:25.321] Imported new potential chain segment     number=17,665,396 hash=10bb49..0f011d blocks=1 txs=124  mgas=10.515 elapsed=963.331ms    mgasps=10.915  dirty=484.26MiB
INFO [07-10|19:40:25.340] Chain head was updated                   number=17,665,396 hash=10bb49..0f011d root=44f4f1..1e1484 elapsed=4.572111ms
INFO [07-10|19:40:50.035] Imported new potential chain segment     number=17,665,397 hash=36d600..d454c3 blocks=1 txs=137  mgas=13.922 elapsed=13.692s      mgasps=1.017   dirty=484.70MiB
WARN [07-10|19:40:50.038] Ignoring already known beacon payload    number=17,665,397 hash=36d600..d454c3 age=15s
INFO [07-10|19:40:50.361] Chain head was updated                   number=17,665,397 hash=36d600..d454c3 root=baf3ae..bd0970 elapsed=3.34156ms
INFO [07-10|19:41:09.321] Imported new potential chain segment     number=17,665,398 hash=029e4e..4bffe7 blocks=1 txs=146  mgas=22.012 elapsed=17.957s      mgasps=1.226   dirty=484.29MiB
WARN [07-10|19:41:09.325] Ignoring already known beacon payload    number=17,665,398 hash=029e4e..4bffe7 age=22s
WARN [07-10|19:41:09.329] Ignoring already known beacon payload    number=17,665,398 hash=029e4e..4bffe7 age=22s
WARN [07-10|19:41:12.582] Ignoring already known beacon payload    number=17,665,398 hash=029e4e..4bffe7 age=25s
INFO [07-10|19:41:12.796] Chain head was updated                   number=17,665,398 hash=029e4e..4bffe7 root=3b9275..098783 elapsed=11.192582ms
INFO [07-10|19:41:33.223] Imported new potential chain segment     number=17,665,399 hash=876543..d7bb05 blocks=1 txs=133  mgas=11.539 elapsed=20.200s      mgasps=0.571   dirty=484.44MiB
WARN [07-10|19:41:33.225] Ignoring already known beacon payload    number=17,665,399 hash=876543..d7bb05 age=34s
WARN [07-10|19:41:33.226] Ignoring already known beacon payload    number=17,665,399 hash=876543..d7bb05 age=34s
WARN [07-10|19:41:34.839] Ignoring already known beacon payload    number=17,665,399 hash=876543..d7bb05 age=35s
INFO [07-10|19:41:35.070] Chain head was updated                   number=17,665,399 hash=876543..d7bb05 root=4974f0..b5a37e elapsed=4.50465ms
INFO [07-10|19:41:53.187] Imported new potential chain segment     number=17,665,400 hash=9cc0ce..d3c5ea blocks=1 txs=136  mgas=10.196 elapsed=13.923s      mgasps=0.732   dirty=484.79MiB
WARN [07-10|19:41:53.188] Ignoring already known beacon payload    number=17,665,400 hash=9cc0ce..d3c5ea age=42s
INFO [07-10|19:41:53.197] Chain head was updated                   number=17,665,400 hash=9cc0ce..d3c5ea root=553572..ec127c elapsed=2.357216ms
INFO [07-10|19:42:09.077] Imported new potential chain segment     number=17,665,401 hash=1405fc..4948e6 blocks=1 txs=154  mgas=11.435 elapsed=15.572s      mgasps=0.734   dirty=482.97MiB
WARN [07-10|19:42:09.079] Ignoring already known beacon payload    number=17,665,401 hash=1405fc..4948e6 age=46s
INFO [07-10|19:42:09.089] Chain head was updated                   number=17,665,401 hash=1405fc..4948e6 root=60bf1b..3f14f0 elapsed=2.742464ms