berachain / beacon-kit

A modular framework for building EVM consensus clients ⛵️✨
https://berachain.com

Beacond restart loop #2107

Open aditya-manit opened 4 weeks ago

aditya-manit commented 4 weeks ago

Our validator node started throwing this in a loop:

time.Sleep(0x77359400)
        runtime/time.go:285 +0xf2
github.com/cometbft/cometbft/internal/consensus.(*Reactor).queryMaj23Routine(0xc00571e240, {0x294e6b0, 0xc0055fc9c0}, 0xc0055fca90)
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/internal/consensus/reactor.go:799 +0x5b
created by github.com/cometbft/cometbft/internal/consensus.(*Reactor).AddPeer in goroutine 2514495328
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/internal/consensus/reactor.go:214 +0x1bb
goroutine 2749043931 [select]:
github.com/cometbft/cometbft/p2p.(*peer).metricsReporter(0xc014c94dd0)
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/p2p/peer.go:364 +0x12c
created by github.com/cometbft/cometbft/p2p.(*peer).OnStart in goroutine 2749043915
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/p2p/peer.go:198 +0x66
goroutine 6422858742 [select]:
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker.func1()
        github.com/cockroachdb/pebble@v1.1.1/vfs/disk_health.go:171 +0xc5
created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker in goroutine 396263
        github.com/cockroachdb/pebble@v1.1.1/vfs/disk_health.go:166 +0x58
goroutine 2128295989 [select]:
github.com/cometbft/cometbft/p2p.(*peer).metricsReporter(0xc012213a00)
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/p2p/peer.go:364 +0x12c
created by github.com/cometbft/cometbft/p2p.(*peer).OnStart in goroutine 2128293927
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/p2p/peer.go:198 +0x66
goroutine 142791532 [sleep]:
time.Sleep(0x5f5e100)
        runtime/time.go:285 +0xf2
github.com/cometbft/cometbft/internal/consensus.(*Reactor).gossipDataRoutine(0xc00571e240, {0x294e6b0, 0xc010fd4f70}, 0xc010fd5040)
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/internal/consensus/reactor.go:642 +0x24e
created by github.com/cometbft/cometbft/internal/consensus.(*Reactor).AddPeer in goroutine 6801
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/internal/consensus/reactor.go:212 +0xe7
goroutine 1452414648 [sleep]:
time.Sleep(0x77359400)
        runtime/time.go:285 +0xf2
github.com/cometbft/cometbft/internal/consensus.(*Reactor).queryMaj23Routine(0xc00571e240, {0x294e6b0, 0xc00b242a90}, 0xc00b242b60)
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/internal/consensus/reactor.go:799 +0x5b
created by github.com/cometbft/cometbft/internal/consensus.(*Reactor).AddPeer in goroutine 6801
        github.com/cometbft/cometbft@v1.0.0-rc1.0.20240806094948-2c4293ef36c4/internal/consensus/reactor.go:214 +0x1bb
goroutine 2128295799 [sleep]:
time.Sleep(0x5f5e100)

Beacond commit: dd024c5b196afe43cf871b5f825a7e371fc4542e (version: v0.2.0-alpha.8)

Reth commit: 1ba631ba9581973e7c6cadeea92cfe1802aceb4a (version: 1.1.0)

abi87 commented 4 weeks ago

Hello @aditya-manit, thanks for filing the issue. Any chance we can get the full log? I would like to check what caused the issue in the first place! Thanks

aditya-manit commented 3 weeks ago

Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: 2024-10-27T04:20:10+01:00 INFO Finalizing commit of block module=consensus height=6595637 hash=C3F9FBD5EA8E4C1F39D9CA79B603D42FF0DF55802D43716DEB61DA97B1802240 root=19F82FB75EF526E72C1B333A212170F79073D64E6876DD6C9A4E8F5591EAC7F1 num_txs=2
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: 2024-10-27T04:20:10+01:00 INFO Inserted new payload into execution chain service=execution-engine payload_block_hash=0x537eb7b342455946863c152e7d70cbdfbdb6ba6a9f67302500119c079129e572 payload_parent_block_hash=0x390033bb3b320963581746560cf6496f88d7fcb5528b48a272fff2ede6382219 is_optimistic=true
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: fatal error: concurrent map iteration and map write
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: goroutine 4420 [running]:
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.validateHeaders(0x52e3ad?)
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/transport.go:514 +0x4a
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.(*Transport).roundTrip(0x3d08380, 0xc0001f8dc0)
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/transport.go:547 +0x16e
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.(*Transport).RoundTrip(0x3128010?, 0x28fc680?)
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/roundtrip.go:30 +0x13
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.send(0xc0001f8dc0, {0x28fc680, 0x3d08380}, {0xc005963601?, 0x41824b?, 0x0?})
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/client.go:259 +0x5e4
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.(*Client).send(0x3e5c780, 0xc0001f8dc0, {0x0?, 0xc0059636c8?, 0x0?})
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/client.go:180 +0x98
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.(*Client).do(0x3e5c780, 0xc0001f8dc0)
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/client.go:725 +0x8bc
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]: net/http.(*Client).Do(...)
Oct 27 04:20:10 berachain-testnetv2 beacond[1178]:         net/http/client.go:590

Here is the log file with the logs related to this issue: issue.log

aditya-manit commented 3 weeks ago

We were running beacond as a systemd service and had to manually stop and start the process to resolve it.

gummybera commented 3 weeks ago

Could it be related to https://github.com/berachain/beacon-kit/issues/2057?

sbond14 commented 3 weeks ago

I had the same issue running with Docker and a container orchestrator. It resolved itself when the container failed with "fatal error: concurrent map read and map write" and restarted.
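
For context on this class of crash: the Go runtime aborts the whole process with "fatal error: concurrent map iteration and map write" when one goroutine ranges over a map while another writes to it, and http.Header is a plain map underneath. The trace above only shows that net/http.validateHeaders was iterating a request's headers at the time. Which code in beacond was writing to that map is not shown, so the following is a minimal sketch of one common mitigation, not the beacon-kit fix. It assumes the headers are shared between a writer (for example a token refresher) and request senders. The safeHeaders type, the newRequest helper, and the URL are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// safeHeaders is a hypothetical wrapper (not a beacon-kit type) guarding a
// shared http.Header, which is a map[string][]string underneath.
type safeHeaders struct {
	mu sync.RWMutex
	h  http.Header
}

func newSafeHeaders() *safeHeaders {
	return &safeHeaders{h: make(http.Header)}
}

// Set updates a header under the write lock.
func (s *safeHeaders) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.h.Set(key, value)
}

// Snapshot returns a deep copy, so each request owns its own map and
// net/http can iterate it while other goroutines keep calling Set.
func (s *safeHeaders) Snapshot() http.Header {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.h.Clone()
}

// newRequest builds a request whose Header is a private copy of the shared one.
func newRequest(s *safeHeaders, url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header = s.Snapshot()
	return req, nil
}

func main() {
	shared := newSafeHeaders()
	shared.Set("Authorization", "Bearer token-0")

	var wg sync.WaitGroup

	// Writer: keeps rotating the header value.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 1; i <= 1000; i++ {
			shared.Set("Authorization", fmt.Sprintf("Bearer token-%d", i))
		}
	}()

	// Readers: build requests and iterate their headers concurrently.
	// The URL is a placeholder; no network call is made here.
	for r := 0; r < 4; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				req, err := newRequest(shared, "http://localhost:8551")
				if err != nil {
					panic(err)
				}
				for range req.Header {
					// Safe: this map is a private copy.
				}
			}
		}()
	}

	wg.Wait()
	fmt.Println(shared.Snapshot().Get("Authorization")) // prints "Bearer token-1000"
}
```

Without the lock and the Clone call, the same program may crash with the fatal error from the log. Running with `go run -race` is the usual way to find the offending writer.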