AntelopeIO / spring

C++ implementation of the Antelope protocol with Savanna consensus

nodeos hangs on "terminate called after throwing an instance of 'Serialization::FatalSerializationException'" #866

Open the-smooth-operator opened 6 days ago

the-smooth-operator commented 6 days ago

Since the Spring upgrade, we have observed that our nodes suddenly stop responding to RPC calls and stop emitting normal logs, printing only terminate called after throwing an instance of 'Serialization::FatalSerializationException' roughly every 30 minutes:

...
debug 2024-10-02T04:20:20.569 nodeos    controller.cpp:4612           clear_expired_input_ ] removed 19 expired transactions of the 3761 input dedup list, pending block time 2024-10-02T01:44:12.500
debug 2024-10-02T04:20:20.585 nodeos    controller.cpp:4612           clear_expired_input_ ] removed 0 expired transactions of the 3750 input dedup list, pending block time 2024-10-02T01:44:13.000
debug 2024-10-02T04:20:20.611 nodeos    controller.cpp:4612           clear_expired_input_ ] removed 30 expired transactions of the 3759 input dedup list, pending block time 2024-10-02T01:44:13.500
debug 2024-10-02T04:20:20.636 net-1     net_plugin.cpp:2244           sync_timeout         ] ["eosn-eos-seed171:9876 - c511972" - 1 209.249.216.152:9876] sync timeout
terminate called after throwing an instance of 'Serialization::FatalSerializationException'
terminate called after throwing an instance of 'Serialization::FatalSerializationException'
terminate called after throwing an instance of 'Serialization::FatalSerializationException'
terminate called after throwing an instance of 'Serialization::FatalSerializationException'

Furthermore, when the process is restarted (the restart itself is graceful), it always comes up with corrupted state, so we need to start it from a snapshot or perform a long replay.

error 2024-10-02T07:32:37.800 nodeos    main.cpp:224                  main                 ] 3020000 fork_database_exception: Fork database exception
No existing fork database despite existing chain state. Replay required.
    {}
    nodeos  controller.cpp:1705 replay

    {}
    nodeos  chain_plugin.cpp:1159 plugin_startup

While the process is "stuck" there is network activity on the host, and the process still holds its memory and uses CPU cycles. However, disk operations drop to 0.
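For context on why only that bare line appears, with no timestamp or nodeos logger prefix: it is the message libstdc++'s default terminate handler writes to stderr when an exception is thrown and never caught, for example when it escapes a worker thread's entry function. Below is a minimal sketch of that generic mechanism; it is not spring code, and the exception type is only a hypothetical stand-in for the real Serialization::FatalSerializationException.

// Minimal sketch of generic C++/libstdc++ behaviour, not spring code.
// An exception that escapes a std::thread entry function causes the runtime
// to call std::terminate(); libstdc++'s verbose terminate handler then prints
// "terminate called after throwing an instance of '<type>'" straight to
// stderr, bypassing the application's logger.
#include <stdexcept>
#include <thread>

namespace Serialization {
    // hypothetical stand-in for the real exception type named in the log
    struct FatalSerializationException : std::runtime_error {
        using std::runtime_error::runtime_error;
    };
}

int main() {
    std::thread worker([] {
        // no try/catch here, so the exception leaves the thread function
        throw Serialization::FatalSerializationException{"unexpected end of stream"};
    });
    worker.join(); // never returns normally: the process terminates first
}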

bhazzard commented 4 days ago

@the-smooth-operator, when was the first observed occurrence of this?

Can you share:

the-smooth-operator commented 1 day ago

This started happening on Friday, the 27th of September. The memory limit is set at 32GB, but the process rarely uses it all. We have bumped it to 64GB and saw this happening again.

config.ini

p2p-peer-address = eos.seed.eosnation.io:9876
p2p-peer-address = peer1.eosphere.io:9876
p2p-peer-address = peer2.eosphere.io:9876
p2p-peer-address = p2p.genereos.io:9876

# Plugins
plugin = eosio::chain_plugin
plugin = eosio::net_plugin
plugin = eosio::http_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::net_api_plugin

# Stability Configs
http-validate-host = false
chain-state-db-size-mb = 89200

# Default settings are way too low
net-threads = 6
http-threads = 6
chain-threads = 6

Most recent logs prior to the error:

debug 2024-10-04T14:52:27.164 net-1     net_plugin.cpp:3816           operator()           ] validated block header, broadcasting immediately, connection - 1, blk num = 397510466, id = 17b18742b890a90bc75be74baacc27f278f03bdef6e083c5edc8e8627937149b
debug 2024-10-04T14:52:27.164 net-1     net_plugin.cpp:2694           bcast_block          ] bcast block 397510466
debug 2024-10-04T14:52:27.164 net-1     net_plugin.cpp:3822           operator()           ] posting block 397510466 to app thread
debug 2024-10-04T14:52:27.169 net-1     controller.cpp:4215           integrate_received_q ] set received qc: {"block_num":397510466,"is_strong_qc":true} into claimed block 397510466 17b18742b890a90bc75be74baacc27f278f03bdef6e083c5edc8e8627937149b
debug 2024-10-04T14:52:27.169 net-1     fork_database.cpp:353         set_pending_savanna_ ] set pending savanna lib 397510465: 17b18741d53e330193d5409116c4e5476e81b2c914893a1a701363970822fa54
debug 2024-10-04T14:52:27.169 net-1     net_plugin.cpp:3816           operator()           ] validated block header, broadcasting immediately, connection - 1, blk num = 397510467, id = 17b187436f3e0319a0842aad6f182acc78f37a2f145fd1a1197857a8a25693c4
debug 2024-10-04T14:52:27.169 net-1     net_plugin.cpp:2694           bcast_block          ] bcast block 397510467
debug 2024-10-04T14:52:27.169 net-1     net_plugin.cpp:3822           operator()           ] posting block 397510467 to app thread
debug 2024-10-04T14:52:27.175 net-1     controller.cpp:4215           integrate_received_q ] set received qc: {"block_num":397510467,"is_strong_qc":true} into claimed block 397510467 17b187436f3e0319a0842aad6f182acc78f37a2f145fd1a1197857a8a25693c4
debug 2024-10-04T14:52:27.175 net-1     fork_database.cpp:353         set_pending_savanna_ ] set pending savanna lib 397510466: 17b18742b890a90bc75be74baacc27f278f03bdef6e083c5edc8e8627937149b
debug 2024-10-04T14:52:27.175 net-1     net_plugin.cpp:3816           operator()           ] validated block header, broadcasting immediately, connection - 1, blk num = 397510468, id = 17b18744ca34b7a51ef26133f8f4fb7b8d4cc79f481b6d48a86ca043f024b608
debug 2024-10-04T14:52:27.175 net-1     net_plugin.cpp:2694           bcast_block          ] bcast block 397510468
debug 2024-10-04T14:52:27.175 net-1     net_plugin.cpp:3822           operator()           ] posting block 397510468 to app thread
debug 2024-10-04T14:52:27.179 net-1     controller.cpp:4215           integrate_received_q ] set received qc: {"block_num":397510468,"is_strong_qc":true} into claimed block 397510468 17b18744ca34b7a51ef26133f8f4fb7b8d4cc79f481b6d48a86ca043f024b608
debug 2024-10-04T14:52:27.179 net-1     fork_database.cpp:353         set_pending_savanna_ ] set pending savanna lib 397510467: 17b187436f3e0319a0842aad6f182acc78f37a2f145fd1a1197857a8a25693c4
debug 2024-10-04T14:52:27.180 net-1     net_plugin.cpp:3816           operator()           ] validated block header, broadcasting immediately, connection - 1, blk num = 397510469, id = 17b1874532e7a2634c13952bf9be08bd4db05c44606e51b9c3cd3021d380be84
debug 2024-10-04T14:52:27.180 net-1     net_plugin.cpp:2694           bcast_block          ] bcast block 397510469
debug 2024-10-04T14:52:27.180 net-1     net_plugin.cpp:3822           operator()           ] posting block 397510469 to app thread
debug 2024-10-04T14:52:27.605 net-2     net_plugin.cpp:2244           sync_timeout         ] ["peer1.eosphere.io:9876 - 922061b" - 3 198.50.156.24:9876] sync timeout
2024-10-04 17:10:49.718
terminate called after throwing an instance of 'Serialization::FatalSerializationException'