centromere opened this issue 9 months ago
Looks like the main thread died or deadlocked. Can you attach to it with a debugger and get backtraces of all the threads?
I've generated a stack trace of all the threads with the following command:
gdb --pid=1 -ex "thread apply all bt" --batch
Can you try this patch and let us know if it fixes the issue:
diff --git a/libraries/chain/include/eosio/chain/thread_utils.hpp b/libraries/chain/include/eosio/chain/thread_utils.hpp
index d3e9e8a26..5abcb15b3 100644
--- a/libraries/chain/include/eosio/chain/thread_utils.hpp
+++ b/libraries/chain/include/eosio/chain/thread_utils.hpp
@@ -160,8 +160,9 @@ namespace eosio { namespace chain {
    template<typename F>
    auto post_async_task( boost::asio::io_context& ioc, F&& f ) {
       auto task = std::make_shared<std::packaged_task<decltype( f() )()>>( std::forward<F>( f ) );
+      auto fut = task->get_future();
       boost::asio::post( ioc, [task]() { (*task)(); } );
-      return task->get_future();
+      return fut;
    }
 } } // eosio::chain
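For context, my reading of the patch (an assumption on my part, not a statement from the maintainers): the original code calls task->get_future() only after boost::asio::post() has handed the task to the io_context, so a worker thread may already be invoking (*task)() while the posting thread is still touching the same packaged_task. The patch takes the future first, while the posting thread still has exclusive access to the task. Below is a minimal, self-contained sketch of the fixed ordering, assuming Boost.Asio is available; post_async_task_fixed and the main() harness are illustrative names, not code from the repository.

#include <boost/asio.hpp>
#include <future>
#include <memory>
#include <thread>
#include <utility>

template <typename F>
auto post_async_task_fixed( boost::asio::io_context& ioc, F&& f ) {
   auto task = std::make_shared<std::packaged_task<decltype( f() )()>>( std::forward<F>( f ) );
   // Take the future while this thread still holds the only reference to the task.
   auto fut = task->get_future();
   // After this call a worker thread running ioc.run() may invoke the task at any time.
   boost::asio::post( ioc, [task]() { (*task)(); } );
   return fut;
}

int main() {
   boost::asio::io_context ioc;
   auto work = boost::asio::make_work_guard( ioc );   // keep run() alive until we release it
   std::thread worker( [&] { ioc.run(); } );

   auto fut = post_async_task_fixed( ioc, [] { return 42; } );
   int result = fut.get();                            // blocks until the posted task has run
   (void)result;

   work.reset();                                      // allow run() to return
   worker.join();
}

In this sketch the future is created before the task is ever visible to another thread, so the posting thread never interacts with the packaged_task concurrently with its invocation; fut.get() simply waits for the posted work to complete.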
Thank you @heifner. I have applied the patch and am in the process of testing it. I cannot reproduce the issue on demand, so it may be a few days before I have a result.
No luck with that patch.
@centromere I've been trying to reproduce this issue but so far have had no success.
When you encounter this problem, are you performing any actions on the node? For example, are you periodically querying the get_info HTTP RPC endpoint, or do you have a state_history client connected and streaming results?
Also, any additional information about the runtime environment, even the CPU and kernel version being used, could be helpful.
A periodic request to v1/chain/get_info does indeed occur. I am not aware of a streaming state_history client at this time. You can find the image I am using here.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Linux ... 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7502P 32-Core Processor
stepping : 0
microcode : 0x830107a
cpu MHz : 2200.000
cache size : 512 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso
bogomips : 4990.44
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
I am syncing from genesis, and the unhealthy condition occurs at this block:
Hello. I am running nodeos 5.0.0, compiled with GCC 11 and LLVM 11. Randomly throughout the day, the node stops responding to certain HTTP requests (but not others), and it also stops responding to most Unix signals:
Here are some recent logs from stdout:
nodeos is being invoked in this manner:
Does anyone know what could be going wrong?