centromere opened this issue 9 months ago
Looks like the main thread died or deadlocked. Can you attach to it with a debugger and get backtraces of all the threads?
I've generated a stack trace of all the threads with the following command:
gdb --pid=1 -ex "thread apply all bt" --batch
Can you try this patch and let us know if it fixes the issue:
diff --git a/libraries/chain/include/eosio/chain/thread_utils.hpp b/libraries/chain/include/eosio/chain/thread_utils.hpp
index d3e9e8a26..5abcb15b3 100644
--- a/libraries/chain/include/eosio/chain/thread_utils.hpp
+++ b/libraries/chain/include/eosio/chain/thread_utils.hpp
@@ -160,8 +160,9 @@ namespace eosio { namespace chain {
    template<typename F>
    auto post_async_task( boost::asio::io_context& ioc, F&& f ) {
       auto task = std::make_shared<std::packaged_task<decltype( f() )()>>( std::forward<F>( f ) );
+      auto fut = task->get_future();
       boost::asio::post( ioc, [task]() { (*task)(); } );
-      return task->get_future();
+      return fut;
    }
 } } // eosio::chain
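For context, my reading of the patch (an assumption on my part, not a statement from the maintainers): the original code calls task->get_future() only after boost::asio::post() has handed the task to the io_context, so a worker thread may already be invoking (*task)() while the posting thread is still touching the same packaged_task. The patch takes the future first, while the posting thread still has exclusive access to the task. Below is a minimal, self-contained sketch of the fixed ordering, assuming Boost.Asio is available; post_async_task_fixed and the main() harness are illustrative names, not code from the repository.

#include <boost/asio.hpp>
#include <future>
#include <memory>
#include <thread>
#include <utility>

template <typename F>
auto post_async_task_fixed( boost::asio::io_context& ioc, F&& f ) {
   auto task = std::make_shared<std::packaged_task<decltype( f() )()>>( std::forward<F>( f ) );
   // Take the future while this thread still holds the only reference to the task.
   auto fut = task->get_future();
   // After this call a worker thread running ioc.run() may invoke the task at any time.
   boost::asio::post( ioc, [task]() { (*task)(); } );
   return fut;
}

int main() {
   boost::asio::io_context ioc;
   auto work = boost::asio::make_work_guard( ioc );   // keep run() alive until we release it
   std::thread worker( [&] { ioc.run(); } );

   auto fut = post_async_task_fixed( ioc, [] { return 42; } );
   int result = fut.get();                            // blocks until the posted task has run
   (void)result;

   work.reset();                                      // allow run() to return
   worker.join();
}

In this sketch the future is created before the task is ever visible to another thread, so the posting thread never interacts with the packaged_task concurrently with its invocation; fut.get() simply waits for the posted work to complete.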
Thank you @heifner. I have applied the patch and am in the process of testing it. I cannot reproduce the issue on demand, so it may be a few days before I have a result.
No luck with that patch.
@centromere I've been trying to reproduce this issue but so far have had no success.
When you encounter this problem, are you performing any actions on the node? For example, are you periodically querying the get_info HTTP RPC endpoint, or do you have a state_history client connected and streaming results?
Also, any additional information about the runtime environment, even the CPU and kernel version being used, could be helpful.
A periodic request to v1/chain/get_info does indeed occur. I am not aware of a streaming state_history client at this time. You can find the image I am using here.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Linux ... 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7502P 32-Core Processor
stepping : 0
microcode : 0x830107a
cpu MHz : 2200.000
cache size : 512 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso
bogomips : 4990.44
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
I am syncing from genesis, and the unhealthy condition occurs at this block:
Hello. I am running nodeos 5.0.0, compiled with GCC 11 and LLVM 11. Randomly throughout the day, the node stops responding to certain HTTP requests (but not others), and it also stops responding to most Unix signals:
Here are some recent logs from stdout:
nodeos is being invoked in this manner:
Does anyone know what could be going wrong?