DeFiCh / ain

DeFi Blockchain - enabling decentralized finance on Bitcoin
https://defichain.com
MIT License
408 stars 122 forks source link

segmentation fault : rocksdb : Too many open files #3098

Closed skutcher closed 1 month ago

skutcher commented 1 month ago

Summary

Steps to Reproduce

Environment

RUST_BACKTRACE=full

2024-10-16T18:59:04Z [ain_grpc] Starting JSON Eth RPC server at 127.0.0.1:8551
2024-10-16T18:59:04Z [ain_grpc] Init rs ws services
2024-10-16T18:59:04Z [ain_grpc] Starting ws server at 127.0.0.1:8553
2024-10-16T18:59:04Z init message: Done loading
2024-10-16T18:59:04Z 0 addresses found from DNS seeds
2024-10-16T18:59:04Z dnsseed thread exit
2024-10-16T18:59:04Z New outbound peer connected: version: 70038, blocks=4426259, peer=0
2024-10-16T18:59:05Z New outbound peer connected: version: 70038, blocks=4426259, peer=1
2024-10-16T18:59:05Z Synchronizing blockheaders, height: 4419682 (~99.85%)
2024-10-16T18:59:05Z New outbound peer connected: version: 70038, blocks=4426259, peer=2
2024-10-16T18:59:06Z Synchronizing blockheaders, height: 4421682 (~99.90%)
2024-10-16T18:59:06Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
thread '<unnamed>' panicked at /github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs:31:67:
called `Result::unwrap()` on an `Err` value: SimpleError { msg: SimpleMsg { err: "", file: "/github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs", line: 436, column: 57 }, cause: Some(SimpleError { msg: SimpleMsg { err: "IO error: While open a file for random read: /home/smo/.defi/evm/.vsdb/002338.sst: Too many open files", file: "/github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs", line: 436, column: 57 }, cause: None }) }
stack backtrace:
   0:     0x57906bc108af - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:     0x57906bc108af - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x57906bc108af - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x57906bc108af - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x57906b9e20bb - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
   5:     0x57906b9e20bb - core::fmt::write::hc090a2ffd6b28c4a
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
   6:     0x57906bbe3b52 - std::io::Write::write_fmt::h8898bac6ff039a23
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
   7:     0x57906bc12399 - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x57906bc12399 - std::sys_common::backtrace::print::ha96650907276675e
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x57906bc11bbe - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:271:22
  10:     0x57906bc12e59 - std::panicking::default_hook::h207342be97478370
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
  11:     0x57906bc12e59 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
  12:     0x57906bc12702 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
  13:     0x57906bc12659 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x57906bc12646 - rust_begin_unwind
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
  15:     0x57906ab74b72 - core::panicking::panic_fmt::hdc63834ffaaefae5
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
  16:     0x57906ab74f85 - core::result::unwrap_failed::h82b551e0ff2b2176
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5
  17:     0x57906bc5c398 - core::ops::function::FnOnce::call_once::h9be83c72fcb34d3b
  18:     0x57906bc634e8 - once_cell::imp::OnceCell<T>::initialize::{{closure}}::h16f939f811f483c4
  19:     0x57906bb29f3f - once_cell::imp::initialize_or_wait::h63f27d139639c331
  20:     0x57906ab9287b - once_cell::imp::OnceCell<T>::initialize::h15637d85abd0c950
  21:     0x57906bc66130 - core::ops::function::FnOnce::call_once::h3b47c4b74e2b2ad2
  22:     0x57906bc638db - once_cell::imp::OnceCell<T>::initialize::{{closure}}::ha9f41780e052a633
  23:     0x57906bb29f3f - once_cell::imp::initialize_or_wait::h63f27d139639c331
  24:     0x57906ab976ab - once_cell::imp::OnceCell<T>::initialize::h8e0d2283277274f6
  25:     0x57906bc8fe24 - vsdb_trie_db::MptStore::trie_restore::h3ca9b979e2e376ac
  26:     0x57906b50e865 - ain_evm::backend::EVMBackend::from_root::h171eaddb7be2fd68
  27:     0x57906b41dbd5 - cxxbridge1$evm_try_unsafe_create_template
  28:     0x57906b259f20 - _Z30evm_try_unsafe_create_templateR19CrossBoundaryResultmN4rust10cxxbridge13StrEjmm
  29:     0x57906ac81036 - _ZN15CScopedTemplate6CreateEmNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjmm
  30:     0x57906b1e8ec4 - _ZN11CChainState12ConnectBlockERK6CBlockR16CValidationStateP11CBlockIndexR15CCoinsViewCacheR13CCustomCSViewRK12CChainParamsRbb
  31:     0x57906b1f81cb - _ZN11CChainState10ConnectTipER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockER12ConnectTraceR29DisconnectedBlockTransactions
  32:     0x57906b1f9b99 - _ZN11CChainState21ActivateBestChainStepER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockERbR12ConnectTrace
  33:     0x57906b1fc4d1 - _ZN11CChainState17ActivateBestChainER16CValidationStateRK12CChainParamsSt10shared_ptrIK6CBlockE
  34:     0x57906b2068ea - _Z15ProcessNewBlockRK12CChainParamsSt10shared_ptrIK6CBlockEbPb
  35:     0x57906b03eafd - _ZL14ProcessMessageP5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamlRK12CChainParamsP8CConnmanRKSt6atomicIbE
  36:     0x57906b033bdc - _ZN19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  37:     0x57906b042397 - _ZThn8_N19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  38:     0x57906b0198a9 - _ZN8CConnman20ThreadMessageHandlerEv
  39:     0x57906b029c3a - _ZNSt17_Function_handlerIFvvESt5_BindIFM8CConnmanFvvEPS2_EEE9_M_invokeERKSt9_Any_data
  40:     0x57906b001136 - _Z11TraceThreadISt8functionIFvvEEEvPKcT_
  41:     0x57906b029dfc - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPKcSt8functionIFvvEEES4_S7_EEEEE6_M_runEv
  42:     0x57906c543194 - execute_native_thread_routine
  43:     0x78b6928db39d - <unknown>
  44:     0x78b69296049c - <unknown>
  45:                0x0 - <unknown>
thread '<unnamed>' panicked at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cxx-1.0.128/src/unwind.rs:37:9:
panic in ffi function ain_rs_exports::ffi::evm_try_unsafe_create_template, aborting.
stack backtrace:
   0:     0x57906bc108af - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1:     0x57906bc108af - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x57906bc108af - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x57906bc108af - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x57906b9e20bb - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
   5:     0x57906b9e20bb - core::fmt::write::hc090a2ffd6b28c4a
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
   6:     0x57906bbe3b52 - std::io::Write::write_fmt::h8898bac6ff039a23
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
   7:     0x57906bc12399 - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x57906bc12399 - std::sys_common::backtrace::print::ha96650907276675e
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x57906bc11bbe - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:271:22
  10:     0x57906bc12e59 - std::panicking::default_hook::h207342be97478370
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
  11:     0x57906bc12e59 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
  12:     0x57906bc12702 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
  13:     0x57906bc12659 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
  14:     0x57906bc12646 - rust_begin_unwind
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
  15:     0x57906ab74b72 - core::panicking::panic_fmt::hdc63834ffaaefae5
                               at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
  16:     0x57906ab75be7 - <cxx::unwind::Guard as core::ops::drop::Drop>::drop::h3c636553f50760c4
  17:     0x57906b41e635 - cxxbridge1$evm_try_unsafe_create_template
  18:     0x57906b259f20 - _Z30evm_try_unsafe_create_templateR19CrossBoundaryResultmN4rust10cxxbridge13StrEjmm
  19:     0x57906ac81036 - _ZN15CScopedTemplate6CreateEmNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjmm
  20:     0x57906b1e8ec4 - _ZN11CChainState12ConnectBlockERK6CBlockR16CValidationStateP11CBlockIndexR15CCoinsViewCacheR13CCustomCSViewRK12CChainParamsRbb
  21:     0x57906b1f81cb - _ZN11CChainState10ConnectTipER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockER12ConnectTraceR29DisconnectedBlockTransactions
  22:     0x57906b1f9b99 - _ZN11CChainState21ActivateBestChainStepER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockERbR12ConnectTrace
  23:     0x57906b1fc4d1 - _ZN11CChainState17ActivateBestChainER16CValidationStateRK12CChainParamsSt10shared_ptrIK6CBlockE
  24:     0x57906b2068ea - _Z15ProcessNewBlockRK12CChainParamsSt10shared_ptrIK6CBlockEbPb
  25:     0x57906b03eafd - _ZL14ProcessMessageP5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamlRK12CChainParamsP8CConnmanRKSt6atomicIbE
  26:     0x57906b033bdc - _ZN19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  27:     0x57906b042397 - _ZThn8_N19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
  28:     0x57906b0198a9 - _ZN8CConnman20ThreadMessageHandlerEv
  29:     0x57906b029c3a - _ZNSt17_Function_handlerIFvvESt5_BindIFM8CConnmanFvvEPS2_EEE9_M_invokeERKSt9_Any_data
  30:     0x57906b001136 - _Z11TraceThreadISt8functionIFvvEEEvPKcT_
  31:     0x57906b029dfc - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPKcSt8functionIFvvEEES4_S7_EEEEE6_M_runEv
  32:     0x57906c543194 - execute_native_thread_routine
  33:     0x78b6928db39d - <unknown>
  34:     0x78b69296049c - <unknown>
  35:                0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:228:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
skutcher commented 1 month ago

it seems btw to work with https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4268284.zip

but this seems not to be the problem, cause the fd count already drops to around 1600 before the error occurs

2024-10-16T20:52:58Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
2024-10-16T20:52:58Z UpdateTip: height=4423469 hash=e41b0298011b8e76244eebe13db5bdd6e6c3a23bfe561fb8112114f98174698e date='2024-10-15T20:00:42Z' tx=34709360 log2_work=88.520461 progress=0.999525
*** buffer overflow detected ***: terminated
*** buffer overflow detected ***: terminated

script to measure open file descriptors

ps axco pid,comm | grep defid | xargs -n 2 -- bash -c 'echo -n "$1 $2 ";lsof -p $1 2>/dev/null | wc -l' argv0 (Please note that lsof | wc -l sums up a lot of duplicated entries (forked processes can share file handles etc).)

prasannavl commented 1 month ago

Related: https://github.com/facebook/rocksdb/issues/10367

prasannavl commented 1 month ago

Thanks for the detailed info @skutcher. We're looking at options to workaround this.

For now, please push the kernel limits and ulimits to higher. We've observed anything 16k+ usually works well. We'll look at a more refined fix soon.

skutcher commented 1 month ago

yfyi: my default system limit:

limit -Hn
524288
ulimit -Sn
1024

but as mentioned above if i increase the softlimit i got just buffer overflows

prasannavl commented 1 month ago

but as mentioned above if i increase the softlimit i got just buffer overflows

I assume this after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off I'm afraid. Even if it doesn't error with something else, the integrity checks will fail to ensure the node doesn't start with bad data.

This is expected behavior. If you have issues with a clean data / snapshot, please let us know. We do run our machines with 1024 * 1024 or 16k as the file open limits depending on the system.

We do intend to workaround the rocksdb bug, but however, meanwhile the above should be a viable workaround. I've also expanded a bit more on https://github.com/DeFiCh/ain/issues/3101#issuecomment-2436546577

skutcher commented 1 month ago

but as mentioned above if i increase the softlimit i got just buffer overflows

I assume this after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off I'm afraid. Even if it doesn't error with something else, the integrity checks will fail to ensure the node doesn't start with bad data.

I am pretty sure it was with an fresh snapshot of https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4423468.zip . I will check again to be 100% sure

prasannavl commented 1 month ago

We were able to test with stricts ulimit < 1024, and now the default should work well with the default of 1024 even though it's recommended to have higher limits.

Believe this issue should be resolved with the latest master.

prasannavl commented 1 month ago

Closing, as this is now resolved. Please feel free to reopen if you continue to have issues with >= 4.2.1.