XRPLF / rippled

Decentralized cryptocurrency blockchain daemon implementing the XRP Ledger protocol in C++
https://xrpl.org
ISC License

Segmentation fault in txs_iter_impl function (Version: 1.6.0) #3689

Closed madshell closed 3 years ago

madshell commented 3 years ago

Issue Description

We're running a full-history rippled 1.6.0 node, and we have multiple processes calling the getLedger API (incrementing the ledger version on each call). At random, rippled crashes with a segmentation fault. This is potentially a very dangerous issue.

We have the impression that it's the result of bad handling of some concurrent memory access.

Steps to Reproduce

Here is the way we call the API from JavaScript:

api.getLedger({
  ledgerVersion: number,
  includeTransactions: true,
  includeAllData: true,
}).then((res) => {});

Expected Result

We expect the node not to crash.

Actual Result

Here is the complete backtrace obtained with gdb:

(gdb) bt
#0  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count (__r=..., this=0x7ffd3ca78b00) at /usr/include/c++/7/bits/shared_ptr_base.h:691
#1  std::__shared_ptr<ripple::SHAMapAbstractNode, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0x7ffd3ca78af8) at /usr/include/c++/7/bits/shared_ptr_base.h:1121
#2  std::shared_ptr<ripple::SHAMapAbstractNode>::shared_ptr (this=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:119
#3  std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>::pair (this=<optimized out>) at /usr/include/c++/7/bits/stl_pair.h:303
#4  std::_Construct<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const&> (__p=<optimized out>)
    at /usr/include/c++/7/bits/stl_construct.h:75
#5  std::__uninitialized_copy<false>::__uninit_copy<std::_Deque_iterator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const&, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const*>, std::_Deque_iterator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>&, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>*> > (__result=..., __first=..., __last=...)
    at /usr/include/c++/7/bits/stl_uninitialized.h:83
#6  std::uninitialized_copy<std::_Deque_iterator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const&, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const*>, std::_Deque_iterator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>&, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>*> > (__result=..., __first=..., __last=...) at /usr/include/c++/7/bits/stl_uninitialized.h:134
#7  std::__uninitialized_copy_a<std::_Deque_iterator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const&, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> const*>, std::_Deque_iterator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>&, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>*>, std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> > (__result=...,
    __first=..., __last=...) at /usr/include/c++/7/bits/stl_uninitialized.h:289
#8  std::deque<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::allocator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> > >::deque (this=0x7ffd3c3d4530,
    __x=...) at /usr/include/c++/7/bits/stl_deque.h:950
#9  0x00005555571156d5 in std::stack<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::deque<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID>, std::allocator<std::pair<std::shared_ptr<ripple::SHAMapAbstractNode>, ripple::SHAMapNodeID> > > >::stack (this=0x7ffd3c3d4530) at /usr/include/c++/7/bits/stl_stack.h:99
#10 ripple::SHAMap::const_iterator::const_iterator (this=0x7ffd3c3d4530) at /root/rippled/src/ripple/shamap/SHAMap.h:535
#11 ripple::Ledger::txs_iter_impl::txs_iter_impl (this=0x7ffd3c3d4520) at /root/rippled/src/ripple/app/ledger/Ledger.cpp:135

Environment

Ubuntu 18.04.5 LTS, Intel(R) Xeon(R) CPU E5-2620 v3, 80 GB RAM, 20 TB storage (SSDs)

Supporting Files

rippled.cfg

cjcobb23 commented 3 years ago

@madshell There should be more to this stack trace. I don't think frame 11 could ever possibly be the bottom of the stack. There should be some mention of the JobQueue at the very least, since the JobQueue is what calls the functions to handle the RPC. Any chance you can get a full stack trace for the thread that segfaults?

madshell commented 3 years ago

crash_rippled.txt

cjcobb23 commented 3 years ago

@madshell Thanks for the full backtrace. In the full backtrace, do you know which thread actually crashed?

I am trying to reproduce this issue on my side and want to mirror what you are doing as much as possible. getLedger just calls the ledger RPC internally. I have a script that is calling ledger in a loop, with "transactions":True, "expand": True, incrementing the ledger sequence number (ledgerVersion in the RippleAPI) by one each time. I am running several of these concurrently. Does this model what you are doing? Are the multiple processes looping over the same range of ledgers, such that two different processes will call getLedger with the same ledgerVersion?
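The reproduction loop described above could be sketched roughly as follows. This is a minimal illustration, not the script actually used in this thread: the function names (buildLedgerRequest, walkLedgers, runWorkers, sendOverHttp) are hypothetical, and the transport assumes Node.js 18+ (global fetch) and a JSON-RPC port at http://localhost:5005/, which depends on your rippled.cfg.

```javascript
// Hypothetical reproduction sketch; all names here are illustrative.

// Build the JSON-RPC body for rippled's `ledger` method with
// transactions and expand enabled, as described in the comment above.
function buildLedgerRequest(ledgerIndex) {
  return {
    method: "ledger",
    params: [{ ledger_index: ledgerIndex, transactions: true, expand: true }],
  };
}

// One worker walks [start, end] one ledger at a time, awaiting each reply.
// `send` is any async function that delivers a request body to the server.
async function walkLedgers(send, start, end) {
  let count = 0;
  for (let index = start; index <= end; index++) {
    await send(buildLedgerRequest(index));
    count++;
  }
  return count;
}

// Run several workers concurrently over the SAME range, so that different
// workers end up requesting the same ledger index.
async function runWorkers(send, workers, start, end) {
  const jobs = [];
  for (let i = 0; i < workers; i++) {
    jobs.push(walkLedgers(send, start, end));
  }
  const counts = await Promise.all(jobs);
  return counts.reduce((a, b) => a + b, 0); // total requests sent
}

// Example transport. Assumptions: Node.js 18+ (global fetch) and a local
// JSON-RPC listener on port 5005; adjust the URL to match your rippled.cfg.
async function sendOverHttp(body) {
  const res = await fetch("http://localhost:5005/", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return res.json();
}
```

Something like runWorkers(sendOverHttp, 4, start, end) would then start four overlapping walkers over a ledger range of your choosing. Passing the transport in as a parameter also makes it easy to swap in a stub when checking the loop logic without a running node.
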

madshell commented 3 years ago

> Are the multiple processes looping over the same range of ledgers, such that two different processes will call getLedger with the same ledgerVersion?

yes (but not sure it's what makes it crash)

madshell commented 3 years ago

Here is a new crash (that I produced with the JS script):

Thread 51 "JobQueue" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffed7fa700 (LWP 4574)]
0x00000000027fd442 in __dynamic_cast ()

result of: thread apply all bt

thread51_crash.txt

madshell commented 3 years ago

I also managed to get another crash (not a SIGSEGV). I don't know whether it is related:

double free or corruption (fasttop)

Thread 49 "JobQueue" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffde7fc700 (LWP 15213)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

stacktrace of threads: SIGABRT.txt

cjcobb23 commented 3 years ago

@madshell thanks for all of these stack traces; they are very helpful. The crashing nodes are all running rippled 1.6.0, correct? Did you install a precompiled binary or did you build from source?

madshell commented 3 years ago

@cjcobb23 Yes, both nodes are running 1.6.0. I installed the precompiled binary at first, but I also compiled 1.6.0 from source to try to make things healthier (without success). The stack traces are from the precompiled binary running on Debian 10, but the crashes happen with both the precompiled binary and the build from source.

madshell commented 3 years ago

I managed to get a new segfault while running queries at a much slower rate.

madshell commented 3 years ago

@cjcobb23 Apparently, I can only reproduce this if our importer processes are running. The only RPC functions that we call are "getLedgerVersion" and "getLedger". I will investigate more tomorrow and get back to you with a proper way to make things crash. It would be great to have a more direct way to communicate.

cjcobb23 commented 3 years ago

@madshell you can email me at ccobb@ripple.com

cjcobb23 commented 3 years ago

@madshell this should be fixed in 1.7-b8 (current tip of develop branch). Can you test and confirm?

madshell commented 3 years ago

@cjcobb23 After compiling 1.7-b8, I tested it intensively for more than 30 minutes with aggressive parameters and I can't reproduce the bug anymore. Well done guys!

madshell commented 3 years ago

@cjcobb23 The 1.7 binary I compiled is a lot larger (1.4GB) than the 1.6 binary I compiled (400MB). Is this expected? Binaries from the repositories also seem a lot smaller, so is this probably down to compilation parameters?

ximinez commented 3 years ago

If you built the binary yourself, the size difference may just be a result of the difference between a Debug & Release build, or a static & non-static build. I don't know offhand which options are used to build the packages.