madshell closed this issue 3 years ago
@madshell There should be more to this stack trace. I don't think frame 11 could ever possibly be the bottom of the stack. There should be some mention of the JobQueue at the very least, since the JobQueue is what calls the functions to handle the RPC. Any chance you can get a full stack trace for the thread that segfaults?
@madshell Thanks for the full backtrace. In the full backtrace, do you know which thread actually crashed?
I am trying to reproduce this issue on my side and want to mirror what you are doing as much as possible. getLedger just calls the ledger RPC internally. I have a script that is calling ledger in a loop, with "transactions": True, "expand": True, incrementing the ledger sequence number (ledgerVersion in the RippleAPI) by one each time. I am running several of these concurrently. Does this model what you are doing? Are the multiple processes looping over the same range of ledgers, such that two different processes will call getLedger with the same ledgerVersion?
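The loop described above can be sketched as follows. This is an illustrative sketch only, not the actual reproduction script: the endpoint URL, helper names, and the use of Node's built-in fetch are assumptions.

```javascript
// Sketch of the reproduction loop described above: call the raw `ledger`
// RPC with "transactions" and "expand" set, incrementing the ledger
// sequence number each iteration. Endpoint URL and helper names are
// assumptions for illustration.

// Build the JSON-RPC body for one `ledger` call.
function makeLedgerRequest(seq) {
  return {
    method: 'ledger',
    params: [{
      ledger_index: seq,
      transactions: true,
      expand: true,
    }],
  };
}

// Loop over a range of ledger sequences. Several copies of this loop
// would run concurrently to mirror the setup described above.
async function pollLedgers(start, end, url = 'http://localhost:5005') {
  for (let seq = start; seq <= end; seq++) {
    const res = await fetch(url, {
      method: 'POST',
      body: JSON.stringify(makeLedgerRequest(seq)),
    });
    await res.json();
  }
}
```

Running several such loops over overlapping ledger ranges is what exercises the concurrent access paths discussed in this thread.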
Are the multiple processes looping over the same range of ledgers, such that two different processes will call getLedger with the same ledgerVersion?
Yes (but I'm not sure that's what makes it crash).
Here is a new crash (that I produced with the JS script):
Thread 51 "JobQueue" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffed7fa700 (LWP 4574)]
0x00000000027fd442 in __dynamic_cast ()
result of: thread apply all bt
I also managed to get another crash (not a SIGSEGV), I don't know if it could be related:
double free or corruption (fasttop)
Thread 49 "JobQueue" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffde7fc700 (LWP 15213)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
stacktrace of threads: SIGABRT.txt
@madshell thanks for all of these stack traces; they are very helpful. The crashing nodes are all running rippled 1.6.0, correct? Did you install a precompiled binary or did you build from source?
@cjcobb23 Yes, both nodes are running 1.6.0. I installed the precompiled binary at first, but I also compiled 1.6.0 to try to make things healthier (without success). The stack traces are from the precompiled binary running on Debian 10, but crashes happen with both the precompiled binary and the build from source.
I got a new segfault even when running queries much more slowly.
@cjcobb23 Apparently, I can only reproduce this if our importer processes are running. The only RPC functions that we call are "getLedgerVersion" and "getLedger". I will investigate more tomorrow and get back to you with a reliable way to make things crash. It would be great to have a more direct way to communicate.
@madshell you can email me at ccobb@ripple.com
@madshell this should be fixed in 1.7-b8 (current tip of develop branch). Can you test and confirm?
@cjcobb23 After compiling 1.7-b8, I tested it intensively for more than 30 minutes with aggressive parameters and I can't reproduce the bug anymore. Well done guys!
@cjcobb23 The 1.7 binary is much larger (1.4 GB) than the compiled 1.6 binary (400 MB); is this expected? Binaries from the repositories also seem much smaller, probably due to compilation parameters?
If you built the binary yourself, the size difference may just be a result of the difference between a Debug & Release build, or a static & non-static build. I don't know offhand which options are used to build the packages.
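The build-type difference mentioned above can be sketched as follows. This is a sketch assuming a CMake-based build of rippled; the exact options used for the official packages are not stated in this thread.

```shell
# Assumed CMake-based build: the build type controls optimization and
# debug info, which dominates binary size.
cmake -DCMAKE_BUILD_TYPE=Debug ..    # unoptimized, full debug symbols (large)
cmake -DCMAKE_BUILD_TYPE=Release ..  # optimized, minimal debug info (small)

# Debug symbols can also be removed from an already-built binary:
strip rippled
```

A Debug build with full debug information can easily be several times the size of a stripped Release build, which is consistent with the 1.4 GB vs 400 MB difference observed above.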
Issue Description
We're running a full-history rippled 1.6.0 node, and we have multiple processes calling the getLedger API (incrementing the ledger number). Randomly, rippled crashes with segmentation faults. This is potentially a very dangerous issue.
We suspect it is the result of mishandled concurrent memory access.
Steps to Reproduce
Here is the way we call the API from Javascript:
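(The original snippet was not preserved here; the following is a minimal sketch of the described usage, assuming ripple-lib's RippleAPI. The server URL and the exact option names used in the real script are assumptions.)

```javascript
// Illustrative sketch only -- not the original reporter's script.
// Assumes ripple-lib's RippleAPI; includeTransactions/includeAllData
// correspond to the raw `ledger` RPC's "transactions" and "expand" flags.

// Build the options for one getLedger call.
function getLedgerOptions(ledgerVersion) {
  return {
    ledgerVersion,
    includeTransactions: true,
    includeAllData: true,
  };
}

// Fetch a range of ledgers, incrementing the ledger number each time.
async function run(startVersion, endVersion) {
  const {RippleAPI} = require('ripple-lib');
  const api = new RippleAPI({server: 'ws://localhost:6006'}); // assumed URL
  await api.connect();
  for (let v = startVersion; v <= endVersion; v++) {
    await api.getLedger(getLedgerOptions(v));
  }
  await api.disconnect();
}
```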
Expected Result
We expect the node not to crash.
Actual Result
Here is the complete backtrace obtained with gdb:
Environment
Ubuntu 18.04.5 LTS, Intel(R) Xeon(R) CPU E5-2620 v3, 80 GB RAM, 20 TB (SSDs)
Supporting Files
rippled.cfg