driplineorg / dripline-python

python implementation of project8/dripline
Apache License 2.0
3 stars 6 forks source link

Segfault from C++ Logger #138

Closed nsoblath closed 3 years ago

nsoblath commented 3 years ago

In certain conditions we've encountered a segfault that appears to originate in the C++ logger. So far this has only been seen while running dl-python, hence it's being posted here.

Stacktrace

Here's the common part of the stacktrace, obtained by running python in GDB (the message being printed, shown in frames #6 and #7 is specific to each crash):

key-value-store_1     | Program received signal SIGSEGV, Segmentation fault.
key-value-store_1     | 0x00007f4bc906acb8 in cfunction_enter_call (func=0x7f4bc6e806d0) at Objects/methodobject.c:402
key-value-store_1     | 402 Objects/methodobject.c: No such file or directory.
key-value-store_1     | #0  0x00007f4bc906acb8 in cfunction_enter_call (func=0x7f4bc6e806d0) at Objects/methodobject.c:402
key-value-store_1     | #1  cfunction_vectorcall_O (func=func@entry=0x7f4bc6e806d0, args=0x7f4bc880a8c8, nargsf=nargsf@entry=1, kwnames=0x0) at Objects/methodobject.c:478
key-value-store_1     | #2  0x00007f4bc903eb4d in PyVectorcall_Call (callable=0x7f4bc6e806d0, tuple=<optimized out>, kwargs=<optimized out>) at Objects/call.c:199
key-value-store_1     | #3  0x00007f4bc7cb87c9 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #4  0x00007f4bc7ccac4d in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #5  0x00007f4bc7e2ecde in std::ostream::flush() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
key-value-store_1     | #6  0x00007f4bc8061fd8 in scarab::logger::Private::logCout (this=0x55f60bf33040, level=scarab::logger::ELevel::eInfo, message="Connecting to <rabbit-broker:5672>", loc=...) at /usr/local/src/scarab/library/logger/logger.cc:175
key-value-store_1     | #7  0x00007f4bc806172b in scarab::logger::Log (this=0x7f4bc7c703e0 <_ZN8driplineL4dlogE>, level=scarab::logger::ELevel::eInfo, message="Connecting to <rabbit-broker:5672>", loc=...) at /usr/local/src/scarab/library/logger/logger.cc:346

Where the crash occurs

So far this has been seen in:

  1. dripline::core::send() run from Interface::get()
  2. dripline::service::start() run from dl-serve

The logging line that causes the crash is consistent for the interface case and for the service case. For each case, it's always the same line (frame #8, not pasted above), that causes the problem.

In the service case, the next frame is:

key-value-store_1     | #8  0x00007f4bc7bf4119 in dripline::service::start (this=0x55f60c3ec3f0) at /usr/local/src/library/service.cc:174

Code version

It seems to depend on the version being used. If I use the dl-python v4.5.1 image, there is no crash. If I use dl-python's develop image or a newer in-development image, there's a crash.

Verbosity

It also seems to depend on the verbosity. If I run dl-serve with -v or -vv, there's a crash. If I don't use those (less verbosity), there's no crash.

nsoblath commented 3 years ago

Replicating the problem

I'll lay out some steps to replicate the problem using the first-mesh example.

  1. In the first terminal, start the rabbit broker:
    docker-compose up rabbit-broker`

Interface crash

  1. In docker-compose.yaml, set the key-value-store/image to driplineorg/dripline-python:feature-sighand (which fixes the receiver-timeout problem).
  2. Make the key-value-store/command
    dl-serve
    -c
    /root/key-value-store.yaml

    so that the verbosity is low, thereby avoiding the service itself crashing.

  3. In a second terminal, run the service:
    docker-compose up key-value-store
  4. In a third terminal, get an interactive bash prompt:
    docker-compose exec key-value-store /bin/bash
  5. Start python in GDB, and then run the program once you get a GDB prompot:
    gdb python
    (gdb) run
  6. Run the following commands in python. The last should cause the crash, dropping back to GDB.
    >>> from dripline.core import Interface
    >>> the_interface = Interface(dripline_config={'auth-file':'/root/authentications.json'})
    >>> the_interface.get('peaches')
  7. Get a stack trace
    (gdb) bt

Service crash

  1. If it's still running from above, stop the service container in the second terminal with ctrl-c.
  2. Change the key-value-store command to:
    command:
        - gdb
        - --batch
        - --ex
        - run
        - --ex
        - bt
        - --args
        - python
        - /usr/local/bin/dl-serve
        - -c
        - /root/key-value-store.yaml
        - -vv
  3. Start the container:
    > docker-compose up key-value-store

    This should cause the crash and automatically print the stacktrace.

nsoblath commented 3 years ago

Interface crash with the first 20 frames in the stacktrace

>>> the_interface.get('peaches')

Program received signal SIGSEGV, Segmentation fault.
0x00007f3efaeb7138 in cfunction_enter_call (func=0x7f3ef8877720) at Objects/methodobject.c:402
402 Objects/methodobject.c: No such file or directory.
(gdb) bt
#0  0x00007f3efaeb7138 in cfunction_enter_call (func=0x7f3ef8877720) at Objects/methodobject.c:402
#1  cfunction_vectorcall_O (func=func@entry=0x7f3ef8877720, args=0x7f3efa90b4d8, nargsf=nargsf@entry=1, kwnames=0x0) at Objects/methodobject.c:478
#2  0x00007f3efae8ae7d in PyVectorcall_Call (callable=0x7f3ef8877720, tuple=<optimized out>, kwargs=<optimized out>) at Objects/call.c:200
#3  0x00007f3ef9ee8999 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
#4  0x00007f3ef9ef9cbd in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
#5  0x00007f3efa05fcde in std::ostream::flush() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f3efa29987a in scarab::logger::Private::logCout (this=0x55bb2344db70, level=scarab::logger::ELevel::eDebug, 
    message="Sending request with routing key <peaches>", loc=...) at /usr/local/src/scarab/library/logger/logger.cc:175
#7  0x00007f3efa2990d1 in scarab::logger::Log (this=0x7f3ef9e9ce80 <_ZN8driplineL4dlogE>, level=scarab::logger::ELevel::eDebug, 
    message="Sending request with routing key <peaches>", loc=...) at /usr/local/src/scarab/library/logger/logger.cc:346
#8  0x00007f3ef9dadbb1 in dripline::core::send (this=0x55bb235bbca0, a_request=std::shared_ptr<class dripline::msg_request> (use count 4, weak count 0) = {...})
    at /usr/local/src/library/core.cc:182
#9  0x00007f3ef9ece39a in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
#10 0x00007f3ef9eddb38 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
#11 0x00007f3ef9f08167 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
#12 0x00007f3efae8a750 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x7f3efa50e360) at Objects/call.c:743
#13 PyCFunction_Call (func=0x7f3efa50e360, args=<optimized out>, kwargs=<optimized out>) at Objects/call.c:773
#14 0x00007f3efae89594 in _PyObject_MakeTpCall (callable=0x7f3efa50e360, args=<optimized out>, nargs=<optimized out>, keywords=0x0) at Objects/call.c:159
#15 0x00007f3efadd7e1a in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f3efa875c00, callable=0x7f3efa50e360) at ./Include/cpython/abstract.h:125
#16 _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f3efa875c00, callable=0x7f3efa50e360) at ./Include/cpython/abstract.h:115
#17 method_vectorcall (method=<optimized out>, args=0x7f3efa875c08, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:60
#18 0x00007f3efaf08324 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f3efa875c08, callable=0x7f3efa91d2c0) at ./Include/cpython/abstract.h:127
#19 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55bb2336a550) at Python/ceval.c:4963
#20 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3469
nsoblath commented 3 years ago

Service crash with the first 20 frames of the stacktrace

key-value-store_1     | 2021-06-22T19:31:44[INFO    ] dripline(113) -> about to start the service
key-value-store_1     | 402 Objects/methodobject.c: No such file or directory.
key-value-store_1     | 
key-value-store_1     | Program received signal SIGSEGV, Segmentation fault.
key-value-store_1     | 0x00007f7c559ea138 in cfunction_enter_call (func=0x7f7c539b4860) at Objects/methodobject.c:402
key-value-store_1     | #0  0x00007f7c559ea138 in cfunction_enter_call (func=0x7f7c539b4860) at Objects/methodobject.c:402
key-value-store_1     | #1  cfunction_vectorcall_O (func=func@entry=0x7f7c539b4860, args=0x7f7c54ea8208, nargsf=nargsf@entry=1, kwnames=0x0) at Objects/methodobject.c:478
key-value-store_1     | #2  0x00007f7c559bde7d in PyVectorcall_Call (callable=0x7f7c539b4860, tuple=<optimized out>, kwargs=<optimized out>) at Objects/call.c:200
key-value-store_1     | #3  0x00007f7c5462c999 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #4  0x00007f7c5463dcbd in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #5  0x00007f7c547a3cde in std::ostream::flush() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
key-value-store_1     | #6  0x00007f7c549dd87a in scarab::logger::Private::logCout (this=0x560750ce3400, level=scarab::logger::ELevel::eInfo, message="Connecting to <rabbit-broker:5672>", loc=...) at /usr/local/src/scarab/library/logger/logger.cc:175
key-value-store_1     | #7  0x00007f7c549dd0d1 in scarab::logger::Log (this=0x7f7c545e2400 <_ZN8driplineL4dlogE>, level=scarab::logger::ELevel::eInfo, message="Connecting to <rabbit-broker:5672>", loc=...) at /usr/local/src/scarab/library/logger/logger.cc:346
key-value-store_1     | #8  0x00007f7c5456522d in dripline::service::start (this=0x5607511a7500) at /usr/local/src/library/service.cc:174
key-value-store_1     | #9  0x00007f7c54618cff in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #10 0x00007f7c54618ea8 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #11 0x00007f7c5464c167 in ?? () from /usr/local/lib/python3.8/site-packages/_dripline.cpython-38-x86_64-linux-gnu.so
key-value-store_1     | #12 0x00007f7c559bd750 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x7f7c54ed7cc0) at Objects/call.c:743
key-value-store_1     | #13 PyCFunction_Call (func=0x7f7c54ed7cc0, args=<optimized out>, kwargs=<optimized out>) at Objects/call.c:773
key-value-store_1     | #14 0x00007f7c559bc594 in _PyObject_MakeTpCall (callable=0x7f7c54ed7cc0, args=<optimized out>, nargs=<optimized out>, keywords=0x0) at Objects/call.c:159
key-value-store_1     | #15 0x00007f7c5590ae1a in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x56075114c0c0, callable=0x7f7c54ed7cc0) at ./Include/cpython/abstract.h:125
key-value-store_1     | #16 _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x56075114c0c0, callable=0x7f7c54ed7cc0) at ./Include/cpython/abstract.h:115
key-value-store_1     | #17 method_vectorcall (method=<optimized out>, args=0x56075114c0c8, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:60
key-value-store_1     | #18 0x00007f7c55a3b324 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x56075114c0c8, callable=0x7f7c534696c0) at ./Include/cpython/abstract.h:127
key-value-store_1     | #19 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x560750badaf0) at Python/ceval.c:4963
key-value-store_1     | #20 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3469
nsoblath commented 3 years ago

What crashes and what doesn't?

Crashes:

Does not crash:

What was in the main tag?

It looks like the last thing to push to the main tag on Docker Hub (from GitHub Actions) was the release of v4.5.3: https://github.com/driplineorg/dripline-python/actions/runs/786505535. That was git tag a1460f34ce8b668dce355380f5daa5735dd2be69.

Based on the build log, it used dl-cpp v2.7.0 (driplineorg/dripline-cpp@d1b0eaa4602b3acdee116689ef22d5be1bfdd012).

What was in the v4.5.1 tag?

dl-python v4.5.1 is from commit 65162825d71df6de28eccaa6b9388c4e581885c0. This was pre-GitHub Actions, when Travis-CI was used to build the docker images. Scarab was v3.5.4 (project8/scarab@9431f47b1ff21711e950e392f31c345957b23eba).

dl-cpp was v2.4.0 (driplineorg/dripline-cpp@d1b0eaa4602b3acdee116689ef22d5be1bfdd012), which used scarab v3.2.1 (project8/scarab@29bd52c755910f1dbd7562e068a2aa5c26858263).

What changed?

dl-python v4.5.1 --> v4.5.3

dl-cpp v2.4.0 --> v2.7.0

scarab v3.2.1 --> v3.5.4

nsoblath commented 3 years ago

First from tracing changes: major logger update from Scarab v3.4.2, v3.4.4, v3.5.0, and v3.5.1 could be responsible. Maybe something's being deleted improperly.

Test: revert logger.hh and logger.cc to v3.4.1 and run the service test.

Result: it still crashes. The stacktrace confirms that the old logger code is being used in the test.

Other tests:

I un-reverted the logger changes back to the newer version.

Another thought: perhaps we haven't redirected the ostream and estream properly. I realized this while looking through pybind11 docs. I had forgotten about this point, in part because it's not more emphasized in those docs. I added a bunch of scoped stream redirects in scarab, and a few in dl-python. It still segfaults, but it's different now, and dying much earlier:

key-value-store_1     | 2021-06-23 01:58:12 [DEBUG] (tid 140279272539968) dexed_factory.hh(208): Registered a indexed_factory for class 999 at 0x7f954f458310, indexed_factory #28 for 0x5622d9f10cc0
key-value-store_1     | [Detaching after fork from child process 12]
key-value-store_1     | 402 Objects/methodobject.c: No such file or directory.
key-value-store_1     | 
key-value-store_1     | Program received signal SIGSEGV, Segmentation fault.
key-value-store_1     | 0x00007f9550866138 in cfunction_enter_call (func=0x7f954e1649f0) at Objects/methodobject.c:402
key-value-store_1     | #0  0x00007f9550866138 in cfunction_enter_call (func=0x7f954e1649f0) at Objects/methodobject.c:402
key-value-store_1     | #1  cfunction_vectorcall_O (func=func@entry=0x7f954e1649f0, args=0x7f9550005b08, nargsf=nargsf@entry=1, kwnames=0x0) at Objects/methodobject.c:478
key-value-store_1     | #2  0x00007f9550839e7d in PyVectorcall_Call (callable=0x7f954e1649f0, tuple=<optimized out>, kwargs=<optimized out>) at Objects/call.c:200
key-value-store_1     | #3  0x00007f954f9cbf6b in pybind11::detail::simple_collector<(pybind11::return_value_policy)1>::call (this=0x7ffcff45d948, ptr=0x7f954e1649f0) at /usr/include/pybind11/cast.h:1896
key-value-store_1     | #4  0x00007f954f9be52b in pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::str&> (this=0x7ffcff45e5a0, args#0=...) at /usr/include/pybind11/cast.h:2051
key-value-store_1     | #5  0x00007f954f9b3057 in pybind11::detail::pythonbuf::sync (this=0x7ffcff45e160) at /usr/include/pybind11/iostream.h:45
key-value-store_1     | #6  0x00007f954f9b2f9d in pybind11::detail::pythonbuf::overflow (this=0x7ffcff45e160, c=125) at /usr/include/pybind11/iostream.h:37
key-value-store_1     | #7  0x00007f954f62b9cb in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
key-value-store_1     | #8  0x00007f954f61cb34 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
key-value-store_1     | #9  0x00007f954f855852 in scarab::logger::Private::logCout (this=0x5622d9ec4800, level=scarab::logger::ELevel::eProg, message="Final configuration:\n\n{\n    dripline : \n    {\n        alerts-exchange : alerts\n        heartbeat-interval-s : 60\n        heartbeat-routing-key : heartbeat\n        loop-timeout-ms : 1000\n        max-pa"..., loc=...) at /usr/local/src/scarab/library/logger/logger.cc:175
key-value-store_1     | #10 0x00007f954f8550d1 in scarab::logger::Log (this=0x7f954f918010 <applog>, level=scarab::logger::ELevel::eProg, message="Final configuration:\n\n{\n    dripline : \n    {\n        alerts-exchange : alerts\n        heartbeat-interval-s : 60\n        heartbeat-routing-key : heartbeat\n        loop-timeout-ms : 1000\n        max-pa"..., loc=...) at /usr/local/src/scarab/library/logger/logger.cc:346
key-value-store_1     | #11 0x00007f954f892afb in scarab::main_app::pre_callback (this=0x5622da3c52d0) at /usr/local/src/scarab/library/cli/application.cc:134
key-value-store_1     | #12 0x00007f954f9a7858 in CLI::App::run_callback (this=0x5622da3c5300) at /usr/local/src/scarab/library/external/cli11/CLI11.hpp:5491
key-value-store_1     | #13 0x00007f954f9a5a49 in CLI::App::parse (this=0x5622da3c5300, args=...) at /usr/local/src/scarab/library/external/cli11/CLI11.hpp:5000
key-value-store_1     | #14 0x00007f954f9a597e in CLI::App::parse (this=0x5622da3c5300, argc=4, argv=0x7ffcff45e020) at /usr/local/src/scarab/library/external/cli11/CLI11.hpp:4929