Open rpopescu opened 3 years ago
The compensating_work_started function is only called from inside the scheduler, where we must have a valid, non-null call stack. That it is null may indicate that something is seriously wrong with the way the program is built. A build issue with shared libraries, perhaps?
If everything is correct, the check existed previously: https://github.com/chriskohlhoff/asio/issues/642#issuecomment-752291313
Suppressing it like this will instead introduce a subtle, hard-to-debug work counting problem. (https://github.com/chriskohlhoff/asio/pull/330#issuecomment-752275849)
GCC is able to recognize the potential null dereference:
/asio/detail/impl/scheduler.ipp:324:3: warning: potential null pointer dereference [-Wnull-dereference]
324 | ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
| ^
Suggestions:
1) Add another contains-style lookup function, for call sites where a hit is expected, annotated with the attribute __attribute__((returns_nonnull)):
https://github.com/chriskohlhoff/asio/blob/0355fc59807198034780c814c799ed353ffe960f/asio/include/asio/detail/call_stack.hpp#L92
2) To turn the hard-to-debug case into the easiest one (as long as assert is enabled for the build, or only for debug builds), add a check before the private_outstanding_work increment:
ASIO_ASSERT(this_thread);
I'm running into this issue all the time on arm64 when using the epoll reactor. Something is seriously wrong there, but I can't figure out where it goes wrong. Here is the handler tracking from the call that fails:
@asio|1615169960.876131|0*1|resolver@0xaaaaf2dc33b8.async_resolve
@asio|1615169961.008663|>1|ec=system:0,...
@asio|1615169961.008803|1*2|strand_executor@0xaaaaf2460b60.execute
@asio|1615169961.008865|1*3|io_context@0xaaaaf161eec0.execute
@asio|1615169961.008906|<1|
@asio|1615169961.008933|>3|
@asio|1615169961.008962|>2|
@asio|1615169961.009218|2^4|in 'async_connect' (/usr/include/boost/asio/impl/connect.hpp:362)
@asio|1615169961.009218|2*4|socket@0xaaaaf2dc3400.async_connect
@asio|1615169961.009479|<2|
@asio|1615169961.009525|<3|
@asio|1615169961.050516|.4|non_blocking_connect,ec=system:0
@asio|1615169961.050679|>4|ec=system:0
@asio|1615169961.050791|4*5|strand_executor@0xaaaaf2460b60.execute
@asio|1615169961.050863|4*6|io_context@0xaaaaf161eec0.execute
@asio|1615169961.050919|<4|
@asio|1615169961.050955|>6|
@asio|1615169961.050974|>5|
@asio|1615169961.056323|5^7|in 'async_write' (/usr/include/boost/asio/impl/write.hpp:331)
@asio|1615169961.056323|5^7|called from 'ssl::stream<>::async_handshake' (/usr/include/boost/asio/ssl/detail/io.hpp:201)
@asio|1615169961.056323|5*7|socket@0xaaaaf2dc3400.async_send
@asio|1615169961.056787|.7|non_blocking_send,ec=system:0,bytes_transferred=517
@asio|1615169961.056975|<5|
@asio|1615169961.056964|>7|ec=system:0,bytes_transferred=517
@asio|1615169961.057012|<6|
@asio|1615169961.057167|7*8|strand_executor@0xaaaaf2460b60.execute
@asio|1615169961.057206|7*9|io_context@0xaaaaf161eec0.execute
@asio|1615169961.057255|<7|
@asio|1615169961.057289|>9|
@asio|1615169961.057311|>8|
@asio|1615169961.057546|8^10|in 'ssl::stream<>::async_handshake' (/usr/include/boost/asio/ssl/detail/io.hpp:168)
@asio|1615169961.057546|8*10|socket@0xaaaaf2dc3400.async_receive
@asio|1615169961.057693|.10|non_blocking_recv,ec=system:11,bytes_transferred=0
@asio|1615169961.057728|<8|
@asio|1615169961.057755|<9|
Abgebrochen (Speicherabzug geschrieben) [German for "Aborted (core dumped)"]
It seems like the recv of the SSL handshake fails with connection aborted (if I interpret the 11 correctly), and then, when cleaning up after the error, it crashes because this_thread is null? While I can't say I'm doing everything correctly, this only seems to happen on ARM, but it seems to happen reliably there. There is either some data race or some other weirdness going on.
I have run across this issue as well. I browsed a bit through the Boost code and have a couple points - just as food for thought:
(1) In the destructor of call_stack::context, we have this code:
~context()
{
  call_stack<Key, Value>::top_ = next_;
}
This is assuming that the entry for the object being destroyed is always at the top. But if there were code like this:
auto context1 = std::make_unique< call_stack::context >(...);
auto context2 = std::make_unique< call_stack::context >(...);
...
context1.reset();
context2.reset();
that assumption is violated. The destructor for context1 will corrupt the call stack. To make this more robust, it may be a good idea to search the whole call stack for the correct entry when top_ isn't the right one.
Note that I haven't seen any code that would do things like that.
(2) Another thing is this:
void epoll_reactor::descriptor_state::do_complete(
    void* owner, operation* base,
    const boost::system::error_code& ec, std::size_t bytes_transferred)
{
  [...]
  if (operation* op = descriptor_data->perform_io(events))
  [...]
}
owner is the scheduler object on which compensating_work_started() is called later on inside perform_io(), but owner isn't passed to this function as a parameter. Instead, it is derived from some member of the descriptor_data object. I haven't pursued this any further, but maybe there is some rare situation in epoll_reactor where descriptor_data->scheduler_ is a nullptr or refers to a different scheduler?
We are also facing this issue in our project, and we found the backtrace below, which seems very similar to this issue.
Typical Crash backtrace:
Thread 1 (LWP 907):
#0  boost::asio::detail::scheduler::compensating_work_started (this=0x559e174230) at /usr/include/boost/asio/detail/impl/scheduler.ipp:321
#1  boost::asio::detail::epoll_reactor::perform_io_cleanup_on_block_exit::~perform_io_cleanup_on_block_exit (this=0x7f861ab348, __in_chrg=<optimized out>) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:712
#2  boost::asio::detail::epoll_reactor::descriptor_state::perform_io (events=<optimized out>, this=0x7f800023d0) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:730
#3  boost::asio::detail::epoll_reactor::descriptor_state::do_complete (owner=0x559e174230, base=0x7f800023d0, ec=..., bytes_transferred=<optimized out>) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:774
#4  0x0000007f894a4398 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=17, ec=..., owner=0x559e174230, this=0x7f800023d0) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#5  boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=0x559e174230) at /usr/include/boost/asio/detail/impl/scheduler.ipp:447
#6  boost::asio::detail::scheduler::run (this=0x559e174230, ec=...) at /usr/include/boost/asio/detail/impl/scheduler.ipp:200
#7  0x0000007f895e444c in boost::asio::io_context::run (this=0x559e174b70) at /usr/include/boost/asio/impl/io_context.ipp:63
Adding a NULL check on the pointer, as in https://github.com/chriskohlhoff/asio/pull/330/commits/a3afaecc1ef6e2f2a72af18132c1b509cd3ebe5b, appears to solve our issue, which we were seeing very frequently in our project.
Hi Chris,
I've just run into this issue described here https://github.com/boostorg/asio/issues/150#issuecomment-598456319 and here https://svn.boost.org/trac10/ticket/13562 related to this code: https://github.com/chriskohlhoff/asio/blob/6e75b35cdf5b6195cf7fc6f15d54eb134e6de22c/asio/include/asio/detail/impl/scheduler.ipp#L321
Can you please have a look and advise? Thank you