Open rpopescu opened 3 years ago
The compensating_work_started function is only called from inside the scheduler, where we must have a valid, non-null call stack. That it is null may indicate that something is seriously wrong with the way the program is built. A build issue with shared libraries, perhaps?
If everything is correct, the check existed previously: https://github.com/chriskohlhoff/asio/issues/642#issuecomment-752291313
Suppressing it like this will instead introduce a subtle, hard-to-debug work counting problem. (https://github.com/chriskohlhoff/asio/pull/330#issuecomment-752275849)
GCC is able to recognize the potential null dereference:
/asio/detail/impl/scheduler.ipp:324:3: warning: potential null pointer dereference [-Wnull-dereference]
324 | ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
| ^
Suggestions:
1) Add another contains-style lookup function, for call sites where a hit is expected, annotated with the attribute __attribute__((returns_nonnull)):
https://github.com/chriskohlhoff/asio/blob/0355fc59807198034780c814c799ed353ffe960f/asio/include/asio/detail/call_stack.hpp#L92
2) To turn the hard-to-debug case into the easiest one (as long as assert is enabled for the build, or only for debug builds), add a check before the private_outstanding_work increment:
ASIO_ASSERT(this_thread);
I'm running into this issue all the time on arm64 when using the epoll reactor. Something is seriously wrong there, but I can't figure out where it goes wrong. Here is the handler tracking from the call that fails:
@asio|1615169960.876131|0*1|resolver@0xaaaaf2dc33b8.async_resolve
@asio|1615169961.008663|>1|ec=system:0,...
@asio|1615169961.008803|1*2|strand_executor@0xaaaaf2460b60.execute
@asio|1615169961.008865|1*3|io_context@0xaaaaf161eec0.execute
@asio|1615169961.008906|<1|
@asio|1615169961.008933|>3|
@asio|1615169961.008962|>2|
@asio|1615169961.009218|2^4|in 'async_connect' (/usr/include/boost/asio/impl/connect.hpp:362)
@asio|1615169961.009218|2*4|socket@0xaaaaf2dc3400.async_connect
@asio|1615169961.009479|<2|
@asio|1615169961.009525|<3|
@asio|1615169961.050516|.4|non_blocking_connect,ec=system:0
@asio|1615169961.050679|>4|ec=system:0
@asio|1615169961.050791|4*5|strand_executor@0xaaaaf2460b60.execute
@asio|1615169961.050863|4*6|io_context@0xaaaaf161eec0.execute
@asio|1615169961.050919|<4|
@asio|1615169961.050955|>6|
@asio|1615169961.050974|>5|
@asio|1615169961.056323|5^7|in 'async_write' (/usr/include/boost/asio/impl/write.hpp:331)
@asio|1615169961.056323|5^7|called from 'ssl::stream<>::async_handshake' (/usr/include/boost/asio/ssl/detail/io.hpp:201)
@asio|1615169961.056323|5*7|socket@0xaaaaf2dc3400.async_send
@asio|1615169961.056787|.7|non_blocking_send,ec=system:0,bytes_transferred=517
@asio|1615169961.056975|<5|
@asio|1615169961.056964|>7|ec=system:0,bytes_transferred=517
@asio|1615169961.057012|<6|
@asio|1615169961.057167|7*8|strand_executor@0xaaaaf2460b60.execute
@asio|1615169961.057206|7*9|io_context@0xaaaaf161eec0.execute
@asio|1615169961.057255|<7|
@asio|1615169961.057289|>9|
@asio|1615169961.057311|>8|
@asio|1615169961.057546|8^10|in 'ssl::stream<>::async_handshake' (/usr/include/boost/asio/ssl/detail/io.hpp:168)
@asio|1615169961.057546|8*10|socket@0xaaaaf2dc3400.async_receive
@asio|1615169961.057693|.10|non_blocking_recv,ec=system:11,bytes_transferred=0
@asio|1615169961.057728|<8|
@asio|1615169961.057755|<9|
Abgebrochen (Speicherabzug geschrieben) [German for "Aborted (core dumped)"]
It seems like the recv of the SSL handshake fails with connection aborted (if I interpret the 11 correctly), and then, when cleaning up after the error, it crashes because this_thread is null? While I can't say I'm doing everything correctly, this only seems to happen on ARM, but it seems to happen reliably there. There is either some data race or some other weirdness going on.
I have run across this issue as well. I browsed a bit through the Boost code and have a couple points - just as food for thought:
(1) In the destructor of call_stack::context, we have this code:
~context()
{
  call_stack<Key, Value>::top_ = next_;
}
This is assuming that the entry for the object being destroyed is always at the top. But if there were code like this:
auto context1 = std::make_unique< call_stack::context >(...);
auto context2 = std::make_unique< call_stack::context >(...);
...
context1.reset();
context2.reset();
that assumption is violated. The destructor for context1 will corrupt the call stack. To make this more robust, it may be a good idea to search the whole call stack for the correct entry when top_ isn't the right one.
Note that I haven't seen any code that would do things like that.
(2) Another thing is this:
void epoll_reactor::descriptor_state::do_complete(
    void* owner, operation* base,
    const boost::system::error_code& ec, std::size_t bytes_transferred)
{
  [...]
  if (operation* op = descriptor_data->perform_io(events))
  [...]
}
owner is the scheduler object on which compensating_work_started() is called later on inside perform_io(), but owner isn't passed to this function as a parameter. Instead, it is derived from some member of the descriptor_data object. I haven't pursued this any further, but maybe there is some rare situation in epoll_reactor where descriptor_data->scheduler_ is a nullptr or refers to a different scheduler?
We are also facing this issue in our project, and we found the backtrace below, which seems very similar to this issue.
Typical Crash backtrace:
Thread 1 (LWP 907):
#0  boost::asio::detail::scheduler::compensating_work_started (this=0x559e174230) at /usr/include/boost/asio/detail/impl/scheduler.ipp:321
#1  boost::asio::detail::epoll_reactor::perform_io_cleanup_on_block_exit::~perform_io_cleanup_on_block_exit (this=0x7f861ab348, __in_chrg=<optimized out>) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:712
#2  boost::asio::detail::epoll_reactor::descriptor_state::perform_io (events=<optimized out>, this=0x7f800023d0) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:730
#3  boost::asio::detail::epoll_reactor::descriptor_state::do_complete (owner=0x559e174230, base=0x7f800023d0, ec=..., bytes_transferred=<optimized out>) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:774
#4  0x0000007f894a4398 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=17, ec=..., owner=0x559e174230, this=0x7f800023d0) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#5  boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=0x559e174230) at /usr/include/boost/asio/detail/impl/scheduler.ipp:447
#6  boost::asio::detail::scheduler::run (this=0x559e174230, ec=...) at /usr/include/boost/asio/detail/impl/scheduler.ipp:200
#7  0x0000007f895e444c in boost::asio::io_context::run (this=0x559e174b70) at /usr/include/boost/asio/impl/io_context.ipp:63
Adding a NULL check on the pointer, as in https://github.com/chriskohlhoff/asio/pull/330/commits/a3afaecc1ef6e2f2a72af18132c1b509cd3ebe5b, appears to solve our issue, which we were seeing very frequently in our project.
Hi Chris,
I've just run into this issue described here https://github.com/boostorg/asio/issues/150#issuecomment-598456319 and here https://svn.boost.org/trac10/ticket/13562 related to this code: https://github.com/chriskohlhoff/asio/blob/6e75b35cdf5b6195cf7fc6f15d54eb134e6de22c/asio/include/asio/detail/impl/scheduler.ipp#L321
Can you please have a look and advise? Thank you