chriskohlhoff / asio

Asio C++ Library
http://think-async.com/Asio

Calling `acceptor::cancel` while `async_accept` loop is in progress causes deadlock. #806

Open sergio-eld opened 3 years ago

sergio-eld commented 3 years ago

Windows 10, MinGW64 GCC 10.2.0

There is no way to safely stop async_accept with acceptor::cancel. A deadlock occurs whether cancel is called directly or explicitly posted with asio::post(acceptor.get_executor(), [this](){ acceptor.cancel(); }).

Sample code for testing:

#include <asio.hpp>
#include <cstdio>     // setvbuf
#include <functional> // std::ref
#include <future>
#include <iostream>
#include <thread>

constexpr unsigned short port = 12000;
using asio::error_code;
using asio::ip::tcp;

void runContext(asio::io_context& io_context) {
    std::cout << "New thread " << std::this_thread::get_id() << std::endl;
    io_context.run();
    std::cout << "Stopping thread: " << std::this_thread::get_id() << std::endl;
}

class server {
public:
    template <typename Executor>
    explicit server(Executor executor) : acceptor_(executor, {asio::ip::make_address_v4("127.0.0.1"), port}) {
        acceptor_.set_option(tcp::acceptor::reuse_address(true));
    }

    void startAccepting() {
        acceptor_.listen();
        acceptLoop();
    }

    void stop()
    {
        asio::post(acceptor_.get_executor(), [this]() {
            acceptor_.cancel();
        }); // this line fixes deadlock
//         acceptor_.cancel();
        // acceptor_.close(); // this line also fixes deadlock
    }

private:
    void acceptLoop() {
        acceptor_.async_accept([this](error_code errorCode, tcp::socket peer) {
            if (!errorCode) {
                std::cout << "Accept: " << peer.remote_endpoint() << std::endl;
                acceptLoop();
            } else {
                std::cout << "Accept: " << errorCode.message() << std::endl;
            }
        });
    }

    tcp::acceptor acceptor_;
};

int main() {
    setvbuf(stdout, NULL, _IONBF, 0);
    asio::io_context context;

    // run server
    server server{make_strand(context)};
    server.startAccepting();

    // run client
    tcp::socket socket{make_strand(context)};
    std::future<void> res = socket.async_connect({ asio::ip::make_address_v4("127.0.0.1"), port}, asio::use_future);

    std::thread t1(runContext, std::ref(context));
    std::thread t2(runContext, std::ref(context));

//    std::cout << "Awaiting client " << std::endl;

    res.get();

//    std::cout << "Completed client" << std::endl;

    server.stop();

    std::cout << "Server stopped" << std::endl;

    t1.join();
    t2.join();
    std::cout << "Everyting shutdown" << std::endl;
}

I have opened a question on Stack Overflow: https://stackoverflow.com/questions/66953464/asio-getting-deadlock-when-using-several-strands-and-threads-with-one-io-contex

mabrarov commented 3 years ago

Hi.

It looks like the code you provided has a common error, one usually seen when timers are used. It's a sort of race condition. Here is what may happen (this can definitely be the root cause, but I haven't checked yet whether it's the only one):

  1. The acceptor::async_accept operation completed successfully inside the OS (i.e. in Windows IOCP), and the IOCP packet about successful completion of the asynchronous accept operation was queued within the Windows IOCP queue.
  2. The res.get(); line completed.
  3. The server::stop method was called, but acceptor::cancel didn't cancel any operation, because the asynchronous operation had already completed in terms of the OS.
  4. The completion handler of acceptor::async_accept is called with no error passed into it, so server::acceptLoop() is called one more time, starting a new asynchronous operation.
  5. The t1 and t2 threads wait for completion of that new asynchronous operation.

Unfortunately, the acceptor::cancel method doesn't return the number of cancelled operations (as is done for timers) or any other value indicating that some operation was cancelled. I believe this is caused by the lack of support for this information in the underlying API (like Winsock CancelIoEx). If acceptor::cancel could somehow signal whether any operation was actually cancelled, that would help in your case.
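
For reference, a minimal sketch of the difference between the two interfaces (the example() wrapper and variable names here are only illustrative, not from the reported code):

#include <asio.hpp>
#include <cstddef>

void example(asio::io_context& io, asio::ip::tcp::acceptor& acceptor)
{
    asio::steady_timer timer(io);
    timer.async_wait([](asio::error_code) { /* wait handler */ });

    // basic_waitable_timer::cancel() reports how many pending waits were
    // cancelled; 0 means the handler was already queued with "success".
    std::size_t cancelled = timer.cancel();
    (void) cancelled;

    // basic_socket_acceptor::cancel() returns void, so the caller cannot tell
    // whether an outstanding async_accept was actually cancelled.
    acceptor.cancel();
}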

mabrarov commented 3 years ago

Could you please try the following server class implementation in your tests?

#include <sstream> // for std::stringstream used in the accept handler

class server
{
public:
  template <typename Executor>
  explicit server(Executor executor) : acceptor_(executor,
      {asio::ip::make_address_v4("127.0.0.1"), port}), stopped_(false)
  {
    acceptor_.set_option(tcp::acceptor::reuse_address(true));
  }

  void startAccepting()
  {
    acceptor_.listen();
    acceptLoop();
  }

  void stop()
  {
    asio::post(acceptor_.get_executor(), [this]()
    {
      std::cout << "Stopping server\n";
      acceptor_.cancel();
      stopped_ = true;
      std::cout << "Server stopped\n";
    });
  }

private:
  void acceptLoop()
  {
    acceptor_.async_accept([this](error_code errorCode, tcp::socket peer)
    {
      if (!errorCode)
      {
        std::stringstream ss;
        ss << "Accept: peer " << peer.remote_endpoint() << '\n';
        std::cout << ss.str();
        if (stopped_)
        {
          std::cout << "Too late - the server was stopped\n";
        }
        else
        {
          acceptLoop();
        }
      }
      else
      {
        std::stringstream ss;
        ss << "Accept: error " << errorCode.value() << '\n';
        std::cout << ss.str();
      }
    });
  }

  tcp::acceptor acceptor_;
  bool stopped_;
};


I used the same class in my acceptor_cancel_test/src/main.cpp and was able to reproduce the following cases (the first one is described in my previous comment) on Windows 10 1703 x64 + Visual C++ 2019 + Boost 1.75.0:

New thread: 8924
New thread: 1200
Stopping server
Server stopped
Accept: peer 127.0.0.1:51577
Too late - the server was stopped
Stopping thread: 8924
Stopping thread: 1200
Everything shutdown

and

New thread: 2168
Accept: peer 127.0.0.1:51597
Stopping server
Server stopped
Accept: error 995
Stopping thread: 2168
New thread: 9532
Stopping thread: 9532
Everything shutdown

mabrarov commented 3 years ago

Refer to remark in https://www.boost.org/doc/libs/1_75_0/doc/html/boost_asio/reference/basic_waitable_timer/cancel/overload1.html:

Remarks

If the timer has already expired when cancel() is called, then the handlers for asynchronous wait operations will:

  • have already been invoked; or
  • have been queued for invocation in the near future.

These handlers can no longer be cancelled, and therefore are passed an error code that indicates the successful completion of the wait operation.

It seems that the same happens in your case with acceptor::cancel method.

sergio-eld commented 3 years ago

My workaround was exactly that - using a flag to break the asynchronous loop. The only difference is that I used std::atomic_bool and spawned a new thread to defer the cancel invocation.

accepting_ = false;
std::thread{[this](){acceptor_.cancel();}}.detach();

I've run your version 5000 times in a row for the sample code and haven't got a deadlock. But I don't see how this would prevent locking in case of multiple connection attempts from clients (in the sample code there's only one connection) happening simultaneously with the close invocation from within the strand. This is yet to be tested.

Anyhow, this is still an issue. The documentation states that cancel stops outstanding operations and does not mention anything about a data race. So either: a) the documentation needs to be updated, or b) the documentation describes the intended behaviour, and cancel needs to be race-free.

mabrarov commented 3 years ago

Hi

std::thread{[this](){acceptor_.cancel();}}.detach();

You need to access acceptor_ from a single strand instance, because in your case multiple threads work with the same acceptor instance - the main thread, t1 and t2 - and using a strand guarantees there is no concurrent access to acceptor_ (the asio::ip::tcp::acceptor class is not thread safe).
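
For example, a sketch of that change to your workaround (reusing your accepting_ flag; once both the flag update and the cancel run on the strand, a plain bool would be sufficient):

asio::post(acceptor_.get_executor(), [this]()
{
  accepting_ = false;  // only touched on the strand from now on
  acceptor_.cancel();  // safe: no accept handler runs concurrently with this lambda
});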

But I don't see how this would prevent locking in case of multiple connection attempts from clients (in the sample code there's only one connection) happening simultaneously with the close invocation from within the strand

Sorry, but I failed to understand where the locking you mentioned can happen. Feel free to check ma_echo_server for an example of how stopping the server (including closing the acceptor) could be implemented.

Regarding documentation, I believe this is mentioned somewhere in general, but it would be better to have an explicit remark for the cancel method of the acceptor, like the one for the cancel method of timers. The same needs to be done for the cancel method of sockets.

cancel needs to be race-free

The cancel method is race-free. In the case of the issue reported here, it is the user code that has the race condition. It's the nature of asynchrony and Asio - these events are distinct and happen in causal order (but not at the same time):

  1. start (initiation) of asynchronous operation
  2. completion of asynchronous operation (with queuing of completion handler for execution)
  3. invocation of asynchronous operation completion handler (callback or future)

If a socket (acceptor), timer or any other Asio class instance (the "I/O object" concept) is modified (by calling its cancel or close methods) between event 2 and event 3, that won't affect the result passed into the completion handler (event 3). If this case needs to be tracked, it is considered the responsibility of user code (flags / a state machine, like the server::stopped_ flag in my modification of the originally reported code).

stromnet commented 3 years ago

Hi,

I have a question/inquiry that is very similar, and since this issue is still open I'm posting here instead of creating a new one (or asking on SO or similar).

Given this connect method (running from within a handler on the strand):

timer_.expires_after(5s);
timer_.async_wait([this](auto ec) {
    if (ec) return;
    log("connection failed");
    socket_.cancel();
});

socket_.async_connect(endpoint, [this, endpoint](auto ec) {
    if (ec) {
        if(ec != asio::error::operation_aborted) {
            log("connection failed {}", ec.what());
        }
        timer_.cancel();
        //... unrelated cleanup & reconnect stuff
        return;
    }

    log("connected");
    // ... setup another timer for read timeouts & do async_read_some etc...
});

In most situations, one of these will happen:

  1. Timer expires, timer handler is enqueued
  2. timer handler cancels the socket.
  3. socket handler is called with operation aborted.

or

  1. Socket succeeds, socket handler is enqueued
  2. socket handler will cancel the timer, and go on and setup a new timer & socket reader
  3. timer handler is called with operation aborted

What I am unsure of, though, is whether the strand provides any guarantees about IO completion vs handler execution interleaving. For example, can this still happen?

  1. timer expires, handler is enqueued
  2. socket succeeds/fails, socket handler is enqueued
  3. timer handler executes socket.cancel() (no-op, since no IO is pending, handler already enqueued?)
  4. socket handler executes and handles the real socket error/success.

or, worse:

  1. socket succeeds, socket handler is enqueued
  2. timer expires, timer handler is enqueued
  3. socket handler executes and sets up new async_read.
  4. timer handler executes socket.cancel(), cancelling the async_read.
  5. socket is now dead with no reader...

If the strand does not give any such guarantees, I just have to keep additional state variables to avoid this; but if it DOES protect against this, it would be nice to know, to avoid extra code.

Thanks! (And feel free to point me in another direction, but this seemed like a good context)

mabrarov commented 3 years ago

Hi @stromnet,

A strand (neither an explicit one - the asio::io_context::strand class - nor an implicit one - a single thread running the asio::io_context loop) doesn't guarantee the absence of "IO completion vs handler execution interleaving". A strand only guarantees that handlers are not executed concurrently. Execution of a handler happens after scheduling (queuing) of that handler, which happens after completion of the asynchronous operation: the asynchronous operation completes, then its handler is queued with the result of the operation (like std::error_code) bound to it, and only after that is the handler executed by one of the threads calling the poll* and run* methods of the asio::io_context class.

The case you described at the end of your question is a common one. There are a dozen (closed, because it's not a bug) reports / GitHub issues about this case.

I just have to keep additional state variables to avoid this

Exactly - see the sketch after the list below for your last scenario. The reason the asio::io_context::strand class doesn't provide this guarantee is simple - it's all about synchronization, which is:

  1. different for different cases
  2. not zero cost (very expensive in general)
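
For that last scenario (the stale timer handler cancelling the new async_read), a sketch of the extra state (assuming a new bool member connecting_ and that, as in your snippet, all handlers run on the same strand; log and the other members are yours):

connecting_ = true;
timer_.expires_after(5s);
timer_.async_wait([this](auto ec) {
    if (ec || !connecting_) return;  // aborted, or the connect handler already ran
    log("connection timed out");
    socket_.cancel();                // can only hit the still-pending async_connect
});

socket_.async_connect(endpoint, [this, endpoint](auto ec) {
    connecting_ = false;             // a timer handler queued "too late" is now a no-op
    timer_.cancel();
    if (ec) {
        // ... cleanup & reconnect ...
        return;
    }
    log("connected");
    // ... only now arm the read-timeout timer and start async_read_some ...
});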

Welcome to the asynchronous world! :)

stromnet commented 3 years ago

Great, big thanks for a clear and very quick reply! This was exactly what I was assuming. Now I know for sure :)