Open · sergio-eld opened this issue 3 years ago
Hi.
It looks like the code you provided has a common error, one that is usually seen when timers are used. It's a sort of race condition. Here is what may happen (it may well be the root cause, but I haven't checked yet whether it's the only one):

1. The `acceptor::async_accept` operation completed successfully internally - inside the OS (i.e. in Windows IOCP) - and the IOCP packet about successful completion of the asynchronous accept operation was queued within the Windows IOCP queue.
2. The `res.get();` line completed.
3. The `server::stop` method was called, but the `acceptor::cancel` method didn't cancel any operation, because the asynchronous operation was already completed in terms of the OS.
4. The completion handler of `acceptor::async_accept` is called with no error passed into it, so the `server::acceptLoop()` method is called one more time, causing a new asynchronous operation to start.
5. The `t1` and `t2` threads wait for completion of that new asynchronous operation.

Unfortunately, the `acceptor::cancel` method doesn't return the number of cancelled operations (like it's implemented for timers) or any other value indicating that some operation was cancelled. I believe this is caused by the lack of support for this information in the underlying API (like Winsock `CancelIoEx`). If the `acceptor::cancel` method could somehow signal whether any operation was actually cancelled, that could help in your case.
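For comparison, here is a minimal sketch (assuming standalone Asio; the program is illustrative, not from this thread) of the signal the timer API does give and the acceptor API does not:

```cpp
#include <asio.hpp>
#include <chrono>
#include <iostream>

int main()
{
    asio::io_context io;
    asio::steady_timer timer(io, std::chrono::seconds(5));

    timer.async_wait([](std::error_code ec)
    {
        std::cout << (ec == asio::error::operation_aborted ? "aborted\n" : "expired\n");
    });

    // basic_waitable_timer::cancel() reports how many pending waits were cancelled,
    // so the caller can tell whether it was "in time". acceptor::cancel() returns void,
    // so there is no equivalent signal for a pending async_accept.
    std::size_t cancelled = timer.cancel();
    std::cout << "cancelled " << cancelled << " wait(s)\n";

    io.run();
    return 0;
}
```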
Could you please try this `server` class implementation in your tests?
```cpp
class server
{
public:
    template <typename Executor>
    explicit server(Executor executor)
        : acceptor_(executor, {asio::ip::make_address_v4("127.0.0.1"), port}),
          stopped_(false)
    {
        acceptor_.set_option(tcp::acceptor::reuse_address(true));
    }

    void startAccepting()
    {
        acceptor_.listen();
        acceptLoop();
    }

    void stop()
    {
        // Hop onto the acceptor's executor so cancel() never races with acceptLoop().
        asio::post(acceptor_.get_executor(), [this]()
        {
            std::cout << "Stopping server\n";
            acceptor_.cancel();
            stopped_ = true;
            std::cout << "Server stopped\n";
        });
    }

private:
    void acceptLoop()
    {
        acceptor_.async_accept([this](error_code errorCode, tcp::socket peer)
        {
            if (!errorCode)
            {
                std::stringstream ss;
                ss << "Accept: peer " << peer.remote_endpoint() << '\n';
                std::cout << ss.str();
                if (stopped_)
                {
                    // The accept completed before cancel() took effect.
                    std::cout << "Too late - the server was stopped\n";
                }
                else
                {
                    acceptLoop();
                }
            }
            else
            {
                std::stringstream ss;
                ss << "Accept: error " << errorCode.value() << '\n';
                std::cout << ss.str();
            }
        });
    }

    tcp::acceptor acceptor_;
    bool stopped_;
};
```
I used the same class in my acceptor_cancel_test/src/main.cpp and was able to reproduce the following cases (the first one is described in my previous comment) on Windows 10 1703 x64 + Visual C++ 2019 + Boost 1.75.0:
```
New thread: 8924
New thread: 1200
Stopping server
Server stopped
Accept: peer 127.0.0.1:51577
Too late - the server was stopped
Stopping thread: 8924
Stopping thread: 1200
Everything shutdown
```

and

```
New thread: 2168
Accept: peer 127.0.0.1:51597
Stopping server
Server stopped
Accept: error 995
Stopping thread: 2168
New thread: 9532
Stopping thread: 9532
Everything shutdown
```
Refer to the remark in https://www.boost.org/doc/libs/1_75_0/doc/html/boost_asio/reference/basic_waitable_timer/cancel/overload1.html:

> Remarks
>
> If the timer has already expired when `cancel()` is called, then the handlers for asynchronous wait operations will:
>
> - have already been invoked; or
> - have been queued for invocation in the near future.
>
> These handlers can no longer be cancelled, and therefore are passed an error code that indicates the successful completion of the wait operation.

It seems that the same happens in your case with the `acceptor::cancel` method.
My workaround was exactly that: using a flag to break the asynchronous loop. The only difference is that I used an `atomic_bool` and spawned a new thread to defer the `cancel` invocation.

```cpp
accepting_ = false;
std::thread{[this](){acceptor_.cancel();}}.detach();
```
I've run your version 5000 times in a row with the sampled code and haven't hit a deadlock. But I don't see how this would prevent a deadlock in case of multiple connection attempts from clients (in the sampled code there's only one connection) happening simultaneously with the `close` invocation from within the strand. This is yet to be tested.
Anyhow, this is still an issue. The documentation states that `cancel` stops outstanding operations and does not mention anything about a data race. So either:

a) the documentation needs to be updated, or
b) the documentation describes the intended behaviour, and `cancel` needs to be race-free.
Hi,

> `std::thread{[this](){acceptor_.cancel();}}.detach();`
You need to access `acceptor_` from a single instance of strand, because in your case you have multiple threads working with the same acceptor instance - the main thread, `t1` and `t2` - and usage of a strand guarantees lack of concurrent access to `acceptor_` (the `asio::ip::tcp::acceptor` class is not thread safe).
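For illustration, a minimal sketch of that idea, assuming standalone Asio; the class and member names (`strand_guarded_acceptor`, `request_stop`, `stop_requested_`) are illustrative, not taken from the code under discussion:

```cpp
#include <asio.hpp>

// Sketch: every access to the acceptor is funnelled through one strand,
// so cancel() can never run concurrently with async_accept.
class strand_guarded_acceptor
{
public:
    strand_guarded_acceptor(asio::io_context& io, unsigned short port)
        : strand_(asio::make_strand(io)),
          acceptor_(strand_, {asio::ip::make_address_v4("127.0.0.1"), port})
    {
    }

    void start()
    {
        // Completion handlers run on strand_ because the acceptor was constructed with it.
        acceptor_.async_accept([this](std::error_code ec, asio::ip::tcp::socket /*peer dropped*/)
        {
            if (!ec && !stop_requested_)
            {
                start(); // keep accepting
            }
        });
    }

    void request_stop()
    {
        // Hop onto the strand instead of touching the acceptor from another thread
        // (or from a detached std::thread).
        asio::post(strand_, [this]
        {
            stop_requested_ = true; // a plain bool is enough: only touched on the strand
            acceptor_.cancel();     // cannot race with async_accept now
        });
    }

private:
    asio::strand<asio::io_context::executor_type> strand_;
    asio::ip::tcp::acceptor acceptor_;
    bool stop_requested_ = false;
};
```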
> But I don't see how this would prevent locking in case of multiple connection attempts from clients (in the sampled code there's only one connection) happening simultaneously with the close invocation from within the strand
Sorry, but I failed to understand where the locking you mentioned can happen. Feel free to check ma_echo_server for an example of how the stop of the server (including closing of the acceptor) could be implemented.
Regarding documentation, I believe that this is mentioned somewhere in general, but it would be better to have an explicit remark for the `cancel` method of the acceptor, like it's done for the `cancel` method of timers. The same needs to be done for the `cancel` method of sockets.
> `cancel` needs to be race-free
The `cancel` method is race-free. In the case of the issue reported here, it's the user code that has a race condition. It's the nature of asynchrony and Asio - these events are different and happen in causal order (but not at the same time):

1. The asynchronous operation completes.
2. The completion handler is scheduled (queued) with the result of the operation bound to it.
3. The completion handler is executed by one of the threads running the `asio::io_context` loop.

If a socket (acceptor), timer, or any other Asio class instance (the "I/O object" concept) is modified (by calling its `cancel` or `close` methods) between event 2 and event 3, then that won't impact the result passed into the completion handler (event 3). If this case needs to be tracked, it is considered the responsibility of user code (flags / a state machine, like the `server::stopped_` flag in my modification of the originally reported code).
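As a minimal illustration of that last point, a sketch assuming standalone Asio (the `stopped` flag plays the same role as `server::stopped_` above):

```cpp
#include <asio.hpp>
#include <chrono>
#include <iostream>

int main()
{
    asio::io_context io;
    asio::steady_timer timer(io, std::chrono::milliseconds(1));
    bool stopped = false; // user-side state: the only way to notice a "late" cancel

    timer.async_wait([&](std::error_code ec)
    {
        // ec was bound to the handler when the operation completed (events 1 and 2);
        // a later cancel() cannot rewrite it, so the flag is consulted as well.
        if (ec == asio::error::operation_aborted)
            std::cout << "cancelled in time\n";
        else if (stopped)
            std::cout << "completed successfully, but we had already stopped\n";
        else
            std::cout << "completed successfully\n";
    });

    io.run(); // the wait completes, its handler is queued and then executed

    // Too late: the handler already ran with a success code; this cancel() is a no-op.
    stopped = true;
    timer.cancel();
    return 0;
}
```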
Hi,

I have a question/inquiry that is very similar, and since this one is still open I'm posting here instead of creating a new issue (or asking on SO or similar).
Given this:
With this connect method (running from within a handler on the strand):
```cpp
timer_.expires_after(5s);
timer_.async_wait([this](auto ec) {
    if (ec) return;
    log("connection failed");
    socket_.cancel();
});

socket_.async_connect(endpoint, [this, endpoint](auto ec) {
    if (ec) {
        if (ec != asio::error::operation_aborted) {
            log("connection failed {}", ec.what());
        }
        timer_.cancel();
        // ... unrelated cleanup & reconnect stuff
        return;
    }
    log("connected");
    // ... setup another timer for read timeouts & do async_read_some etc...
});
```
In most situations, one of these will happen:
or
What I am unsure of, though, is whether the strand provides any guarantees about IO completion vs handler execution interleaving. For example, can this still happen?
or, worse:
If the strand does not give any such guarantees, I just have to keep additional state variables to avoid this; but if it DOES protect against this, it would be nice to know, to avoid extra code.
Thanks! (And feel free to point me in another direction, but this seemed like a good context)
Hi @stromnet,
A strand (neither an explicit one, using the `asio::io_context::strand` class, nor an implicit one, i.e. a single thread running the `asio::io_context` loop) doesn't guarantee lack of "IO completion vs handler execution interleaving". A strand only guarantees lack of concurrent execution of handlers, and execution of a handler happens after scheduling (queuing) of the handler, which happens after completion of the asynchronous operation (i.e. the asynchronous operation completes, then the handler is queued with the result of the operation - like a `std::error_code` - bound to it, and only after that is the handler executed by one of the threads calling the `poll*` and `run*` methods of the `asio::io_context` class).
The case you described at the end of your question is a common one. There are a dozen (closed, because it's not a bug) reports / GitHub issues about this case.
> I just have to keep additional state variables to avoid this
Exactly. The reason the `asio::io_context::strand` class doesn't provide this guarantee is simple: it's all about synchronization.

Welcome to the asynchronous world! :)
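If it helps, here is a sketch of one way to keep such state for the connect-with-timeout pattern above, assuming standalone Asio (the `connector` class, its `state` enum and the use of a strand are illustrative, not taken from the question):

```cpp
#include <asio.hpp>
#include <chrono>
#include <iostream>

// One extra state variable disambiguates "timer fired first" from "connect
// finished first", regardless of how the two handlers interleave on the strand.
class connector
{
    enum class state { connecting, connected, timed_out };

public:
    explicit connector(asio::io_context& io)
        : strand_(asio::make_strand(io)), socket_(strand_), timer_(strand_) {}

    void connect(const asio::ip::tcp::endpoint& endpoint)
    {
        state_ = state::connecting;

        timer_.expires_after(std::chrono::seconds(5));
        timer_.async_wait([this](std::error_code ec)
        {
            if (ec || state_ != state::connecting)
                return;               // connect already finished; ignore the timeout
            state_ = state::timed_out;
            socket_.cancel();         // safe: this handler runs on the same strand
        });

        socket_.async_connect(endpoint, [this](std::error_code ec)
        {
            if (state_ == state::timed_out)
            {
                std::cout << "connection timed out\n"; // even if ec happens to be "success"
                return;
            }
            timer_.cancel();
            if (ec)
            {
                std::cout << "connection failed: " << ec.message() << "\n";
                return;
            }
            state_ = state::connected;
            std::cout << "connected\n";
        });
    }

private:
    asio::strand<asio::io_context::executor_type> strand_;
    asio::ip::tcp::socket socket_;
    asio::steady_timer timer_;
    state state_ = state::connecting;
};
```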
Great, big thanks for a clear and very quick reply! This was exactly what I was assuming. Now I know for sure :)
Windows 10, MinGW64 GCC 10.2.0

There is no way to safely stop `async_accept` with `acceptor::cancel`. A deadlock occurs whether `cancel` is called directly or explicitly posted with `asio::post(acceptor.get_executor(), [this](){acceptor.cancel();})`.

Sample code for testing:
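For context, a rough and purely hypothetical sketch of the kind of test harness being described (assuming standalone Asio, the `server` class shown earlier in this thread, and the `res`, `t1`, `t2` names used above; `port` is an arbitrary test value):

```cpp
#include <asio.hpp>
#include <future>
#include <iostream>
#include <thread>

constexpr unsigned short port = 43210; // hypothetical test port

// ... the `server` class posted earlier in this thread goes here ...

int main()
{
    asio::io_context io;
    auto work = asio::make_work_guard(io);

    server srv(asio::make_strand(io));
    srv.startAccepting();

    // Two worker threads drive the io_context, like t1 / t2 above.
    std::thread t1([&] { std::cout << "New thread\n"; io.run(); });
    std::thread t2([&] { std::cout << "New thread\n"; io.run(); });

    // A single client connection, signalled through a promise/future pair (res.get()).
    std::promise<void> connected;
    auto res = connected.get_future();
    asio::ip::tcp::socket client(io);
    client.async_connect({asio::ip::make_address_v4("127.0.0.1"), port},
                         [&](std::error_code) { connected.set_value(); });
    res.get();

    // Stop the server; the race discussed above is whether a new async_accept has
    // already been started by the time cancel() runs on the acceptor's executor.
    srv.stop();

    work.reset();
    t1.join();
    t2.join();
    std::cout << "Everything shutdown\n";
    return 0;
}
```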
I have opened a question on Stack Overflow: https://stackoverflow.com/questions/66953464/asio-getting-deadlock-when-using-several-strands-and-threads-with-one-io-contex