chriskohlhoff / asio

Asio C++ Library
http://think-async.com/Asio

Accepting sockets asynchronously apparently adversely affects future synchronous IO performance on some platforms #240

Open · acmorrow opened this issue 7 years ago

acmorrow commented 7 years ago

I'm filing this issue perhaps prematurely, because I do not yet have a reduced reproduction. I plan to work on one, but I wanted to put this up first in case this was a known issue or is somehow expected behavior.

The setup: The system was originally written such that there was a TCP/IPv4 accept socket, and new connections were accepted by a dedicated thread (the "listen" thread), that would do blocking calls to acceptor::accept. Once a connection was accepted, the socket for the new connection was handed to a new dedicated thread for that connection, which proceeded to do blocking network IO on the socket, and the listen thread would return to its blocking call to accept.
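
In sketch form, the original model was roughly the following (error handling elided; `handle_connection` stands in for our per-connection logic):

```cpp
#include <asio.hpp>
#include <thread>
#include <utility>

// Stand-in for the real per-connection logic, which does blocking IO.
void handle_connection(asio::ip::tcp::socket sock);

void listen_thread(asio::io_service& io, unsigned short port) {
    asio::ip::tcp::acceptor acceptor(
        io, asio::ip::tcp::endpoint(asio::ip::tcp::v4(), port));
    for (;;) {
        asio::ip::tcp::socket sock(io);
        acceptor.accept(sock);  // blocks until a client connects
        // Hand the connected socket to a dedicated thread that does
        // blocking reads/writes for the lifetime of the connection.
        std::thread(handle_connection, std::move(sock)).detach();
    }
}
```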

There was, however, a problem with this model, since there is no portable way to unblock a blocking call to accept. Calling close from another thread is famously unsafe, and calling shutdown has unspecified semantics.

To address this, the accept path was re-written to have a dedicated thread doing non-blocking accepts via an io_service object and calls to acceptor::async_accept. After connections were accepted, they were handed off to the same backend, with a new thread for the connection doing blocking network IO.
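
Roughly, the new accept path looks like this (again a sketch; the real hand-off and shutdown logic is omitted):

```cpp
#include <asio.hpp>
#include <memory>
#include <thread>
#include <utility>

void handle_connection(asio::ip::tcp::socket sock);  // blocking IO, as before

void start_accept(asio::ip::tcp::acceptor& acceptor) {
    auto sock = std::make_shared<asio::ip::tcp::socket>(
        acceptor.get_io_service());
    acceptor.async_accept(*sock,
        [&acceptor, sock](const asio::error_code& ec) {
            if (!ec) {
                // Same hand-off as before: a dedicated blocking-IO thread.
                std::thread(handle_connection, std::move(*sock)).detach();
            }
            start_accept(acceptor);  // rearm the accept
        });
}

// The listen thread now just runs the io_service. Posting a call to
// acceptor.cancel() lets us unblock it safely, which the blocking
// accept could not do portably.
```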

This fixed the issue with unblocking from accept, since we could now cancel the call to async_accept safely. However, our CI system flagged the commit as causing a significant performance regression in some of our benchmarks; in some cases, as high as 30%.

This was somewhat shocking, because the benchmark in question simply opens a fixed number of connections, waits for them all to be accepted, and then drives load over the already accepted connections. It does not create new connections during the test, and measures throughput only during the portion of the test during which all connections are active.

The upshot is that it seems that the mechanism of accepting a socket affects the future performance of the accepted socket. A socket accepted via the non-blocking acceptor::async_accept performs much worse than a socket accepted via acceptor::accept.

We are using ASIO 230c0d2ae035c5ce1292233fcab03cea0d341264. The system where this is being tested is a CentOS 6 Linux instance.

Note also that I've had no luck reproducing these results on my local workstation, running Ubuntu 16.04, but the change in behavior is very observable on our performance benchmark system. Switching the accept mechanism back to a blocking call to acceptor::accept definitively restores performance on the benchmarking hardware.

I realize this is not the most actionable bug report, and I'm hoping to provide more detail in the future, but I would be interested to hear any thoughts. This issue is potentially blocking the rollout of a new ASIO based ingress networking layer in MongoDB.

arun11299 commented 7 years ago

I will just make a few guesses, in case one hits somewhere close to home.

acmorrow commented 7 years ago

> async APIs would make use of dynamic memory allocation. Maybe you would want to try to implement custom memory allocation using asio_handler_allocate.

Well, we are using gperftools, so allocations are implicitly per-thread. And remember, only the accept part of the system is varying here, so I don't see this as likely.
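
(For context, the hook suggested above is ASIO's ADL-based allocation customization point. A trimmed sketch of the idea, following the pattern of ASIO's allocation example; `handler_allocator` and `custom_alloc_handler` are illustrative names, not library types:)

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// A small reusable buffer for one outstanding operation; heap fallback.
class handler_allocator {
public:
    void* allocate(std::size_t size) {
        if (!in_use_ && size <= sizeof(storage_)) {
            in_use_ = true;
            return &storage_;
        }
        return ::operator new(size);
    }
    void deallocate(void* pointer) {
        if (pointer == &storage_)
            in_use_ = false;
        else
            ::operator delete(pointer);
    }

private:
    typename std::aligned_storage<1024>::type storage_;
    bool in_use_ = false;
};

// Wraps a handler; ASIO locates these hooks via ADL and uses them for
// the temporary storage it needs while the operation is outstanding.
template <typename Handler>
class custom_alloc_handler {
public:
    custom_alloc_handler(handler_allocator& a, Handler h)
        : allocator_(a), handler_(std::move(h)) {}

    template <typename... Args>
    void operator()(Args&&... args) {
        handler_(std::forward<Args>(args)...);
    }

    friend void* asio_handler_allocate(std::size_t size,
                                       custom_alloc_handler* h) {
        return h->allocator_.allocate(size);
    }
    friend void asio_handler_deallocate(void* pointer, std::size_t,
                                        custom_alloc_handler* h) {
        h->allocator_.deallocate(pointer);
    }

private:
    handler_allocator& allocator_;
    Handler handler_;
};
```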

> Is your io_service shared between threads? You might not want to do that.

Yes, unavoidably so. You can't move ASIO sockets between io_service objects, and in both modes the socket is accepted on one thread but then handed off to another thread to communicate with the client. So implicitly, the io_service to which the socket is bound is shared between those threads. This is also the architecture that we want, since our longer-term goal is to multiplex threads over connections via the io_service and a non-blocking IO layer.

> Related to the previous point, use of io_service::poll / io_service::poll_one would be better than calling its run function.

That would be a busy wait? I'm not sure why that would be better.

> Any issues in your performance system with thread pinning to a CPU core? And also, how many connections per second are we talking about here?

The test was run with 1, 2, 4, and 8 connected clients. In all cases, the client threads were bound to one socket with numactl --cpunodebind=0, and the server threads were bound to the other socket with numactl --cpunodebind=1.

In terms of connections/s, the rate of new connection establishment during the test is 0. The test creates 1, 2, 4, or 8 initial connections, and then drives load over them for the duration of the test. During that time, no new connections are accepted. This is why this performance regression is so puzzling: we aren't even executing the code path that varies, yet we observe it to have a significant effect.

arun11299 commented 7 years ago

> The test creates 1, 2, 4, or 8 initial connections, and then drives load over them for the duration of the test. During that time, no new connections are accepted.

Ah... that is really puzzling. perf et al. didn't show anything different?

acmorrow commented 7 years ago

My original hope had been to repro the performance delta on my local dev workstation, and then use perf as you suggest to dig into what was different. Unfortunately, it doesn't seem to repro on my workstation.

jbreams commented 7 years ago

I think I've figured out what's going on here. The perf regression occurs when there are more client threads than available cores, and I noticed that sockets accepted by async_accept get registered with epoll for EPOLLIN even if no async operations are ever performed on them. Because of this, the "listener thread" which is calling async_accept() in a loop wakes up every time data is received on those sockets, even though the recv() calls are all being done synchronously. Making the registration with epoll happen on the first call to epoll_reactor::start_op() recovered all our lost performance.
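
To make the mechanism concrete, here is the shape of the problem in bare syscalls (flags simplified relative to what the reactor actually passes; this is an illustration, not ASIO's code):

```cpp
#include <sys/epoll.h>

// Eager registration, done at accept time: once the accepted descriptor
// is in the epoll set, data arriving on it can wake whichever thread is
// blocked in epoll_wait() (here, the listener thread), even though every
// byte is consumed by blocking recv() calls on other threads.
void register_on_accept(int epoll_fd, int conn_fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = conn_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, conn_fd, &ev);
}

// The change proposed in #242 defers this EPOLL_CTL_ADD until the first
// asynchronous operation on the descriptor (epoll_reactor::start_op), so
// synchronous-only sockets never enter the set.
```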

acmorrow commented 7 years ago

@chriskohlhoff - If you could please give this a look I'd appreciate it. It is currently a blocker for our rolling out an ASIO based component in our networking stack.

chriskohlhoff commented 7 years ago

Please try using a separate io_context for your synchronous-only sockets (i.e. just the one context shared between all such sockets), and never call run() on it. You can specify an alternate context when accepting the connection: acc.async_accept(other_ctx, ...)
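
In sketch form, with illustrative names and an arbitrary port, and assuming the overload of async_accept that moves the newly accepted socket into the handler:

```cpp
#include <asio.hpp>
#include <thread>
#include <utility>

void handle_connection(asio::ip::tcp::socket sock);  // blocking IO

void accept_loop(asio::ip::tcp::acceptor& acceptor,
                 asio::io_context& sync_ctx) {
    // Accept onto sync_ctx: the new socket is bound to that context, so
    // its reactor registration is invisible to the context the listener
    // thread is running.
    acceptor.async_accept(
        sync_ctx,
        [&acceptor, &sync_ctx](const asio::error_code& ec,
                               asio::ip::tcp::socket sock) {
            if (!ec)
                std::thread(handle_connection, std::move(sock)).detach();
            accept_loop(acceptor, sync_ctx);  // rearm
        });
}

int main() {
    asio::io_context accept_ctx;  // run by the listener thread
    asio::io_context sync_ctx;    // shared by all synchronous-only
                                  // sockets; run() is never called on it
    asio::ip::tcp::acceptor acceptor(
        accept_ctx, asio::ip::tcp::endpoint(asio::ip::tcp::v4(), 9000));
    accept_loop(acceptor, sync_ctx);
    accept_ctx.run();
}
```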

jbreams commented 7 years ago

Using a separate io_context solved our performance problem, but the fix feels like it's covering up a bug, or at least inconsistent behavior. All of the other reactors lazily register their sockets and don't have this problem. At the very least, there should be a big warning on the async_accept method that accepted sockets will be registered with epoll immediately, and that performance will suffer dramatically if you only do synchronous operations on them. Accepting onto another io_context and never calling run() on it also seems a bit hacky: why should sockets be registered with epoll if we never intend to run epoll_wait()?

I've still got a pull request open (https://github.com/chriskohlhoff/asio/pull/242) to make the behavior of epoll consistent with the other reactors, and I'd still like to get it merged.

Also, the async_accept methods that take an io_context as an argument are not documented outside of the headers - I think maybe the headers and the documentation on the website are just out of sync?

acmorrow commented 7 years ago

@chriskohlhoff - Thanks for getting back to us with that suggestion. We have implemented it, and while it has fixed our performance regression, which is much appreciated, I agree wholeheartedly with @jbreams's request to merge his fix. Requiring a separate io_context to avoid this behavior is very unfortunate.

First, it violates the principle of least surprise - what we had been doing is certainly a legitimate use of the API, and it cost us considerable time and effort to debug and understand the severe performance degradation.

Second, since it is not possible to portably interrupt a blocking accept, I expect that it will be extremely common in practice to use an asynchronous model for accepting new connections, but there are many valid scenarios for using a thread-per-connection or thread-per-request model once the connection has been accepted. It should not be the case that by using an async accept, the generated sockets are "damaged" in a performance sense when they are intended to be used in a blocking fashion.

Finally, I agree with Jonathan's observation that if the other reactor forms do lazy association with the underlying OS primitives, then the epoll reactor should as well.

So I would like to push for getting #242 merged - it really seems like the right thing to do, it will remove a subtle performance pitfall, it makes the behavior more consistent across the various reactor implementations, and it makes the obvious solution the right solution.

acmorrow commented 7 years ago

Hi @chriskohlhoff - Any additional thoughts on this? We really struggled with this issue, and while we've implemented the workaround you suggested, we feel strongly that our proposed resolution is superior.