SocketAsyncEngine.Unix perf experiments

tmds commented 4 years ago

I'm looking into ways that may improve SocketAsyncEngine.Unix performance.

Based on kestrel-linux-transport implementation, this is what I'm thinking of:

Batch receives and sends on the epoll thread using Linux AIO. Note that this isn't the POSIX AIO library. The functions called here map directly to syscalls.

and

class SocketAsyncEventArgs
{
  public bool RunContinuationsAsynchronously { get; set; } = true;
  public bool PreferSynchronousCompletion { get; set; } = true;
}

Setting RunContinuationsAsynchronously to false allows SocketAsyncEngine to invoke callbacks directly from the epoll thread. This avoid context switching cost.
Setting PreferSynchronousCompletion to false allows SocketAsyncEngine to not try a synchronous attempt and go async immediately. An implementation could set this to true for sends, and false for receives. Or for receives it could be changed based on whether the last receive filled up the buffer completely. Going async immediately saves an attempt syscall and allows batching.

The defaults match the current behavior.

io_uring is also an interesting option to explore. I'm not looking into that atm because the features we need are in 5.5 kernel which was released only a couple of days ago.

I'm going to do a poc in a separate repo for benchmarking. Anyone interested is free to review PRs, or fork and do some experimentation. When I have a working implementation, I'll need some help with benchmarking and see what is the change in perf.

cc @stephentoub @geoffkizer @davidfowl @halter73 @adamsitnik @benaadams @VSadov @damageboy @lpereira @dotnet/ncl

stephentoub commented 4 years ago

Thanks, Tom.

because the features we need are in 5.5 kernel which was released only a couple of days ago

Realistically, what does that mean in terms of when a) a typical developer/app would have the features available, and b) a motivated developer/app would have the features available? If io_uring made a substantial difference, seems like something we could do proactively and then have folks that really wanted the boost patch their configurations?

lpereira commented 4 years ago

@tmds Have you seen the https://github.com/tkp1n/IoUring project? It's a transport similar to k-l-t, but using io_uring. It's being actively developed by a single person, but it's worth running a round of benchmarks with it in the Citrine machine just to have a feel of how much io_uring would make things better. I'll talk with @sebastienros today to arrange this.

lpereira commented 4 years ago

Thanks, Tom.

because the features we need are in 5.5 kernel which was released only a couple of days ago

Realistically, what does that mean in terms of when a) a typical developer/app would have the features available, and b) a motivated developer/app would have the features available? If io_uring made a substantial difference, seems like something we could do proactively and then have folks that really wanted the boost patch their configurations?

I'm very positive it'll make a substantial difference. And, you're right, using io_uring will require some changes that users will need to know before they can enjoy the difference. Mainly, a new-ish kernel (5.5+, potentially 5.6+ when it releases), and some changes in the security limits (mainly the amount of pages a user process can lock).

We can also substitute (well, add and detect if it's available at runtime) the usage of epoll in pal_networking.c and use io_uring as an epoll substitute. This should significantly reduce the syscall chatter there. (Other uses of poll(), like synchmgr and pal_io could move to io_uring, too. We might consider having a single ring for all these file descriptors that are watched and have these implementations use that instead of calling poll()/epoll() directly, too, hiding all of these platform-specific stuff in a single place. (With time, other things could be changed to use io_uring: there are plenty of other operations that it can perform, including accepting sockets, connecting to a remote host, opening files, all asynchronously.)

wfurt commented 4 years ago

cc: @scalablecory as he was experimenting with io_uring as well.

benaadams commented 4 years ago

because the features we need are in 5.5 kernel which was released only a couple of days ago

Realistically, what does that mean in terms of when a) a typical developer/app would have the features available, and b) a motivated developer/app would have the features available?

Don't know about other distros, but Ubuntu 20.04 will be the next LTS, and is due to be released on April 23, 2020, it will have the 5.5 or later kernel and is currently in their daily builds.

tmds commented 4 years ago

Don't know about other distros, but Ubuntu 20.04 will be the next LTS, and is due to be released on April 23, 2020, it will have the 5.5 or later kernel and is currently in their daily builds.

Probably not everyone is changing to a new LTS release when it comes out, but many keep on using the one they are at.

I think the Linux AIO option is worth exploring because it will reduce syscalls significantly by batching on the event loop thread, just like io_uring, but doesn't require a recent kernel.

io_uring is worth exploring also.

@tmds Have you seen the https://github.com/tkp1n/IoUring project? It's a transport similar to k-l-t, but using io_uring. It's being actively developed by a single person, but it's worth running a round of benchmarks with it in the Citrine machine just to have a feel of how much io_uring would make things better. I'll talk with @sebastienros today to arrange this.

Yes, I'm following repo (which takes some good bits from k-l-t). And also interested to see benchmark results.

We can also substitute (well, add and detect if it's available at runtime) the usage of epoll in pal_networking.c and use io_uring as an epoll substitute. This should significantly reduce the syscall chatter there.

The syscall chatter for epoll in pal_networking.c is actually surprisingly low because EPOLLET is used. (In retrospect I should have done the same thing in k-l-t.) So advantageous of io_uring need to come from running operations on it.

benaadams commented 4 years ago

Probably not everyone is changing to a new LTS release when it comes out, but many keep on using the one they are at.

Yes, but the LTS release is the low bar where it becomes easier to adopt (other than the previous LTS moving out of support)

If LTS OS is out in April with io_uring and .NET 5 is out in Nov, I'd hope it would use io_uring; rather than waiting for the next .NET release a year later.

tmds commented 4 years ago

Yes, but the LTS release is the low bar where it becomes easier to adopt (other than the previous LTS moving out of support)

Ah, yes, if you want it, it's available in an LTS release.

If LTS OS is out in April with io_uring and .NET 5 is out in Nov, I'd hope it would use io_uring; rather than waiting for the next .NET release a year later.

Are you assuming io_uring will perform significantly better than Linux AIO for batching?

tmds commented 4 years ago

Are you assuming io_uring will perform significantly better than Linux AIO for batching?

I'm assuming it will be slightly better, but maybe not even measurable. We need to measure to know.

Any feedback on the proposed properties? These apply also for io_uring backend. PreferSynchronousCompletion needs to be false for maximum batching.

lpereira commented 4 years ago

To be honest with you, Linux AIO is a disaster, for all sorts of different reasons. I wouldn't even consider supporting it in the first place if there's an alternative like io_uring.

benaadams commented 4 years ago

I'd hope it would perform better in general as it was meant to overcome some of the issues with AIO https://kernel.dk/io_uring.pdf; and could also be used for file IO etc.

However; as you say just because io_uring would become more generally available, AIO would have wider support and the linux transport does perform better than Net.Sockets. So if there is the resource to do both...

Any feedback on the proposed properties?

Windows doesn't queue completions to the ThreadPool (which is why Kestrel has the additional ThreadPool queue for Windows but not on Linux). Is this because multiple threads can listen to the completion, while on Linux only one thread can listen so it needs to be more available to listen, whereas on Windows another thread can pick up the next events if one gets blocked?

tmds commented 4 years ago

Windows doesn't queue completions to the ThreadPool (which is why Kestrel has the additional ThreadPool queue for Windows but not on Linux). Is this because multiple threads can listen to the completion, while on Linux only one thread can listen so it needs to be more available to listen, whereas on Windows another thread can pick up the next events if one gets blocked?

epoll_wait can be called from multiple threads (events go in the argument). io_uring can be called only from one thread (events are written to completion queue).

tmds commented 4 years ago

io_uring can be called only from one thread (events are written to completion queue).

And because requests are in a shared memory request queue, all requests need to come from that thread too. This is different from epoll, epoll_ctl can be called from any thread.

tmds commented 4 years ago

Is this because multiple threads can listen to the completion,

On Windows, what happens if multiple completions are blocking these threads? Are additional threads created?

benaadams commented 4 years ago

On Windows, what happens if multiple completions are blocking these threads? Are additional threads created?

I believe; but could be wrong, it essentially has 2 thread pools, one for IO and one that is the ThreadPool. All the IO threads listen to a single completion port, which is FIFO so the hottest thread is reactivated first; however it only gets one event at a time, rather than batching https://github.com/dotnet/runtime/issues/11314, which also helps if it blocks; but also increases syscalls

halter73 commented 4 years ago

I'm not the expert, but my understanding is the same as @benaadams that the IOCP thread pool grows as IOCP threads become blocked. This function certainly implies that it does:

https://github.com/dotnet/runtime/blob/3a457cb4b552d9b32fbf844389ad2a08bcb2a7a6/src/coreclr/src/vm/win32threadpool.cpp#L3781

sebastienros commented 4 years ago

uname -a
Linux asp-perf-lin 5.0.0-37-generic #40~18.04.1-Ubuntu

All numbers are the average of 3 runs. Not using the CITRINE machines (TE) as they are being moved.

Plaintext

| Description |       RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors | Ratio |
| ----------- | --------- | ------- | ----------- | ----------------- | ------------ | --------------- | ------------------- | ------------------ | ------------ | ------ | ----- |
|     Sockets | 3,806,103 |      99 |          48 |              1.02 |          521 |            2001 |                   0 |              54.67 |         0.51 |      0 |  1.00 |
|     IoUring | 4,919,463 |      92 |          49 |              3.32 |          519 |            2001 |                   0 |              61.25 |         0.42 |      0 |  1.29 |
|        RHTX | 4,960,781 |      91 |          51 |              2.86 |          529 |            2001 |                   0 |              57.76 |         0.33 |      0 |  1.30 |

Both IoUring and RHTX are limited by the client at that point. Using two machines got IoUring to 5.6M and RHTX to 5.9M both at 100% cpu.

Json

| Description |     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors | Ratio |
| ----------- | ------- | ------- | ----------- | ----------------- | ------------ | --------------- | ------------------- | ------------------ | ------------ | ------ | ----- |
|     Sockets | 415,085 |      99 |         144 |              0.99 |          520 |            2001 |                   0 |               62.9 |         0.59 |      0 |  1.00 |
|     IoUring | 636,819 |      99 |         146 |              1.06 |          534 |            4502 |                   0 |              66.35 |         0.32 |      0 |  1.53 |
|        RHTX | 632,451 |      99 |         147 |              1.07 |          531 |            2001 |                   0 |              67.86 |         0.38 |      0 |  1.52 |

Here again using two client machines got IoUring to 789K and RHTX to 786K.

@lpereira has some remarks to share about the version of the Kernel this is running on. I will update this machine to a more recent one and remeasure.

stephentoub commented 4 years ago

Any feedback on the proposed properties?

For prototyping purposes, do whatever helps you come to a proposal fastest.

But for production, I suspect neither of those will be the right things to add. There are several issues. First, most devs aren't going to be (and shouldn't be) targeting SAEA directly; they'll be awaiting {Value}Task-based methods, which might use SAEA under the covers but which wouldn't expose those knobs, and getting perf gains here shouldn't restrict you to using a much more complicated programming model. Second, I'm not sure there's any setting that would make sense across platforms. On Windows, we always do the equivalent of completing synchronously whenever possible and running continuations synchronously, and there's not really a good reason to want to choose something else. Exposing such properties would likely make things worse for libraries/apps written to be largely platform agnostic. It's possible we could come up with a meaningful design, but I'm a little skeptical.

My expectation is we'll land in a place where:

we improve perf by default, but maybe not as much as it could otherwise be if we weren't concerned about safety/reliability
for maximum throughput, likely at the expense of reliability in the face of blocking or unbalanced workloads, there is some knob that can be set, likely globally, that opts-in to the relevant behaviors

I'm happy to be proven wrong if we find a way to have our cake and eat it, too.

benaadams commented 4 years ago

Is there scope to have ThreadPool an instance type (using min/max IO threads as its values); so to create a second one to mimic Windows behaviour?

e.g. Have minimal poll threads (batch ops to reduce syscalls); then queue to an IO ThreadPool (low queuing contention, not contending with user threads, blocking in a callback on a thread doesn't block the whole batch)

tmds commented 4 years ago

Is there scope to have ThreadPool an instance type (using min/max IO threads as its values); so to create a second one to mimic Windows behaviour?

I don't plan to go this far. RunContinuationsAsynchronously=false, will make it possible to try this out.

adamsitnik commented 4 years ago

I am also working on improving the Sockets performance on Linux. My plan is to get the best of available and stable Linux APIs: epoll and AIO (I know that they suck in many ways, but we have to live with that) and then give io_uring a try.

Below you can see a work of a typical thread pool thread from JSON TechEmpower benchmark:

obraz

The actual JSON serialization is... 3% of the total time. 33% of time is spent for libc_sendmsg and libc_recvmsg.

Most of the TechEmpower benchmarks are simply writing and reading very small chunks of bytes to|from sockets. IMHO batching these reads and writes can give the best possible gain. This is why it's the next thing that I am going to implement. @tmds please let me know if your work on this is already advanced and I should not be duplicating the effort.

I also plan and study trace files from real-life workloads, which potentially could expose other issues. @kevingosse if I remember correctly you have run into some issues with socket perf at Criteo. Is there any chance that you could share some thoughts|stats|trace files?

stephentoub commented 4 years ago

The actual JSON serialization is... 3% of the total time.

Yes. The irrelevance of the JSON serialization in this benchmark is highlighted by the actual benchmark results. Number 33 at https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=json is ASP.NET Core using Tom's Kestrel transport that uses an affinitized epoll thread per core, at 94% RPS of the top result, and number 60 in the results is ASP.NET Core using the standard sockets transport, at 71% RPS of the top result... that's a gap entirely due to the transport.

lpereira commented 4 years ago

Batching writes can be beneficial, yes. One of the things that helped me push Lwan (current run shows it at #5 in the TWFB for the JSON benchmark; used to be #1 with a good margin a few years back) a little bit further was to avoid using vectored I/O to send the response. If the buffer containing the response headers has space to also contain the response, the response is copied that and it's then sent without a vectored write. IIRC I got ~10% increase in throughput with that.

I don't know, however, about batching reads on non-pipelined requests (and I wouldn't even care about pipelined requests because most clients don't). How would that work?

I agree with @stephentoub and @adamsitnik here, though. In most real-life scenarios, you won't be serializing JSON like that; you'll probably fetch stuff from a database, perform some calculation, read stuff from files and massage the data, etc.; so this benchmark is pretty much "do some very basic computation just so it's not writing a constant response while stress-testing the transport".

tmds commented 4 years ago

I am also working on improving the Sockets performance on Linux. My plan is to get the best of available and stable Linux APIs: epoll and AIO (I know that they suck in many ways, but we have to live with that) and then give io_uring a try.

@adamsitnik this is what I'm trying to do also, and I created this issue to have some awareness, and get some involvement from others. Should I stop spending time on it? Or do you want to work on this together?

stephentoub commented 4 years ago

Should I stop spending time on it? Or do you want to work on this together?

Please work on it together :smile:

benaadams commented 4 years ago

Are read and write serial events; or can they be made parallel (i.e. read from one socket and write to another at same time)? Or is that just how its showing up in the flamegraph (not being a timeline)?

tmds commented 4 years ago

Or is that just how its showing up in the flamegraph (not being a timeline)?

It's the flamegraph. It's sampled stacktraces, and the most called* child function is put to the left below its parent.

* according to the samples, not actual calls

kevingosse commented 4 years ago

@kevingosse if I remember correctly you have run into some issues with socket perf at Criteo. Is there any chance that you could share some thoughts|stats|trace files?

It was more of a scheduling issue. In an application that favors latency over throughput, once a request has started being processed, you want every available resource to be allocated to that request. This is something achieved nicely by the threadpool local queues (continuations for a given request are enqueued to the local queue of the thread. Assuming that you have no blocking or overly long tasks, it means that a thread that has started processing a request will focus in priority on the continuations of that same request instead of starting to process new ones). On Linux, the socket continuation is enqueued to the global queue, which means they have a lower priority than continuations and the same priority as new requests (enqueued to the global queue by the HTTP server). We end up in a situation where, instead of having a low median response time and a very high p99, we have a higher and flatter line across the percentiles and most requests end up in timeout.

I've tried to reproduce this in a test application, but with mitigated results so far. I must be missing something in the picture because following @stephentoub's suggestion, I've run some tests with @VSadov's threadpool which in theory fixes the fairness issue, but I saw no improvement.

My fix has been to push the socket continuations to a custom threadpool, somewhat mimicking what's happening on Windows with the I/O threadpool. However, that custom threadpool isn't quite as finely tuned/optimized as the "normal" one, so it degrades the performance on the TechEmpower benchmark, and so I've never tried to push those changes upstream.

Following our discussion in Warsaw, I've come to realize that epoll and iocp weren't that different, and the heuristics of the I/O threadpool could possibly work with epoll. I've been studying the subject in details over the last weeks with the hope of writing a prototype, but I've got nothing to show yet. Even though you're focusing on the overhead of sockets rather than the scheduling, I believe our work will converge at some point.

That said, if you want some traces to check how sockets use the CPU in our workloads, I should be able to provide that. I'll try to get them in the coming days.

tmds commented 4 years ago

Tmds.LinuxAsync has an implementation for epoll+Linux AIO. I've just added an io_uring based implementation also (with some TODOs). There are some issues with the benchmark infrastructure, so no results to share yet.

stephentoub commented 4 years ago

Sounds great, @tmds. Thanks for the update. Looking forward to results when you've got 'em.

tmds commented 4 years ago

hi, though it has been quit here, we (@antonfirsov, @adamsitnik, @tkp1n and myself) have been working on this, I'd like to share some intermediate conclusions:

Batching operations that are happen when epoll says they are ready using AIO has a measurable improvement on json benchmark
Zero-byte async receives, used by Kestrel to wait until data is available without allocating a buffer, are not batcheable with Linux AIO.
Setting RunContinuationsAsynchronously to false provides a way to make Kestrel receive loop run on io thread, so receives will be batched and don't have to be tried synchronously.
Kestrel sends never start on io thread (epoll/io_uring). Moving sends to io thread (PreferSynchronousCompletion=false) for batching regressed performance.
io_uring implementation is hitting a bottleneck which prevents it from using all CPU. This leads to io_uring performing worse than AIO. We expect that without bottleneck io_uring would perform better. From traces it seems the bottleneck is not in .NET implementation.

kevingosse commented 4 years ago

I was reminded of this thread thanks to Stephen's talk at dotnetos yesterday 😄

I'm wondering why (seemingly) nobody has tried mimicking the Windows I/O threadpool on Linux (this was suggested by Ben Adams as well in this thread I believe).

Unless I'm missing a fundamental detail, epoll's behavior is very close to IOCP if used with the EPOLLEXCLUSIVE flag. We could have an I/O threadpool on Linux using the same growing heuristics as on Windows, waiting on epoll, and inlining the callback. This way we avoid the hop to the threadpool and we keep the ability to scale up when a continuation hijacks the thread to execute a blocking operation.

epoll can't limit the number of active threads like IOCP does, but this can probably be emulated somehow.

Would it make sense for me to spend time prototyping this, or am I missing something?

stephentoub commented 4 years ago

I was reminded of this thread thanks to Stephen's talk at dotnetos yesterday

Thanks for watching. Hope you enjoyed it.

Would it make sense for me to spend time prototyping this, or am I missing something?

You're of course welcome to, and it'd be interesting to see what results you get. Thanks for offering.

However, I'm a bit skeptical you'll see wins. In particular, even if you solve the scaling mechanism, you'd likely need to change the buffer size passed to epoll wait. Today, a single wait can receive up to 1024 events. You'd likely need to make that 1, or else a single wait could get multiple events to be processed, and inlining of the processing of the Nth event could result in stalling all of those received in the batch after that. That would mean you might find yourself making 1024 epoll wait calls where previously you were only making 1, and I'd expect (though could be wrong) that's going to dwarf costs associated with the shuffle to the thread pool.

Note that for .NET 5, we changed how that handoff to the thread pool works, making it much more efficient, to the point where it's now competitive with @tmds's transport that does maintain one thread per core, inlines all processing, etc.

tmds commented 4 years ago

Note that for .NET 5, we changed how that handoff to the thread pool works, making it much more efficient,

This was implemented in https://github.com/dotnet/runtime/pull/35330. Instead of dispatching each event from epoll separately, a batch of event that come from the epoll are now dispatched to the threadpool together.

I'm going to close this because it is no longer tracking anything on-going from my side.

dotnet / runtime

SocketAsyncEngine.Unix perf experiments #14304