apple / swift-nio

Event-driven network application framework for high performance protocol servers & clients, non-blocking.
https://swiftpackageindex.com/apple/swift-nio/documentation
Apache License 2.0

[Linux] evaluate io_uring as alternative async I/O mechanism #1761

Closed hassila closed 3 years ago

hassila commented 3 years ago

Stumbled across another (new) async I/O interface that seems worth considering for the best possible performance on Linux. See some background here:

https://lwn.net/Articles/810414/
https://kernel.dk/io_uring.pdf

And one simple test vs epoll with performance numbers: https://github.com/frevib/io_uring-echo-server/blob/io-uring-op-provide-buffers/benchmarks/benchmarks.md

Is this something of interest?

hassila commented 3 years ago

Here are some results from Netty, which adopted it: https://github.com/netty/netty/issues/10622#issuecomment-701241587

PeterAdams-A commented 3 years ago

Thanks for the heads up @hassila. io_uring has been on our radar for a while. Unfortunately it has yet to reach the top of our list, as we are busy with many other things. We agree that it is likely to lead to improved performance on Linux.

If you feel so inclined we'd welcome any code contributions in this direction.

hassila commented 3 years ago

Thanks @PeterAdams-A - I could have a look. It seems Ubuntu 20.04 ships a kernel fresh enough to support it (io_uring performance has improved quite a bit further in kernels more recent than the 5.4 that Ubuntu uses, but 5.4 should be usable for bring-up).

Just a few basic initial questions (I'm sure I'd need some handholding on how best to integrate with the existing codebase):

I'm sure there'll be additional questions, but just some stuff that came to mind when looking at it.

Lukasa commented 3 years ago

What would be a best practice for supporting two implementations for a single platform and to be able to select between them?

Selecting between them should just be configuration. They can hook in underneath SelectableEventLoop, I suspect. Swift does not have a preprocessor in the traditional sense, so it cannot use #if-based feature detection.

The inventor of io_uring also provides a convenience library (https://github.com/axboe/liburing) that removes a significant chunk of boilerplate

There's no problem with relying on this library iff its absence can be tolerated on systems that do not have io_uring.

Would any of the existing tests be viewed as usable for validating performance, or how would you propose validation is done in an acceptable manner?

We have an existing performance test suite, and it should be possible to run that against any new implementation.

hassila commented 3 years ago

Exposing io_uring should probably be done in the context of https://github.com/apple/swift-system ?

It mentions SwiftNIO as one consumer, but doesn't seem like it is currently used - ok to add it as a dependency?

Lukasa commented 3 years ago

It is not ok to add it as a dependency, unfortunately. Until it tags 1.0, adding it as a dependency will immediately cause dependency hell if anyone else in the ecosystem depends on it.

We expose our system calls through our own internal abstraction layer: https://github.com/apple/swift-nio/blob/main/Sources/NIO/Linux.swift.

hassila commented 3 years ago

Ok, that's alright. I'll just make shims and interfaces analogous to what's already in place then.

hassila commented 3 years ago

The inventor of io_uring also provides a convenience library (https://github.com/axboe/liburing) that removes a significant chunk of boilerplate

There's no problem with relying on this library iff its absence can be tolerated on systems that do not have iouring.

Having spent some time now looking at how Swift interacts with C and what support there is in SwiftPM, I haven't found any good way to do that except manually loading the library with dlopen() and using dlsym() to resolve the relevant symbols. Is that an acceptable approach (it works for me)? I plan on having the implementation fall back to epoll with a warning if io_uring is specified but not available.

Lukasa commented 3 years ago

Yeah, I think we'd have to use dlopen and dlsym. @weissi do you have any objections to that approach?

weissi commented 3 years ago

I'm fine with dlopen/dlsym or invoking the syscalls directly using syscall (returns ENOSYS if unimplemented).

Is the plan to ship the relevant io_uring headers (for struct io_uring_params/struct io_sqring_offsets/struct io_cqring_offsets) in NIO itself?

The other question is if we target liburing or the io_uring_* syscalls directly. Netty targets io_uring_* directly.

Lukasa commented 3 years ago

@weissi See the discussion above. I think targeting liburing is sensible.

hassila commented 3 years ago

Agree; it seems that liburing (also written by Jens Axboe) removes a reasonable amount of boilerplate that would otherwise have to be reimplemented, as discussed above.

With regard to the headers, I had planned on including them like so:

// Pull in io_uring if it's available (Linux kernel 5.1+ systems, better performance with 5.11+)
#if __has_include(<liburing/io_uring.h>)
#include <liburing/io_uring.h>
#endif

This would require liburing to be installed on a developers build system if liburing is to be used - I did not plan to pull them into NIO. If built without liburing headers, liburing support would not be compiled in and NIO would fall back on using epoll as today.

This would fundamentally bind a compiled binary to a header version matching the shared library installed at runtime.

I guess the alternative would be to pull in liburing as a git submodule (with the caveat that it may only build on newer Linux systems... needs to be checked what happens) and effectively bundle both the headers and the .so - but I think it is probably nicer to try to support the version installed on the production/deployment hosts?

Another alternative would of course be to write all the boilerplate, fundamentally duplicating parts of liburing.

Lukasa commented 3 years ago

I think as a first pass we can require that to get a NIO that supports io_uring you need to build on a system with liburing development headers. If we find that this leads to unsatisfactory outcomes we can always change the behaviour later.

hassila commented 3 years ago

Ok, some initial work done:

Now that the API is accessible, a few questions start to pop up; in particular I'm trying to figure out the cleanest way to integrate this that would be acceptable.

I would expect that different EventLoops should be able to choose whether to use epoll or liburing? liburing can optionally be set up with submission queue polling (IORING_SETUP_SQPOLL) through a kernel-side thread (that lives for x ms) for high-throughput scenarios (significantly reducing the need for actual syscalls) - this basically trades off CPU for better latency/throughput. It's also possible to configure the submission/completion queue sizes independently (it's not uncommon to require a larger completion queue, according to presentations from Jens). For users who want io_uring, it seems reasonable to open these options up for tuning on a per-uring basis (we would use one io_uring instance per EventLoop), as one may want to make one uring instance polling while others are not.

Any Channel associated with a liburing enabled EventLoop needs to use liburing primitives also for submitting work.

So trying to break down some actual questions:

Apologies in advance for a bit "fluffy" questions, still wrapping my head around the codebase.

weissi commented 3 years ago

First of all, thank you so much for driving this forward, this is super valuable! Leaving some comments inline.

  • Ran into a slight problem: some of the liburing API is defined as static inline functions in the header (so one cannot dlsym those), but so far all of them seem to resolve down to atomics, which should be available at link time

But static inline functions won't necessarily have any symbols at all - they may exist only as inlined pieces of code. If Swift's clang importer can't import these, then we'd need to "wrap" them in regular C functions, and we could stub them out if uring isn't available (exactly as you suggest below).

Now that the API is accessible, a few questions start to pop up; in particular I'm trying to figure out the cleanest way to integrate this that would be acceptable.

I would expect that different EventLoops should be able to choose whether to use epoll or liburing? liburing can optionally be set up with submission queue polling (IORING_SETUP_SQPOLL) through a kernel-side thread (that lives for x ms) for high-throughput scenarios (significantly reducing the need for actual syscalls) - this basically trades off CPU for better latency/throughput. It's also possible to configure the submission/completion queue sizes independently (it's not uncommon to require a larger completion queue, according to presentations from Jens). For users who want io_uring, it seems reasonable to open these options up for tuning on a per-uring basis (we would use one io_uring instance per EventLoop), as one may want to make one uring instance polling while others are not.

So I'd do the configuration on the EventLoopGroup. If somebody wants N EventLoops where, say, one EventLoop has a different config, then they can just start two EventLoopGroups, one with just 1 EL and another one with N-1. Do you think that'd be good enough?

Any Channel associated with a liburing enabled EventLoop needs to use liburing primitives also for submitting work.

👍

So trying to break down some actual questions:

  • Where should I expose configuration of whether to use liburing instead of epoll? SelectableEventLoop extended initialiser? (SelectableEventLoop can then have conditional use of liburing if enabled for the completion handling side)

Good question. I'm not 100% sure you would even want to use SelectableEventLoop at all. SelectableEventLoop isn't actually public API, so we could dynamically use a URingEventLoop or so. I haven't spent a lot of time thinking about this, so I think you'd be in a better place to decide.

Aside: There are a few places in NIO itself where we expect SelectableEventLoop but I reckon that could be changed to either expect SelectableEventLoop or URingEventLoop.

  • If so, should the actual uring options be part of that initialiser also (poll mode, submission/completion queue sizes)?
  • I must admit I am a bit unclear on the best way to push in the submission side and would appreciate guidance - should I conditionalize the calls in BSDSocketAPIPosix.swift to use uring instead of syscalls? Then I'd need to add a flag to NIOBSDSocket to know whether to use uring. I'm not sure yet if/where NIOBSDSocket is set up for the Channel, but if the Channel is associated with an EventLoop with uring enabled, it would need to be set for the socket in that case.

NIOBSDSocket mostly exists to namespace lots of socket stuff (socket options, the right types - int vs. SOCKET on Windows - etc.) and was introduced to make Windows interop easier. The actual sockets are owned by the Socket class. The instantiation of those is currently done in the bootstraps.

Apologies in advance for a bit "fluffy" questions, still wrapping my head around the codebase.

Not at all, please feel free to ask questions. If you're in some of the Slack/Discord channels, we're also available there for a higher throughput comms channel :)

hassila commented 3 years ago

So I'd do the configuration on the EventLoopGroup. If somebody wants N eventloops where say one EventLoop has a different config, then they can just start two EventLoopGroups, one with just 1 EL and another one with N-1. Do you think that'd be good enough?

Yeah, I think that'd be alright.

So trying to break down some actual questions:

  • Where should I expose configuration of whether to use liburing instead of epoll? SelectableEventLoop extended initialiser? (SelectableEventLoop can then have conditional use of liburing if enabled for the completion handling side)

Good question. I'm not 100% sure you would even want to use SelectableEventLoop at all. SelectableEventLoop isn't actually public API, so we could dynamically use a URingEventLoop or so. I haven't spent a lot of time thinking about this, so I think you'd be in a better place to decide.

Aside: There are a few places in NIO itself where we expect SelectableEventLoop but I reckon that could be changed to either expect SelectableEventLoop or URingEventLoop.

I think going after the Selector instead may be the right level; I'll follow up in Slack as you suggested - just joined there now.

NIOBSDSocket mostly exists to namespace lots of socket stuff (socket options, the right types - int vs. SOCKET on Windows - etc.) and was introduced to make Windows interop easier. The actual sockets are owned by the Socket class. The instantiation of those is currently done in the bootstraps.

I'll focus on getting the epoll/kqueue replacement up first before looking at outbound as a POC.

Apologies in advance for a bit "fluffy" questions, still wrapping my head around the codebase.

Not at all, please feel free to ask questions. If you're in some of the Slack/Discord channels, we're also available there for a higher throughput comms channel :)

Super, thanks, will follow up there!

fabianfett commented 3 years ago
  • Where should I expose configuration of whether to use liburing instead of epoll? SelectableEventLoop extended initialiser? (SelectableEventLoop can then have conditional use of liburing if enabled for the completion handling side)

Good question. I'm not 100% sure you would even want to use SelectableEventLoop at all. SelectableEventLoop isn't actually public API, so we could dynamically use a URingEventLoop or so. I haven't spent a lot of time thinking about this, so I think you'd be in a better place to decide.

Aside: There are a few places in NIO itself where we expect SelectableEventLoop but I reckon that could be changed to either expect SelectableEventLoop or URingEventLoop.

For what it's worth: The mother?/sister? project Netty uses an IOUringEventLoop.

hassila commented 3 years ago

Short status update of current WIP.

Have done a (partly) functional bring-up using liburing, with the basic sample apps working ok (i.e. NIOUDPEchoServer, NIOChatClient, NIOChatServer, NIOMulticastChat, NIOPerformanceTester, NIOHTTP1Server, NIOHTTP1Client, NIOEchoClient, NIOUDPEchoClient, NIOEchoServer); now working through the unit tests and integration tests as time allows.

There is still a lot of cleanup to be done and a little bit of refactoring (and tuning), but basically I've just hooked into the SEL by providing a custom Selector class for uring. NIOPerformanceTester results look promising compared to epoll, but I don't want to put out any numbers until further along. If anyone has a good performance benchmark outside of what NIO provides, I'd be interested in help with testing later on (prerequisites: Ubuntu 20.04, kernel updated to 5.11, and liburing installed from GitHub).

This approach just gives us 'io_uring lite', as we still have the same conceptual "poll fds, then read when data is available" - instead of fully leveraging it with "read all fds, then process data as it arrives". io_uring has a clever mechanism where one can register a pool of I/O buffers with the kernel, so even when doing 1M reads on different fds, we don't need to back every read with its own buffer - the kernel will let us know which buffer from the pool was used and we can reuse it. Quite nice. (More detail: there's support for different pools with different buffer sizes too, so we could get clever and switch between pools, similar to how different allocation sizes for the receive buffer seem to be handled now.) But all of that is for later; it requires a more in-depth understanding of NIO, and I'd need to frame some more concrete questions after further analysis of how that could proceed. In that case I think we would likely go for a completely different URingEventLoop as well.

weissi commented 3 years ago

@hassila fantastic, that's amazing progress, thank you so much! I think it's totally reasonable to start using io_uring just as the eventing mechanism (io_uring lite as you call it) and later on we can switch all syscalls to their io_uring equivalent. That'd be super cool for NonBlockingFileIO too!

hassila commented 3 years ago

There'll be a small pause in progress here; I need to spend some time on testing https://github.com/axboe/liburing/issues/310 which was just added - this will give a much better impedance match to how SwiftNIO works with uring and eliminate a significant amount of the unnecessary re-registrations that the current implementation requires. I intend to use this new uring functionality when it is working, so uring support will eventually require a 5.13+ kernel (which is fine I think; older systems will be able to use epoll of course).

hassila commented 3 years ago

Ok, short update of current status:

I've migrated the implementation to instead use multishot polls / poll updates from https://github.com/axboe/liburing/issues/310 (so all tests are running on a 5.12rc3+multishot-polls kernel; it will be some time before that is mainstream).

Current results from 'swift tests' compared to a clean clone of SwiftNIO running with epoll:

epoll from mainline:

Test Suite 'All tests' failed at 2021-03-24 15:20:14.133
     Executed 1311 tests, with 2 failures (0 unexpected) in 81.388 (81.388) seconds

io_uring development branch:

Test Suite 'All tests' failed at 2021-03-24 15:35:49.859
     Executed 1305 tests, with 2 failures (0 unexpected) in 107.069 (107.069) seconds

The uring runtime is longer due to tons of debug logging still being in place, so don't read too much into the numbers as of yet.

There are a handful of tests that I've disabled and may want to discuss later, when the code has been cleaned up enough for a draft PR (one of the tests is missing as I'm lagging main by a few revisions):

SelectorTest.testWeDoNotDeliverEventsForPreviouslyClosedChannels
StreamChannelTest.testWritabilityStartsTrueGoesFalseAndBackToTrue
StreamChannelTest.testHalfCloseOwnOutput
StreamChannelTest.testLotsOfWritesWhilstOtherSideNotReading
NIOHTTP1TestServerTest.testConcurrentRequests
(NonBlockingFileIOTest.testThrowsErrorOnUnstartedPool not run yet)

For the scripts/integration_tests.sh status is fairly decent, everything passes except one of the malloc tests (test_1000_udpconnections) which is currently disabled.

> scripts/integration_tests.sh
Running test suite 'tests_01_http'
Running test 'test_01_get_file.sh'... OK (3s)
Running test 'test_02_get_random_bytes.sh'... OK (2s)
Running test 'test_03_post_random_bytes.sh'... OK (3s)
Running test 'test_04_keep_alive_works.sh'... OK (1s)
Running test 'test_05_repeated_reqs_work.sh'... OK (2s)
Running test 'test_06_http_1.0.sh'... OK (1s)
Running test 'test_07_headers_work.sh'... OK (1s)
Running test 'test_08_survive_signals.sh'... OK (1s)
Running test 'test_09_curl_happy_with_trailers.sh'... OK (2s)
Running test 'test_10_connection_drop_in_body_ok.sh'... OK (1s)
Running test 'test_11_res_body_streaming.sh'... OK (5s)
Running test 'test_12_headers_too_large.sh'... OK (1s)
Running test 'test_13_http_pipelining.sh'... OK (3s)
Running test 'test_14_strict_mode_assertion.sh'... OK (2s)
Running test 'test_15_post_in_chunked_encoding.sh'... OK (2s)
Running test 'test_16_tcp_client_ip.sh'... OK (1s)
Running test 'test_17_serve_massive_sparse_file.sh'... OK (20s)
Running test 'test_18_close_with_no_keepalive.sh'... OK (2s)
Running test 'test_19_connection_drop_while_waiting_for_response_uds.sh'... OK (2s)
Running test 'test_20_connection_drop_while_waiting_for_response_tcp.sh'... OK (1s)
Running test 'test_21_connection_reset_tcp.sh'... OK (2s)
Running test 'test_22_http_1.0_keep_alive.sh'... OK (1s)
Running test 'test_23_repeated_reqs_with_half_closure.sh'... OK (7s)
Running test 'test_24_http_over_stdio.sh'... OK (0s)
Running test suite 'tests_02_syscall_wrappers'
Running test 'test_01_syscall_wrapper_fast.sh'... OK (12s)
Running test 'test_02_unacceptable_errnos.sh'... OK (4s)
Running test suite 'tests_03_debug_binary_checks'
Running test 'test_01_check_we_do_not_link_Foundation.sh'... OK (1s)
Running test 'test_02_expected_crashes_work.sh'... OK (2s)
Running test suite 'tests_04_performance'
Running test 'test_01_allocation_counts.sh'... 
...
OK (219s)
Running test suite 'tests_05_assertions'
Running test 'test_01_syscall_wrapper_fast.sh'... OK (0s)
OK (ran 30 tests successfully)

Next step is to clean up and refactor the implementation just a tad before opening a draft PR to discuss the remaining failing tests.

Many thanks to Johannes Weiß for helping out getting a workable toolchain up and to Cory Benfield for helping out with some roadblocks today with the SAL especially, much appreciated.

weissi commented 3 years ago

@hassila Wow, this is super impressive! Great work.

hassila commented 3 years ago

Added https://github.com/apple/swift-nio/pull/1788 to be able to discuss a) the current approach (split of Selectors) and b) the remaining failing cases.

hassila commented 3 years ago

One more question with a performance aspect: I also see this pattern fairly often in a few tests (e.g. EchoServerClientTest.testPortNumbers), where all of this occurs in the same event loop tick before calling whenReady again:

S [NIOThread(actualName = NIO-ELT-179-#0)] register interested SelectorEventSet(rawValue: 1) uringEventSet [24]
S [NIOThread(actualName = NIO-ELT-179-#0)] Re-register old SelectorEventSet(rawValue: 1) new SelectorEventSet(rawValue: 3) uringEventSet [24] reg.uringEventSet [8216]
S [NIOThread(actualName = NIO-ELT-179-#0)] Re-register old SelectorEventSet(rawValue: 3) new SelectorEventSet(rawValue: 7) uringEventSet [8216] reg.uringEventSet [8217]

This translates to: a registration for reset, changed to reset+readEOF, and then changed to reset+readEOF+read.

Is there a good reason for this pattern and is it common?

If so, would it be possible to elide all changes to registrations and only perform the low-level kqueue/uring/epoll changes at the beginning of whenReady, instead of doing them inline with the actual calls? Given that it all happens in the same tick, it would semantically be the same, but it would allow us to make all the changes with one system call instead of three. So basically: keep track of the pending changes to the registration and, at the top of whenReady, execute them before reaping new events. (It would be extra attractive for uring, which can basically merge the system calls for all the different fds if done that way...)

hassila commented 3 years ago

Here is another example, from EchoServerClientTest.testConnectingToIPv4And6ButServerOnlyWaitsOnIPv4, where the registration moves through 4 different stages during the same tick (going back to the original one). I understand the need for changing the actual registration, but it seems eliding the changes to the external (kqueue/epoll/uring) registration could be quite beneficial if these patterns are representative of real-world use.

S [NIOThread(actualName = NIO-ELT-181-#0)] whenReady.blockUntilTimeout
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_wait_cqe_timeout.ETIME milliseconds __kernel_timespec(tv_sec: 0, tv_nsec: 248841960)
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_wait_cqe_timeout CQE:s [0x00007fb29cc8edd0] - ring flags are [0]
L [NIOThread(actualName = NIO-ELT-181-#0)] 0 = fd[17] eventType[Optional(NIO.CqeEventType.poll)] res [4] flags [2]  bitpattern[Optional(0x0000000100000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_wait_cqe_timeout fd[17] eventType[Optional(NIO.CqeEventType.poll)] bitPattern[4294967313] cqes[0]!.pointee.res[4]
S [NIOThread(actualName = NIO-ELT-181-#0)] We found a registration for event.fd [17]
S [NIOThread(actualName = NIO-ELT-181-#0)] selectorEvent [SelectorEventSet(rawValue: 8)] registration.interested [SelectorEventSet(rawValue: 9)]
S [NIOThread(actualName = NIO-ELT-181-#0)] intersection [SelectorEventSet(rawValue: 8)]
S [NIOThread(actualName = NIO-ELT-181-#0)] running body [NIOThread(actualName = NIO-ELT-181-#0)] SelectorEventSet(rawValue: 8) SelectorEventSet(rawValue: 8)
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[28] newPollmask[28]  userBitpatternAsPointer[Optional(0x0000000200000011)]
S [NIOThread(actualName = NIO-ELT-181-#0)] Re-register old SelectorEventSet(rawValue: 9) new SelectorEventSet(rawValue: 11) uringEventSet [28] reg.uringEventSet [8220]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[28] newPollmask[8220]  userBitpatternAsPointer[Optional(0x0000000200000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush io_uring_submit needed [1] submission(s), submitted [2] SQE:s out of [2] possible
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
S [NIOThread(actualName = NIO-ELT-181-#0)] Re-register old SelectorEventSet(rawValue: 11) new SelectorEventSet(rawValue: 15) uringEventSet [8220] reg.uringEventSet [8221]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[8220] newPollmask[8221]  userBitpatternAsPointer[Optional(0x0000000200000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush io_uring_submit needed [1] submission(s), submitted [1] SQE:s out of [1] possible
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
S [NIOThread(actualName = NIO-ELT-181-#0)] Re-register old SelectorEventSet(rawValue: 15) new SelectorEventSet(rawValue: 7) uringEventSet [8221] reg.uringEventSet [8217]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[8221] newPollmask[8217]  userBitpatternAsPointer[Optional(0x0000000200000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush io_uring_submit needed [1] submission(s), submitted [1] SQE:s out of [1] possible
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
S [NIOThread(actualName = NIO-ELT-181-#0)] whenReady.block

7 =  read  + reset + readEOF
9 =  write + reset
11 = write + reset + readEOF
15 = write + reset + readEOF + read
7  = read  + reset + readEOF

For uring it could be beneficial anyway, even if not super common, as we would only have a single syscall for all 'external' registration modifications and not one per fd as for epoll/kqueue.

Lukasa commented 3 years ago

Yup, this is expected behaviour. We can break out the registrations/deregistrations and talk about when they happen, which can help shed some light on how this happens.

  1. reset. This registration is always active, as it is fundamentally unmaskable on Linux.
  2. readEOF. This registration is made when the channel becomes active, and is only removed when the channel actually reads EOF. If the channel never hits readEOF for some reason then the registration will never be removed, as it will be implicitly removed when the channel is unregistered from the selector.
  3. write. This registration happens whenever we get EAGAIN on a write, and will persist until the Channel write buffer is drained (note that this is the Channel write buffer, not the kernel write buffer). This will tend to flap on and off, but not as fast as...
  4. read. This registration is managed carefully because it interacts with NIO's backpressure mechanism. In particular, the way NIO backpressure works is that NIO assumes that all channels do not want to read until they ask to, and then that they only want to read once. This is a "pull"-based method of exerting backpressure, rather than a "push"-based one. Pull-based methods are substantially easier to understand, generally speaking. In some cases this can make read appear to flap. Now, read should not flap in most cases because autoRead will usually prevent it flapping, but it can't guarantee it.

Note that nowhere in here does it say the registrations have to actually happen when the channel asks for them. A selector is free to "batch" them up, so long as it never delivers an event that the Channel didn't ask for. The contract is: if the Channel hasn't registered for it, then the event must not be delivered to it. So long as the io_uring-based selector obeys that contract, it can turn this into syscalls however it wants.

hassila commented 3 years ago

Thanks @Lukasa - that is what I hoped, I will try to batch and elide all changes done to registrations then.

A bigger discussion is ET vs LT (as we touched upon earlier), as the current non-exhaustive reads (and writes?) basically force the implementation to prod the kernel unnecessarily. If it were possible to tell from a given registration whether exhaustive reads/writes have been done, we could avoid that prod, which would make a major difference.

Lukasa commented 3 years ago

In this instance I think rather than do that work we should think to the long term, which is an event loop that submits the work to io_uring directly rather than using it as a glorified epoll. :wink:

hassila commented 3 years ago

That's fair enough; the additional prodding currently seems to make it perform less well than epoll (although if https://github.com/axboe/liburing/issues/321 turns up, it may turn the tables, as we would then presumably skip receiving CQEs for the prods that don't generate any data).

Fundamentally things are getting fairly close to working (so we at least have infrastructure to go forward with a proper EL), but it seems a proper uring EL will be needed to really make a difference.

I'd want to get the glorified epoll up and running and completing all tests first though...

Any pointers on an appropriate place to hook in for read/write would be appreciated - basically we need to manage a set of read and write buffers at the EL level which can only be reused/freed when the corresponding CQEs arrive later on. So the memory must be owned by the EL.

Lukasa commented 3 years ago

I think practically speaking we'll need to diverge away from SelectableEventLoop entirely: the mechanism used by io_uring is incompatible with the SelectableEventLoop model where the various channels own their I/O operations.

normanmaurer commented 3 years ago

@Lukasa not sure if this is true... we use io_uring in a very similar way as we use epoll etc in netty so I guess you could do the same. We basically still use non-blocking fd's etc. Maybe this is helpful:

https://github.com/netty/netty-incubator-transport-io_uring/blob/main/src/main/java/io/netty/incubator/channel/uring/IOUringEventLoop.java

Lukasa commented 3 years ago

@normanmaurer and I chatted offline. TL;DR: he agrees with me. :wink:

normanmaurer commented 3 years ago

haha yeah @Lukasa is right.. I misunderstood him.. That said the link I provided is still useful I think. Also how we use the submission and completion queues:

https://github.com/netty/netty-incubator-transport-io_uring/blob/main/src/main/java/io/netty/incubator/channel/uring/IOUringSubmissionQueue.java https://github.com/netty/netty-incubator-transport-io_uring/blob/main/src/main/java/io/netty/incubator/channel/uring/IOUringCompletionQueue.java

hassila commented 3 years ago

Ok, updated current state in https://github.com/apple/swift-nio/pull/1788#issuecomment-810393146 - please check if you want anything salvaged there.

hassila commented 3 years ago

New PR in https://github.com/apple/swift-nio/pull/1804.

hassila commented 3 years ago

Now that https://github.com/apple/swift-nio/pull/1804 landed, I will close this - follow up for the next steps will be in https://github.com/apple/swift-nio/issues/1831