Here are some results from netty who adopted it: https://github.com/netty/netty/issues/10622#issuecomment-701241587
Thanks for the heads up @hassila. io_uring has been on our radar for a while. Unfortunately it's yet to reach the top of our list as we are busy with many other things. We agree that it is likely to lead to improved performance on linux.
If you feel so inclined we'd welcome any code contributions in this direction.
Thanks @PeterAdams-A - I could have a look; it seems Ubuntu 20.04 ships a fresh enough kernel to support it (io_uring performance has improved quite a bit in kernels more recent than the 5.4 that Ubuntu uses, but it should be usable for bring-up).
Just a few basic initial questions (I'm sure I'd need some handholding on best integration with existing code-base):
I'm sure there'll be additional questions, but just some stuff that came to mind when looking at it.
What would be a best practice for supporting two implementations for a single platform and to be able to select between them?
Selecting between them should just be configuration. They can hook in underneath SelectableEventLoop, I suspect. Swift does not have a preprocessor in the traditional sense, so it cannot use #if-based feature detection.
The inventor of io_uring also provides a convenience library (https://github.com/axboe/liburing) that removes a significant chunk of boilerplate
There's no problem with relying on this library as long as its absence can be tolerated on systems that do not have io_uring.
Would any of the existing tests be viewed as usable for validating performance, or how would you propose validation is done in an acceptable manner?
We have an existing performance test suite, and it should be possible to run that against any new implementation.
Exposing io_uring should probably be done in the context of https://github.com/apple/swift-system ?
It mentions SwiftNIO as one consumer, but doesn't seem like it is currently used - ok to add it as a dependency?
It is not ok to add it as a dependency, unfortunately. Until it tags 1.0 adding it as a dependency will immediately cause dependency hell if anyone else in the ecosystem depends on it.
We expose our system calls through our own internal abstraction layer: https://github.com/apple/swift-nio/blob/main/Sources/NIO/Linux.swift.
Ok, that's alright. I'll just make shims and interfaces analogous to what's already in place then.
The inventor of io_uring also provides a convenience library (https://github.com/axboe/liburing) that removes a significant chunk of boilerplate
There's no problem with relying on this library as long as its absence can be tolerated on systems that do not have io_uring.
Having spent some time now looking at how Swift interacts with C and what support there is in the SPM - I haven't figured out any good way to do that - except manually loading it with dlopen() and using dlsym() to resolve relevant symbols. Is that an acceptable approach (works for me)? I plan on having the implementation fall back to epoll with a warning if io_uring is specified but not available.
Yeah, I think we'd have to use dlopen and dlsym. @weissi do you have any objections to that approach?
I'm fine with dlopen/dlsym or invoking the syscalls directly using syscall (returns ENOSYS if unimplemented).
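For reference, a minimal sketch of that dlopen/dlsym approach (not NIO code; the sonames tried and the exact C signature assumed for io_uring_queue_init are assumptions about liburing):

#if os(Linux)
import Glibc

// Sketch: resolve one exported liburing entry point at runtime and fall back to epoll
// if the library is missing. io_uring_queue_init(entries, ring, flags) is an exported
// symbol; the struct io_uring pointer is passed as a raw pointer here for brevity.
typealias CIOURingQueueInit = @convention(c) (UInt32, UnsafeMutableRawPointer?, UInt32) -> CInt

func loadIOURingQueueInit() -> CIOURingQueueInit? {
    // Try the sonames liburing has shipped under; adjust for the target distribution.
    guard let handle = dlopen("liburing.so.2", RTLD_NOW) ?? dlopen("liburing.so.1", RTLD_NOW) else {
        return nil // liburing not installed: caller should fall back to epoll.
    }
    guard let symbol = dlsym(handle, "io_uring_queue_init") else {
        _ = dlclose(handle)
        return nil
    }
    return unsafeBitCast(symbol, to: CIOURingQueueInit.self)
}
#endif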
Is the plan to ship the relevant io_uring headers (for struct io_uring_params/struct io_sqring_offsets/struct io_cqring_offsets) in NIO itself?
The other question is if we target liburing or the io_uring_* syscalls directly. Netty targets io_uring_* directly.
@weissi See the discussion above. I think targeting liburing is sensible.
Agree - it seems that liburing (also written by Jens Axboe) removes a reasonable amount of boilerplate that would otherwise have to be reimplemented, as discussed above.
With regard to the headers, I had planned on including them such as:
// Pull in io_uring if it's available (Linux kernel 5.1+ systems, better performance with 5.11+)
#if __has_include(<liburing/io_uring.h>)
#include <liburing/io_uring.h>
#endif
This would require liburing to be installed on a developer's build system if liburing is to be used - I did not plan to pull the headers into NIO. If built without the liburing headers, liburing support would not be compiled in and NIO would fall back on using epoll as today.
This would fundamentally tie a compiled binary to a header version that matches the shared library installed at runtime.
I guess the alternative would be to pull in liburing as a git submodule (with the caveat that it may only build on newer Linux systems... what happens needs to be checked) and effectively bundle both the headers and the .so - but I think it is probably nicer to try to support the version installed on the production/deployment hosts?
Another alternative would of course be to write all the boilerplate, fundamentally duplicating parts of liburing.
I think as a first pass we can require that to get a NIO that supports io_uring you need to build on a system with liburing development headers. If we find that this leads to unsatisfactory outcomes we can always change the behaviour later.
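One possible Swift-side shape for that build-time gating (the compilation condition name is invented here; the build would define it only when the liburing development headers were found):

// Hypothetical flag, e.g. passed as -D SWIFTNIO_IO_URING when liburing headers exist.
enum EventingBackend {
    case epoll
    case uring
}

func defaultBackend() -> EventingBackend {
    #if SWIFTNIO_IO_URING
    return .uring   // io_uring-backed selector compiled in
    #else
    return .epoll   // fall back to the existing epoll/kqueue path
    #endif
}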
Ok, some initial work done:
Now that the API is accessible, a few questions start to pop up; in particular I'm trying to figure out the cleanest way to integrate this that would be acceptable.
I would expect that different EventLoops should be able to choose whether to use epoll or liburing? liburing can optionally be set up with submission queue polling (IORING_SETUP_SQPOLL) through a kernel-side thread (that stays alive for x ms) for high-throughput scenarios (significantly reducing the need for actual syscalls) - this basically trades off CPU for better latency/throughput. It's also possible to configure the submission/completion queue sizes independently (it's not uncommon to require a larger completion queue, according to presentations from Jens). For users who want io_uring, it seems reasonable to open these options up for tuning on a per-uring basis (we would use one io_uring instance per EventLoop), as one may want to make one uring instance polling while others are not.
Any Channel associated with a liburing enabled EventLoop needs to use liburing primitives also for submitting work.
So trying to break down some actual questions:
Apologies in advance for some slightly "fluffy" questions, still wrapping my head around the codebase.
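To make the tuning knobs mentioned above concrete, a purely hypothetical configuration type might look like this (none of these names exist in NIO):

// Hypothetical per-ring tuning options; one io_uring instance per EventLoop is assumed.
struct URingTuning {
    // Submission queue entries requested when the ring is created.
    var submissionQueueEntries: UInt32 = 4096
    // Completion queues are often sized larger than the submission queue.
    var completionQueueEntries: UInt32 = 8192
    // Whether to request IORING_SETUP_SQPOLL (kernel-side submission polling thread).
    var submissionQueuePolling: Bool = false
    // How long, in milliseconds, the SQPOLL kernel thread stays awake with no work.
    var sqPollIdleMilliseconds: UInt32 = 1000
}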
First of all, thank you so much for driving this forward, this is super valuable! Leaving some comments inline.
- Ran into a slight problem: some of liburing's API is defined as static inline functions in the header (so those cannot be resolved with dlsym), but so far all of them seem to resolve down to atomics, which should be available at link time
But static inline functions won't necessarily have any symbols at all; they may only exist as inlined pieces of code. If Swift's clang importer can't import these, then we'd need to "wrap" them in a regular C function and we could stub them out if uring isn't available (exactly as you suggest below).
Now that the API is accessible, a few questions start to pop up; in particular I'm trying to figure out the cleanest way to integrate this that would be acceptable.
I would expect that different EventLoops should be able to choose whether to use epoll or liburing? liburing can optionally be set up with submission queue polling (IORING_SETUP_SQPOLL) through a kernel-side thread (that stays alive for x ms) for high-throughput scenarios (significantly reducing the need for actual syscalls) - this basically trades off CPU for better latency/throughput. It's also possible to configure the submission/completion queue sizes independently (it's not uncommon to require a larger completion queue, according to presentations from Jens). For users who want io_uring, it seems reasonable to open these options up for tuning on a per-uring basis (we would use one io_uring instance per EventLoop), as one may want to make one uring instance polling while others are not.
So I'd do the configuration on the EventLoopGroup. If somebody wants N event loops where say one EventLoop has a different config, then they can just start two EventLoopGroups, one with just 1 EL and another one with N-1. Do you think that'd be good enough?
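For illustration, the suggested two-group arrangement uses today's NIO group API as-is; the io_uring tuning itself would be a hypothetical extra parameter:

import NIO

// One single-threaded group that would get the special (e.g. SQPOLL) configuration,
// and one ordinary group for the remaining event loops.
func makeGroups() -> (tuned: MultiThreadedEventLoopGroup, ordinary: MultiThreadedEventLoopGroup) {
    let tuned = MultiThreadedEventLoopGroup(numberOfThreads: 1)   // imagine: io_uring SQPOLL options here
    let ordinary = MultiThreadedEventLoopGroup(numberOfThreads: max(1, System.coreCount - 1))
    return (tuned, ordinary)
}

Bootstraps that need the tuned behaviour would then simply be pointed at the single-loop group.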
Any Channel associated with a liburing enabled EventLoop needs to use liburing primitives also for submitting work.
👍
So trying to break down some actual questions:
- Where should I expose configuration of whether to use liburing instead of epoll? SelectableEventLoop extended initialiser? (SelectableEventLoop can then have conditional use of liburing if enabled for the completion handling side)
Good question. I'm not even 100% sure you would even want to use SelectableEventLoop at all. SelectableEventLoop isn't actually public API so we could dynamically use a URingEventLoop or so. I haven't spent a lot of time thinking about this so I think you'd be in a better place to decide.
Aside: There are a few places in NIO itself where we expect SelectableEventLoop but I reckon that could be changed to either expect SelectableEventLoop or URingEventLoop.
- If so, should the actual uring options be part of that initialiser also (poll mode, submission/completion queue sizes)?
- I must admit I am a bit unclear on the best way to push in the submission side and would appreciate guidance - should I conditionalize the calls in BSDSocketAPIPosix.swift to use uring instead of syscalls? Then I'd need to add a flag to NIOBSDSocket to know whether to use uring. I'm not sure yet if/where NIOBSDSocket is set up for the Channel, but if the Channel is associated with an EventLoop with uring enabled it would need to be set for the socket in that case.
NIOBSDSocket is mostly to name-space lots of socket stuff (like socket options, the right types int vs. SOCKET (Windows) etc.) that was introduced to make Windows interop easier. The actual sockets are owned by the Socket class. The instantiation of those is currently done in the bootstraps.
Apologies in advance for some slightly "fluffy" questions, still wrapping my head around the codebase.
Not at all, please feel free to ask questions. If you're in some of the Slack/Discord channels, we're also available there for a higher throughput comms channel :)
So I'd do the configuration on the EventLoopGroup. If somebody wants N event loops where say one EventLoop has a different config, then they can just start two EventLoopGroups, one with just 1 EL and another one with N-1. Do you think that'd be good enough?
Yeah, I think that'd be alright.
So trying to break down some actual questions:
- Where should I expose configuration of whether to use liburing instead of epoll? SelectableEventLoop extended initialiser? (SelectableEventLoop can then have conditional use of liburing if enabled for the completion handling side)
Good question. I'm not even 100% sure you would even want to use SelectableEventLoop at all. SelectableEventLoop isn't actually public API so we could dynamically use a URingEventLoop or so. I haven't spent a lot of time thinking about this so I think you'd be in a better place to decide.
Aside: There are a few places in NIO itself where we expect SelectableEventLoop but I reckon that could be changed to either expect SelectableEventLoop or URingEventLoop.
I think going after the Selector instead may be the right level; I will follow up in Slack as you suggested - just joined there now.
NIOBSDSocket is mostly to name-space lots of socket stuff (like socket options, the right types int vs. SOCKET (Windows) etc.) that was introduced to make Windows interop easier. The actual sockets are owned by the Socket class. The instantiation of those is currently done in the bootstraps.
I'll focus on getting the epoll/kqueue replacement up first before looking at outbound as a POC.
Apologies in advance for some slightly "fluffy" questions, still wrapping my head around the codebase.
Not at all, please feel free to ask questions. If you're in some of the Slack/Discord channels, we're also available there for a higher throughput comms channel :)
Super, thanks, will follow up there!
- Where should I expose configuration of whether to use liburing instead of epoll? SelectableEventLoop extended initialiser? (SelectableEventLoop can then have conditional use of liburing if enabled for the completion handling side)
Good question. I'm not even 100% sure you would even want to use SelectableEventLoop at all. SelectableEventLoop isn't actually public API so we could dynamically use a URingEventLoop or so. I haven't spent a lot of time thinking about this so I think you'd be in a better place to decide.
Aside: There are a few places in NIO itself where we expect SelectableEventLoop but I reckon that could be changed to either expect SelectableEventLoop or URingEventLoop.
For what it's worth: The mother?/sister? project Netty uses an IOUringEventLoop.
Short status update of current WIP.
Have done a (partly) functional bring-up using liburing, with the basic sample apps working OK (i.e. NIOUDPEchoServer, NIOChatClient, NIOChatServer, NIOMulticastChat, NIOPerformanceTester, NIOHTTP1Server, NIOHTTP1Client, NIOEchoClient, NIOUDPEchoClient, NIOEchoServer); now working through unit tests and integration tests as time allows.
There is still a lot of cleanup to be done and a little bit of refactoring (and tuning), but basically I've just hooked into the SEL by providing a custom Selector class for uring. NIOPerformanceTester results look promising compared to epoll, but I don't want to put out any numbers until further along. If anyone has a good performance benchmark outside of what NIO provides, I'd be interested in help with testing later on (prerequisites: Ubuntu 20.04, update the kernel to 5.11 and install liburing from GitHub).
This approach just gives 'io_uring lite' as we still have the same conceptual "poll fds, then read when data available" model - instead of fully leveraging it with "read all fds, then process data as it arrives". io_uring has a clever mechanism where one can register a pool of I/O buffers with the kernel, so even if doing 1M reads on different fds, we don't need to back every read up with a buffer - the kernel will let us know which one from the pool was used and we can reuse it. Quite nice. (In more detail, there's support for different pools with different buffer sizes too, so we can get clever and switch between pools, similar to how different allocation sizes of the receive buffer seem to be handled now.) But all of that is for later; it requires a more in-depth understanding of NIO and I'd need to frame some more concrete questions after further analysis on how that could proceed. In that case I think we would likely go for a completely different URingEventLoop also.
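As an illustration of that buffer-pool idea (conceptual only, neither NIO nor liburing API): an event-loop-owned pool where buffers are handed to the kernel up front and recycled once a completion reports which one was filled.

// Sketch of an event-loop-owned receive buffer pool; the indices play the role of the
// buffer IDs the kernel reports back in the completion.
final class ReceiveBufferPool {
    private var buffers: [UnsafeMutableRawBufferPointer]
    private var available: [Int] // indices currently free to hand to the kernel

    init(count: Int, bufferSize: Int) {
        self.buffers = (0..<count).map { _ in
            UnsafeMutableRawBufferPointer.allocate(byteCount: bufferSize, alignment: 16)
        }
        self.available = Array(0..<count)
    }

    // Take a buffer to provide to the kernel (conceptually IORING_OP_PROVIDE_BUFFERS).
    func take() -> (index: Int, buffer: UnsafeMutableRawBufferPointer)? {
        guard let index = available.popLast() else { return nil }
        return (index, buffers[index])
    }

    // A completion told us buffer `index` was filled; once its bytes are consumed, recycle it.
    func recycle(index: Int) {
        available.append(index)
    }

    deinit {
        buffers.forEach { $0.deallocate() }
    }
}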
@hassila fantastic, that's amazing progress, thank you so much! I think it's totally reasonable to start using io_uring just as the eventing mechanism (io_uring lite as you call it) and later on we can switch all syscalls to their io_uring equivalent. That'd be super cool for NonBlockingFileIO too!
There'll be a small pause on progress here; I need to spend some time on testing https://github.com/axboe/liburing/issues/310 which was added just now - this will give much better impedance matching to how SwiftNIO works with uring and eliminate a significant amount of unnecessary re-registrations that the current implementation requires. I intend to use this new uring functionality when it is working, so uring support will eventually require a 5.13+ kernel (which is fine I think; older systems will be able to use epoll of course).
Ok, short update of current status:
I've migrated the implementation to instead use multishot polls / updates from https://github.com/axboe/liburing/issues/310 (so all tests are running on a 5.12rc3+multishot-polls kernel; it will be some time before that is mainstream).
Current results from 'swift tests' compared to a clean clone of SwiftNIO running with epoll:
epoll from mainline:
Test Suite 'All tests' failed at 2021-03-24 15:20:14.133
Executed 1311 tests, with 2 failures (0 unexpected) in 81.388 (81.388) seconds
io_uring development branch:
Test Suite 'All tests' failed at 2021-03-24 15:35:49.859
Executed 1305 tests, with 2 failures (0 unexpected) in 107.069 (107.069) seconds
The uring runtime is longer due to tons of debug logging which is still in place, so don't read too much into the numbers as of yet.
There are a handful of tests that I've disabled that I may want to discuss how to progress with later, when the code has been cleaned up enough for a draft PR (and one of the tests is missing as I'm lagging main by a few revisions):
SelectorTest.testWeDoNotDeliverEventsForPreviouslyClosedChannels
StreamChannelTest.testWritabilityStartsTrueGoesFalseAndBackToTrue
StreamChannelTest.testHalfCloseOwnOutput
StreamChannelTest.testLotsOfWritesWhilstOtherSideNotReading
NIOHTTP1TestServerTest.testConcurrentRequests
(NonBlockingFileIOTest.testThrowsErrorOnUnstartedPool not run yet)
For scripts/integration_tests.sh the status is fairly decent; everything passes except one of the malloc tests (test_1000_udpconnections), which is currently disabled.
> scripts/integration_tests.sh
Running test suite 'tests_01_http'
Running test 'test_01_get_file.sh'... OK (3s)
Running test 'test_02_get_random_bytes.sh'... OK (2s)
Running test 'test_03_post_random_bytes.sh'... OK (3s)
Running test 'test_04_keep_alive_works.sh'... OK (1s)
Running test 'test_05_repeated_reqs_work.sh'... OK (2s)
Running test 'test_06_http_1.0.sh'... OK (1s)
Running test 'test_07_headers_work.sh'... OK (1s)
Running test 'test_08_survive_signals.sh'... OK (1s)
Running test 'test_09_curl_happy_with_trailers.sh'... OK (2s)
Running test 'test_10_connection_drop_in_body_ok.sh'... OK (1s)
Running test 'test_11_res_body_streaming.sh'... OK (5s)
Running test 'test_12_headers_too_large.sh'... OK (1s)
Running test 'test_13_http_pipelining.sh'... OK (3s)
Running test 'test_14_strict_mode_assertion.sh'... OK (2s)
Running test 'test_15_post_in_chunked_encoding.sh'... OK (2s)
Running test 'test_16_tcp_client_ip.sh'... OK (1s)
Running test 'test_17_serve_massive_sparse_file.sh'... OK (20s)
Running test 'test_18_close_with_no_keepalive.sh'... OK (2s)
Running test 'test_19_connection_drop_while_waiting_for_response_uds.sh'... OK (2s)
Running test 'test_20_connection_drop_while_waiting_for_response_tcp.sh'... OK (1s)
Running test 'test_21_connection_reset_tcp.sh'... OK (2s)
Running test 'test_22_http_1.0_keep_alive.sh'... OK (1s)
Running test 'test_23_repeated_reqs_with_half_closure.sh'... OK (7s)
Running test 'test_24_http_over_stdio.sh'... OK (0s)
Running test suite 'tests_02_syscall_wrappers'
Running test 'test_01_syscall_wrapper_fast.sh'... OK (12s)
Running test 'test_02_unacceptable_errnos.sh'... OK (4s)
Running test suite 'tests_03_debug_binary_checks'
Running test 'test_01_check_we_do_not_link_Foundation.sh'... OK (1s)
Running test 'test_02_expected_crashes_work.sh'... OK (2s)
Running test suite 'tests_04_performance'
Running test 'test_01_allocation_counts.sh'...
...
OK (219s)
Running test suite 'tests_05_assertions'
Running test 'test_01_syscall_wrapper_fast.sh'... OK (0s)
OK (ran 30 tests successfully)
Next step is to clean up and refactor the implementation just a tad before being able to do a draft PR to discuss the final failing tests.
Many thanks to Johannes Weiß for helping out getting a workable toolchain up and to Cory Benfield for helping out with some roadblocks today with the SAL especially, much appreciated.
@hassila Wow, this is super impressive! Great work.
Added https://github.com/apple/swift-nio/pull/1788 to be able to discuss a) the current approach (split of Selectors) and b) the remaining failing cases.
One more question with a performance aspect: I also see this pattern fairly often in a few tests (e.g. EchoServerClientTest.testPortNumbers), all occurring in the same event loop tick before calling whenReady again:
S [NIOThread(actualName = NIO-ELT-179-#0)] register interested SelectorEventSet(rawValue: 1) uringEventSet [24]
S [NIOThread(actualName = NIO-ELT-179-#0)] Re-register old SelectorEventSet(rawValue: 1) new SelectorEventSet(rawValue: 3) uringEventSet [24] reg.uringEventSet [8216]
S [NIOThread(actualName = NIO-ELT-179-#0)] Re-register old SelectorEventSet(rawValue: 3) new SelectorEventSet(rawValue: 7) uringEventSet [8216] reg.uringEventSet [8217]
This translates to: a registration for reset, then a change to reset+readEOF, and then a change to reset+readEOF+read.
Is there a good reason for this pattern and is it common?
If so, would it be possible to elide all changes to registrations and only perform the low-level kqueue/uring/epoll changes at the beginning of whenReady, instead of doing them inline with the actual call? I'm thinking that given that it all happens in the same tick, it would semantically be the same, but it would allow us to make all the changes with one system call instead of three. So basically just keep track of the pending changes to the registration and, at the top of whenReady, execute them before reaping new events. (It would be extra attractive for uring, which can basically merge all system calls for all the different fds if done that way...)
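A rough sketch of that idea (the type and names here are hypothetical, not NIO API): record only the latest interest set per fd and flush everything in one go at the top of whenReady.

// Keep only the final requested interest per file descriptor within a tick;
// intermediate re-registrations are elided.
struct PendingRegistrationChanges {
    private var latestInterest: [CInt: UInt32] = [:]

    mutating func record(fd: CInt, interest: UInt32) {
        latestInterest[fd] = interest
    }

    // Called at the top of whenReady(): apply all changes before reaping new events,
    // e.g. queue one poll-update SQE per fd and then do a single io_uring_submit.
    mutating func flush(_ apply: (CInt, UInt32) -> Void) {
        for (fd, interest) in latestInterest {
            apply(fd, interest)
        }
        latestInterest.removeAll(keepingCapacity: true)
    }
}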
Here is another example, from EchoServerClientTest.testConnectingToIPv4And6ButServerOnlyWaitsOnIPv4, where it moves over 4 different stages during the same tick (going back to the original one) - I understand the need for changing the actual registration, but it seems eliding the change of the external (kqueue/epoll/uring) registration could be quite beneficial if these are representative of real-world use.
S [NIOThread(actualName = NIO-ELT-181-#0)] whenReady.blockUntilTimeout
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_wait_cqe_timeout.ETIME milliseconds __kernel_timespec(tv_sec: 0, tv_nsec: 248841960)
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_wait_cqe_timeout CQE:s [0x00007fb29cc8edd0] - ring flags are [0]
L [NIOThread(actualName = NIO-ELT-181-#0)] 0 = fd[17] eventType[Optional(NIO.CqeEventType.poll)] res [4] flags [2] bitpattern[Optional(0x0000000100000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_wait_cqe_timeout fd[17] eventType[Optional(NIO.CqeEventType.poll)] bitPattern[4294967313] cqes[0]!.pointee.res[4]
S [NIOThread(actualName = NIO-ELT-181-#0)] We found a registration for event.fd [17]
S [NIOThread(actualName = NIO-ELT-181-#0)] selectorEvent [SelectorEventSet(rawValue: 8)] registration.interested [SelectorEventSet(rawValue: 9)]
S [NIOThread(actualName = NIO-ELT-181-#0)] intersection [SelectorEventSet(rawValue: 8)]
S [NIOThread(actualName = NIO-ELT-181-#0)] running body [NIOThread(actualName = NIO-ELT-181-#0)] SelectorEventSet(rawValue: 8) SelectorEventSet(rawValue: 8)
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[28] newPollmask[28] userBitpatternAsPointer[Optional(0x0000000200000011)]
S [NIOThread(actualName = NIO-ELT-181-#0)] Re-register old SelectorEventSet(rawValue: 9) new SelectorEventSet(rawValue: 11) uringEventSet [28] reg.uringEventSet [8220]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[28] newPollmask[8220] userBitpatternAsPointer[Optional(0x0000000200000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush io_uring_submit needed [1] submission(s), submitted [2] SQE:s out of [2] possible
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
S [NIOThread(actualName = NIO-ELT-181-#0)] Re-register old SelectorEventSet(rawValue: 11) new SelectorEventSet(rawValue: 15) uringEventSet [8220] reg.uringEventSet [8221]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[8220] newPollmask[8221] userBitpatternAsPointer[Optional(0x0000000200000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush io_uring_submit needed [1] submission(s), submitted [1] SQE:s out of [1] possible
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
S [NIOThread(actualName = NIO-ELT-181-#0)] Re-register old SelectorEventSet(rawValue: 15) new SelectorEventSet(rawValue: 7) uringEventSet [8221] reg.uringEventSet [8217]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_poll_update fd[17] oldPollmask[8221] newPollmask[8217] userBitpatternAsPointer[Optional(0x0000000200000011)]
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush io_uring_submit needed [1] submission(s), submitted [1] SQE:s out of [1] possible
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush
L [NIOThread(actualName = NIO-ELT-181-#0)] io_uring_flush done
S [NIOThread(actualName = NIO-ELT-181-#0)] whenReady.block
Decoding the SelectorEventSet raw values above:
7 = read + reset + readEOF
9 = write + reset
11 = write + reset + readEOF
15 = write + reset + readEOF + read
7 = read + reset + readEOF
For uring it could be beneficial anyway, even if not super common, as we would only have a single syscall for all 'external' registration modifications and not one per fd as for epoll/kqueue.
Yup, this is expected behaviour. We can break out the registrations/deregistrations and talk about when they happen, which can help shed some light on how this happens.
- reset. This registration is always active, as it is fundamentally unmaskable on Linux.
- readEOF. This registration is made when the channel becomes active, and is only removed when the channel actually reads EOF. If the channel never hits readEOF for some reason then the registration will never be explicitly removed; it will only be removed implicitly when the channel is unregistered from the selector.
- write. This registration happens whenever we get EAGAIN on a write, and will persist until the Channel write buffer is drained (note that this is the Channel write buffer, not the kernel write buffer). This will tend to flap on and off, but not as fast as...
- read. This registration is managed carefully because it interacts with NIO's backpressure mechanism. In particular, the way NIO backpressure works is that NIO assumes that all channels do not want to read until they ask to, and then that they only want to read once. This is a "pull"-based method of exerting backpressure, rather than a "push"-based one. Pull-based methods are substantially easier to understand, generally speaking. In some cases this can make read appear to flap. Now, read should not flap in most cases because autoRead will usually prevent it flapping, but it can't guarantee it.
Note that nowhere in here does this say the registrations have to actually happen when the channel asks for them. A selector is free to "batch" them up so long as it never delivers an event that the Channel didn't ask for. The contract is, if the Channel hasn't registered for it, then the event must not be delivered to it. So long as the io_uring-based selector obeys that contract, it can turn this into syscalls however it wants.
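To make the pull-based read model concrete, here is a small handler using existing NIO API; it assumes autoRead has been disabled on the channel (e.g. via ChannelOptions.autoRead set to false):

import NIO

// With autoRead off, read interest is only registered when the handler asks for a read.
final class OneReadAtATimeHandler: ChannelInboundHandler {
    typealias InboundIn = ByteBuffer

    func channelActive(context: ChannelHandlerContext) {
        // Pull exactly one read; until then no read event needs to be delivered.
        context.read()
    }

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        let buffer = self.unwrapInboundIn(data)
        // ... consume `buffer`; exert backpressure by simply not asking for more ...
        print("received \(buffer.readableBytes) bytes")
        // Ask for the next read only once we are ready for more data.
        context.read()
    }
}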
Thanks @Lukasa - that is what I hoped, I will try to batch and elide all changes done to registrations then.
A bigger discussion is ET vs LT (as we touched upon earlier), as the current non-exhaustive reads (and writes?) basically force the implementation to prod the kernel unnecessarily. If it were possible to tell from a given registration whether exhaustive reads/writes have been done, we could avoid that prod, which would make a major difference.
In this instance I think rather than do that work we should think to the long term, which is an event loop that submits the work to io_uring directly rather than using it as a glorified epoll. :wink:
That's fair enough; the additional prodding currently seems to make it perform less well than epoll (although if https://github.com/axboe/liburing/issues/321 turns up it may turn the tables, as we would then presumably skip receiving CQEs for the prods that don't generate any data).
Fundamentally things are getting fairly close to working (so we at least have infrastructure to go forward with a proper EL), but it seems a proper uring EL will be needed to really make a difference.
I'd want to get the glorified epoll up and running and completing all tests first though...
Any pointers on an appropriate place to try to hook in for read/write would be appreciated - basically we need to manage a set of read and write buffers at the EL level which can only be reused/freed when getting the CQEs later on. So the memory must be owned by the EL.
I think practically speaking we'll need to diverge away from SelectableEventLoop entirely: the mechanism used by io_uring is incompatible with the SelectableEventLoop model where the various channels own their I/O operations.
@Lukasa not sure if this is true... we use io_uring in a very similar way to how we use epoll etc. in Netty, so I guess you could do the same. We basically still use non-blocking fds etc. Maybe this is helpful:
@normanmaurer and I chatted offline. TL;DR: he agrees with me. :wink:
haha yeah @Lukasa is right... I misunderstood him. That said, the link I provided is still useful I think. Also how we use the submission and completion queues:
https://github.com/netty/netty-incubator-transport-io_uring/blob/main/src/main/java/io/netty/incubator/channel/uring/IOUringSubmissionQueue.java https://github.com/netty/netty-incubator-transport-io_uring/blob/main/src/main/java/io/netty/incubator/channel/uring/IOUringCompletionQueue.java
Ok, updated current state in https://github.com/apple/swift-nio/pull/1788#issuecomment-810393146 - please check if you want anything salvaged there.
New PR in https://github.com/apple/swift-nio/pull/1804.
Now that https://github.com/apple/swift-nio/pull/1804 landed, I will close this - follow up for the next steps will be in https://github.com/apple/swift-nio/issues/1831
Stumbled across another (new) async I/O interface that seems worth considering for the best possible performance on Linux. See some background here:
https://lwn.net/Articles/810414/ https://kernel.dk/io_uring.pdf
And one simple test vs epoll with performance numbers: https://github.com/frevib/io_uring-echo-server/blob/io-uring-op-provide-buffers/benchmarks/benchmarks.md
Is this something of interest?