apple / swift-nio

Event-driven network application framework for high performance protocol servers & clients, non-blocking.
https://swiftpackageindex.com/apple/swift-nio/documentation
Apache License 2.0

Refactor Channels and MultiThreadedEventLoopGroup for io_uring #1890

Open · Lukasa opened this issue 3 years ago

Lukasa commented 3 years ago

cc @hassila @weissi

As part of the io_uring PR (#1788) we had a number of discussions about how we might need to evolve the NIO codebase to fully support io_uring, instead of using it in the "glorified epoll" mode we do today. I've thought on this some more and want to lay out a proposal for how we might evolve the current implementations to make it much easier to implement an io_uring based event loop, ideally without any API change at all and with minimal changes to the Channels themselves.

Problem Statement

io_uring operates, at a fundamental level, by making system calls asynchronous. Each system call is enqueued in the submission queue in the form of an SQE. Some time later, the kernel will either poll for an entry or we will call io_uring_enter and the kernel will begin processing the submitted system calls. Finally, at yet another time in the future, we will read from the completion queue, which will notify us of some of the completed events.

The first and most obvious problem we have here is that this is not how the current NIO Channels operate. In NIO today, the Channels are responsible for making their own I/O system calls: they call read/recvmsg/recvmmsg/write/writev/sendmsg/sendmmsg themselves, in response to both user activity and the notifications from the selectable event loop. Additionally, they assume all system calls are synchronous and non-blocking, meaning that they can issue these system calls directly without blocking the event loop and will synchronously find out whether there is space in the buffer.

In io_uring, system calls are no longer synchronous: we do not get their results right away. Indeed, we aren't making system calls at all, we're enqueuing I/O operations onto the submission queue. From time to time the event loop will want to make a system call, but those times are not directly correlated with when the Channel wants to make its system call. Additionally, these operations work on top of a limited shared resource: the submission queue. The submission queue is not arbitrarily large, and when it fills up we cannot submit further I/O work to it until the kernel has processed it. This means we need an efficient way to know what I/O operations have been "enqueued" without actually making it to the kernel, so we can process them as space appears in the submission queue.
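
To make the contrast concrete, here is a minimal conceptual sketch in Swift. All of the types below are hypothetical stand-ins (this is not NIO code, and not a real liburing binding): today a Channel issues the non-blocking syscall itself and learns the result synchronously, whereas under io_uring it can only describe the operation, place it on a bounded submission queue, and learn the result later when the completion is reaped.

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Today's model: the Channel issues the non-blocking syscall and gets a synchronous answer.
func channelWritesDirectly(fd: Int32, bytes: [UInt8]) -> Int {
    bytes.withUnsafeBytes { ptr in
        write(fd, ptr.baseAddress, ptr.count) // returns bytes written (or -1/EAGAIN) right now
    }
}

// io_uring's model: the Channel can only describe the operation; the result is deferred.
struct DescribedWrite {
    let fd: Int32
    let bytes: [UInt8]
    let whenComplete: (Int) -> Void // invoked when the corresponding CQE is reaped, later
}

final class BoundedSubmissionQueueModel {
    private var queued: [DescribedWrite] = []
    let depth: Int

    init(depth: Int) { self.depth = depth }

    /// Returns false if the ring is full: the caller cannot describe more I/O until the
    /// kernel has consumed what is already there.
    func tryDescribe(_ op: DescribedWrite) -> Bool {
        guard queued.count < depth else { return false }
        queued.append(op)
        return true
    }

    /// Stand-in for io_uring_enter(2) plus reaping the completion queue: only here do the
    /// operations actually happen, and only here do the callers learn their results.
    func enterAndReap() {
        let batch = queued
        queued.removeAll()
        for op in batch {
            let result = channelWritesDirectly(fd: op.fd, bytes: op.bytes)
            op.whenComplete(result)
        }
    }
}
```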

Proposed Solution

We had originally discussed implementing new Channels and new Event Loops in order to support the io_uring model of operation. I think I have a better option: let's make Event Loops do I/O, instead of Channels.

This change requires moving a bunch of work out of the Channel and into the EventLoop. We cannot simply move the I/O operations (i.e. change the PendingWritesManagers to call EventLoop functions instead of system calls), because the current pending writes manager backpressure implementation relies on having synchronous feedback about write completion.

The result of this is that we need to perform multiple steps, roughly as follows (a rough sketch of the resulting API shape follows the list):

  1. Enhance the Channel to EventLoop outbound API to submit I/O operations to the EventLoop directly.

    The goal of this is to allow the EventLoop to be in charge of executing the I/O operations itself. For the current selector-based implementation we will continue to emulate our existing pattern by executing the I/O directly.

  2. Enhance the pending writes managers to be able to be told about I/O results asynchronously.

    This is a bit awkward, but necessary. In the io_uring flow we won't know about the result of an I/O event until sometime later. This means we need to carefully refactor the pending writes managers to ensure that they use the exact same path for async and sync data responses: if they get a synchronous result of I/O they must go through the same path as the asynchronous result.

  3. Event loops must have a queue of pending submitted I/O events that have not yet completed.

    When a channel wants to submit I/O to the event loop, it should be able to do so, even if there isn't space in the submission queue. Concretely, we want to ensure that event loops are able to maintain an order of I/O operations to ensure fairness of I/O dispatch.

    This queue can potentially get extremely large, so we likely do need to provide an upper bound on it. In principle pending writes managers can constrain the size of this queue because they are going to be providing backpressure. We can mitigate the size of the queue somewhat by using it as a notification queue, storing only 8 bytes of state per Channel (roughly, storing "which Channel is next to do I/O"). This also helps us address the problem of a Channel being closed after it enqueues I/O without forcing us to clear out the pending queue.

  4. Event loops must be asked to perform reads on the Channel's behalf.

    This change is probably the easiest to implement. Right now, unlike with writes, Channels never optimistically read except when they have received EOF. They always ask for readability notifications and respond to them. Moving the request to perform a read out of the Channel and into the EventLoop is reasonably straightforward.

    Care must be taken with this refactor to preserve the EOF behaviour, however, and I don't know how best to do this.
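
As a sketch of one possible shape that steps 1–4 gesture at, consider the Swift outline below. Every type and method name is hypothetical (none of this is existing NIO API), and buffers are modelled as `[UInt8]` rather than `ByteBuffer` to keep the example self-contained: the Channel describes I/O to the loop, results come back through a single (possibly asynchronous) completion path, and the loop's pending queue stores only "which Channel is next".

```swift
/// Step 2: I/O results arrive asynchronously; a selector-backed loop may happen to
/// deliver them synchronously, but the Channel must use the same code path either way.
typealias IOResult = Result<Int, Error>

/// Step 1: the Channel describes I/O instead of performing it (hypothetical shape).
enum ProposedIOOperation {
    case read(maxBytes: Int)
    case write(bytes: [UInt8])
}

/// A stand-in for "the Channel", reduced to what the loop needs from it.
protocol IOSubmitter: AnyObject {
    /// Called by the loop when it has submission capacity; returns the next operation
    /// the channel wants performed, or nil if it no longer has anything to do
    /// (e.g. because it was closed after it asked for I/O).
    func nextIOOperation() -> ProposedIOOperation?
    /// Called, possibly much later, with the result of a previously returned operation.
    func ioCompleted(_ result: IOResult)
}

/// Steps 3 & 4: the event loop owns the submission path, the reads, and the pending queue.
final class IOOwningEventLoopSketch {
    /// The fairness queue from step 3: we queue only "which channel is next", not whole
    /// operations, so a Channel that closes after enqueueing simply returns nil above
    /// and nothing in this queue needs to be scrubbed.
    private var waiting: [ObjectIdentifier: IOSubmitter] = [:]
    private var order: [ObjectIdentifier] = []

    /// Simulated free space in the kernel submission queue.
    private var freeSQESlots: Int

    init(submissionQueueDepth: Int) {
        self.freeSQESlots = submissionQueueDepth
    }

    /// What Channels call instead of making syscalls themselves.
    func channelWantsIO(_ channel: IOSubmitter) {
        let id = ObjectIdentifier(channel)
        if waiting[id] == nil {
            waiting[id] = channel
            order.append(id)
        }
        drainIfPossible()
    }

    /// Move as many waiting channels as possible into the (bounded) submission queue.
    private func drainIfPossible() {
        while freeSQESlots > 0, !order.isEmpty {
            let id = order.removeFirst()
            guard let channel = waiting.removeValue(forKey: id),
                  let op = channel.nextIOOperation() else {
                continue // channel closed or has nothing to do
            }
            freeSQESlots -= 1
            performOrEnqueue(op) { result in
                self.freeSQESlots += 1
                channel.ioCompleted(result) // same path whether sync or async
            }
        }
    }

    /// The existing selector-based loops would just make the syscall here (and call the
    /// completion synchronously); an io_uring loop would fill in an SQE instead and call
    /// the completion when the matching CQE is reaped.
    private func performOrEnqueue(_ op: ProposedIOOperation,
                                  whenComplete: @escaping (IOResult) -> Void) {
        switch op {
        case .read:
            whenComplete(.success(0))            // placeholder: no real I/O in this sketch
        case .write(let bytes):
            whenComplete(.success(bytes.count))  // placeholder: pretend everything was written
        }
    }
}
```

The nil-returning `nextIOOperation()` is one way to get the "Channel closed after it enqueued I/O" behaviour described in step 3 without ever walking the pending queue to remove entries.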

Timeline

While this work is important, it's not currently my highest priority, so I wanted to throw these thoughts down somewhere they can be referenced by others. I think it would be useful to implement these refactors on top of our existing event loops, assuming that they do actually make space to implement io_uring-based loops correctly. This should provide us with an opportunity to integrate io_uring more smoothly into our I/O path.

weissi commented 3 years ago

CC @normanmaurer

weissi commented 3 years ago

@Lukasa Thanks for writing this up! I think it would probably be worth thinking about the number of core abstractions in NIO again. Currently (and AFAIK even if this proposal gets implemented as is), the lower layers of NIO are essentially the EventLoops (which own the I/O eventing and the execution of work) and the Channels (which perform the actual I/O operations).

(ChannelHandlers, pipelines, ... all sit on top/inside a Channel so don't really matter all too much for this discussion).

As you point out, with io_uring the split between the I/O operations and the I/O eventing no longer works, unless (as previously discussed) we create a new set of EventLoops and Channels specific to io_uring. So I can see the appeal of what you propose, which is to unify all of the I/O (including, for epoll/kqueue, the eventing) in the EventLoops.

But if we're doing such a big rethink, I think (for NIO 3+) we should also consider the extensibility of SwiftNIO. With async/await coming we have a unique opportunity to, over time, remove most of SwiftNIO from the APIs of most user-facing libraries (such as HTTP clients, servers, ...), which means most of SwiftNIO can become an implementation-only dependency for most libraries. So in the future, if one implements a library with SwiftNIO, the only types that will likely leak into the API are EventLoop(Group) (especially once we have custom executors).

So what I'm thinking is if we should split the EventLoops into 2 pieces:

  1. (in some to-be-created NIOExecutors module) the "executor" piece which will be the custom executor (and hopefully an MPSC queue in the future)
  2. (in the NIO module) the I/O pieces

On top of that, we can then have much more lightweight Channels which just ask the underlying I/O piece to do I/O and ask the underlying "executor" piece to execute work.
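
To illustrate the split being suggested here, a rough Swift sketch with made-up names (a NIOExecutors module does not exist today; WorkExecutor, IOSubsystem, and LightweightChannelSketch are purely illustrative): the "executor" piece only runs work, the I/O piece only performs I/O, and a Channel becomes a thin object that talks to both.

```swift
/// The piece that would live in a hypothetical "NIOExecutors" module: it only runs work
/// (and could eventually be a custom Swift Concurrency executor backed by an MPSC queue).
protocol WorkExecutor {
    func execute(_ work: @escaping () -> Void)
}

/// The piece that would stay in the NIO module: it only performs I/O.
protocol IOSubsystem {
    func performWrite(fd: Int32,
                      bytes: [UInt8],
                      whenComplete: @escaping (Result<Int, Error>) -> Void)
}

/// A much lighter-weight "Channel": it owns no selector and issues no syscalls; it merely
/// asks the I/O piece for I/O and the executor piece for execution.
final class LightweightChannelSketch {
    private let executor: WorkExecutor
    private let io: IOSubsystem
    private let fd: Int32

    init(executor: WorkExecutor, io: IOSubsystem, fd: Int32) {
        self.executor = executor
        self.io = io
        self.fd = fd
    }

    func write(_ bytes: [UInt8], whenComplete: @escaping (Result<Int, Error>) -> Void) {
        io.performWrite(fd: fd, bytes: bytes) { result in
            // Bounce the completion back onto "our" executor so user code keeps the
            // same threading guarantees it gets from an EventLoop today.
            self.executor.execute { whenComplete(result) }
        }
    }
}
```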

Maybe this work can be separated from what you propose here, but I think it is worth a holistic look given the huge opportunity async/await represents for the extensibility options that open up if NIO can become an implementation-only package.

hassila commented 3 years ago

Many thanks for the writeup @Lukasa, I definitely want to provide feedback and will try to get time during next week to do so; I'm a little bit swamped at the moment.

hassila commented 3 years ago

Sorry for the late feedback @Lukasa, I have read through what you wrote several times now and wanted to try to provide some hopefully useful feedback. I've started a few times and stopped, but finally, here is something:

> The submission queue is not arbitrarily large, and when it fills up we cannot submit further I/O work to it until the kernel has processed it. This means we need an efficient way to know what I/O operations have been "enqueued" without actually making it to the kernel, so we can process them as space appears in the submission queue.

Even though the submission queue isn't arbitrarily large, I would argue that if it is full when trying to schedule I/O, it is probably better to flush the pending items to the kernel rather than enqueuing them in user space for later submission.

The submission queue can be configured to be reasonably large, and flushing pending SQEs if it actually does fill up provides a reasonable backpressure mechanism against aggressive or inefficient producers, instead of queueing.

For a really I/O-intensive setup with a dedicated SQPOLL thread on the kernel side, I would be surprised if a properly configured uring would fill up at all.

This would contradict point 3 of your implementation proposal, but I'd probably prefer to provide backpressure "early" if one truly manages to push the uring to its limit.

> I think I have a better option: let's make Event Loops do I/O, instead of Channels.

+1 - it's great if the EL can own the I/O operation and related buffers (as discussed previously for true async reads in https://github.com/apple/swift-nio/issues/1805).

Also agree with @weissi that it might make sense to reconsider the abstractions in use - perhaps it's not really conceptually an EL anymore, but rather an "I/O subsystem" for the backend piece (@weissi point 2 as I take it).

normanmaurer commented 3 years ago

> Even though the submission queue isn't arbitrarily large, I would argue that if it is full when trying to schedule I/O, it is probably better to flush the pending items to the kernel rather than enqueuing them in user space for later submission.

Just wanted to note that this is exactly what we do in netty... When there is no space left we just call io_uring_enter before adding things
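
For illustration, a tiny sketch of that flush-before-add strategy under hypothetical types (a user-space model, not actual io_uring bindings): when the ring is full, flush what is already queued to the kernel before adding the new entry, so producers are throttled by the flush rather than by a user-space queue.

```swift
/// A user-space model of a bounded submission ring; String stands in for an SQE.
struct HypotheticalSubmissionRing {
    private(set) var pending: [String] = []
    let depth: Int

    var isFull: Bool { pending.count >= depth }

    /// Stand-in for io_uring_enter(2): hand everything queued so far to the kernel.
    /// In a real implementation this is where the producer naturally waits, which is
    /// the "early" backpressure described above.
    mutating func flushToKernel() {
        pending.removeAll()
    }

    /// Flush-before-add: never park SQEs in a user-space queue when the ring is full.
    mutating func add(_ sqe: String) {
        if isFull {
            flushToKernel()
        }
        pending.append(sqe)
    }
}
```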