axboe / liburing

Library providing helpers for the Linux kernel io_uring support

Do sequential io_uring_prep_send requests require linking? #1001

Closed: andyg24 closed this issue 10 months ago

andyg24 commented 11 months ago

I already posted this as a discussion topic 5 days ago but didn't get any traction.

For operations where the order of completion is significant, such as send requests to a TCP socket, can I assume that requests will be processed (i.e. serialized) in the order of submission, at least when submitted to the same ring by the same thread? Or do I still need to use IOSQE_IO_LINK in that situation?

It would be good to clarify this in the man page, as I couldn't easily find the answer.

jmillan commented 11 months ago

I'm not an expert in io_uring, but I'm quite certain that the SQEs are processed sequentially, i.e. in order of submission, yes.

axboe commented 11 months ago

The first issue attempt is always done in order; that's the only way we can pull SQEs off the SQ ring. If your send request completes inline (e.g. there's space in the socket buffer to put the data in), then it'll be done before the submit returns. However, if the socket buffer is full, an internal poll handler is armed and the request is completed once POLLOUT triggers. After this poll handler is armed, we process the next SQE in the ring, if any.

In other words, the answer is "it depends". We always issue in order, but if requests need poll triggering to complete, or if they need to punt to io-wq for processing, then you can have later SQEs in the ring being issued and completed before prior SQEs are fully issued.
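
If strict ordering across sends is needed, linking them within a single submission batch is one way to get it. A minimal sketch, assuming an already-initialized ring and a valid sockfd and buffers (error handling omitted):

#include <liburing.h>

/* Sketch: force two sends to the same socket to run in order by linking
 * them and submitting both in one io_uring_submit() call.
 * Assumes 'ring' is initialized and sockfd/buf1/buf2 are valid. */
static void send_two_in_order(struct io_uring *ring, int sockfd,
                              const void *buf1, size_t len1,
                              const void *buf2, size_t len2)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, sockfd, buf1, len1, 0);
    sqe->flags |= IOSQE_IO_LINK;      /* next SQE won't start until this one completes */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, sockfd, buf2, len2, 0);

    io_uring_submit(ring);            /* one syscall, ordered chain */
}

Keep in mind that if an earlier request in a link chain fails or completes short, the remaining linked requests are cancelled with -ECANCELED.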

isilence commented 11 months ago

Rewording the answer: it's not safe to assume that SQEs will be processed in order. In practice you may observe that they're executed in order, but only until something happens to break that.

Let me add that IOSQE_IO_LINK doesn't work across different submission syscalls (i.e. io_uring_submit() and others). So in the example below there will be two separate, unlinked requests.

sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_send(sqe1, sockfd, buf1, len1, 0);
sqe1->flags |= IOSQE_IO_LINK;
io_uring_submit(&ring);   /* the link chain ends with this submit */

sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_send(sqe2, sockfd, buf2, len2, 0);
io_uring_submit(&ring);   /* sqe2 runs independently, NOT linked to sqe1 */

And since traversing the kernel's networking stack is pretty expensive, you'll be far better off squashing the two sends into a single request by packing the data into one iovec array, instead of sending two requests.
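
A sketch of that, assuming an initialized ring, a valid sockfd, and two buffers to coalesce (io_uring_prep_sendmsg() is the existing liburing helper; the surrounding names are illustrative):

#include <liburing.h>
#include <sys/socket.h>

/* Sketch: coalesce two buffers into a single request with an iovec + sendmsg,
 * so the networking stack is traversed once instead of twice. */
static void send_coalesced(struct io_uring *ring, int sockfd,
                           void *buf1, size_t len1, void *buf2, size_t len2)
{
    /* For a non-SQPOLL ring the iovec/msghdr only need to stay valid until
     * io_uring_submit() returns; the data buffers themselves must stay
     * valid until the completion arrives. */
    struct iovec iov[2] = {
        { .iov_base = buf1, .iov_len = len1 },
        { .iov_base = buf2, .iov_len = len2 },
    };
    struct msghdr msg = {
        .msg_iov    = iov,
        .msg_iovlen = 2,
    };

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_sendmsg(sqe, sockfd, &msg, 0);
    io_uring_submit(ring);
}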

andyg24 commented 11 months ago

Is it the case then that the only way to use io_uring_prep_send() correctly with a TCP socket is to wait for everything to complete before issuing another submit()?

Because even with IO_LINK or when using an iovec array, the order of serialization to the socket is not guaranteed across submit() calls.

If so, that seems to negate a lot of the benefit of using an async interface for networking (the fact that you have to drain everything before issuing another submit() call), especially when networking is mixed with disk IO on the same ring.

Could there be a mechanism where the order of serialization could be guaranteed across submit() calls? For example, if requests for a socket were previously added to io-wq for processing, could we somehow keep track of that and assign future requests to the same queue, even if the socket becomes unblocked?

isilence commented 11 months ago

> Is it the case then that the only way to use io_uring_prep_send() correctly with a TCP socket is to wait for everything to complete before issuing another submit()?
>
> Because even with IO_LINK or when using an iovec array, the order of serialization to the socket is not guaranteed across submit() calls.

Right, the most basic way is to have only one request in flight per socket at any point in time. And before queuing more data to a socket you'd wait for the previous completion for that socket.

> If so, that seems to negate a lot of the benefit of using an async interface for networking (the fact that you have to drain everything before issuing another submit() call), especially when networking is mixed with disk IO on the same ring.

I don't think so; it might be a bit inconvenient for userspace but it doesn't change the math. First, you don't have to drain everything; it's just that before sending more data to a socket you need to wait for the previous request's completion for that particular socket. And it's still asynchronous, i.e. while waiting for a completion you can process other sockets / files / requests. There is no problem with disk IO whatsoever in that regard.

Also, getting a send request completion doesn't mean that the data was actually put onto the wire or delivered to the other end, it's just saying that you can queue more data, so there is no latency problem for that.

For example, you can look at how it's implemented in folly. There is a class for a socket, and when the user tries to send while a request is already in flight, it only adds the data to the socket's backlog. When the socket gets the send completion, it sends out the whole backlog.

That works even better because you have more time to batch, which amortizes the networking stack overhead by sending fewer requests with more data.
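
A rough sketch of that backlog pattern in plain liburing terms (the struct and function names are illustrative, not folly's API; bounds checks and error handling for cqe->res < 0 are omitted):

#include <liburing.h>
#include <stdbool.h>
#include <string.h>

/* Per-socket state: at most one send in flight; extra data accumulates in
 * a backlog that is flushed when the completion arrives. */
struct sock_ctx {
    int fd;
    bool send_inflight;
    char backlog[64 * 1024];      /* bytes queued but not yet submitted */
    size_t backlog_len;
};

static void start_send(struct io_uring *ring, struct sock_ctx *s)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, s->fd, s->backlog, s->backlog_len, 0);
    io_uring_sqe_set_data(sqe, s);
    s->send_inflight = true;
}

/* Application-facing "send": queue the data, submit only if the socket is idle. */
void sock_send(struct io_uring *ring, struct sock_ctx *s,
               const void *data, size_t len)
{
    memcpy(s->backlog + s->backlog_len, data, len);
    s->backlog_len += len;
    if (!s->send_inflight)
        start_send(ring, s);
}

/* Call when the CQE for this socket's send arrives; res = cqe->res. */
void sock_send_done(struct io_uring *ring, struct sock_ctx *s, int res)
{
    memmove(s->backlog, s->backlog + res, s->backlog_len - res);
    s->backlog_len -= res;
    s->send_inflight = false;
    if (s->backlog_len)
        start_send(ring, s);      /* everything batched meanwhile goes in one send */
}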

> Could there be a mechanism where the order of serialization could be guaranteed across submit() calls? For example, if requests for a socket were previously added to io-wq for processing, could we somehow keep track of that and assign future requests to the same queue, even if the socket becomes unblocked?

That'd be complicated, and at the same time it can be done easily and cheaply in userspace.

andyg24 commented 11 months ago

Thank you, this is very helpful.

It didn't cross my mind that only a single request to a socket could be in flight at any given time. May I ask what happens when there are only a few bytes available in the socket's buffer and a large send request with MSG_WAITALL is submitted? Will I get a single completion event when the entire message is flushed to the socket's buffer? Will an intermediate copy of the message be made inside of io_uring (i.e. do I need to preserve the message buffer in userspace after submission, especially when using SEND_ZC)? Can a SEND request ever fail when used with MSG_WAITALL (assuming the socket is not closed, etc)?

Without MSG_WAITALL, I assume I will get a short write if the entire message cannot be flushed to the socket's send buffer.

isilence commented 11 months ago

> May I ask what happens when there are only a few bytes available in the socket's buffer and a large send request with MSG_WAITALL is submitted? Will I get a single completion event when the entire message is flushed to the socket's buffer?

Yes, for OP_SEND there should be only one completion. And for SEND_ZC, unless there was an early fail it'll also post the notification CQE.

Note, there are features, e.g. multishot recv, that can produce more than one completion, but nothing like that exists for sends.

> Will an intermediate copy of the message be made inside of io_uring (i.e. do I need to preserve the message buffer in userspace after submission, especially when using SEND_ZC)?

There won't be any copies; you have to keep the buffer alive until you get the completion back, and neither MSG_WAITALL nor any other flag changes that. It's fine though to free the struct msghdr and/or the iovec itself once the submit syscall returns, with !SQPOLL rings. With SEND_ZC you'd need to wait for the notification CQE before you can reuse / release buffers.
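
For SEND_ZC specifically, a sketch of the completion handling, assuming user_data points at some per-request state (struct send_req and buffer_release() are made-up names): the result CQE arrives first with IORING_CQE_F_MORE set, and the buffer may only be reused once the notification CQE with IORING_CQE_F_NOTIF arrives.

#include <liburing.h>

struct send_req {
    void *buf;     /* zero-copy buffer, must stay valid until the notification */
    int   sent;    /* result of the send */
};

void buffer_release(void *buf);   /* hypothetical helper */

static void reap_send_zc(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_wait_cqe(ring, &cqe) == 0) {
        struct send_req *req = io_uring_cqe_get_data(cqe);

        if (cqe->flags & IORING_CQE_F_NOTIF) {
            /* Notification CQE: the kernel is done with the pages,
             * only now may the buffer be reused or freed. */
            buffer_release(req->buf);
        } else {
            /* Result CQE: cqe->res is the send result. If IORING_CQE_F_MORE
             * is set, a notification CQE for this request is still pending. */
            req->sent = cqe->res;
        }
        io_uring_cqe_seen(ring, cqe);
    }
}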

> Can a SEND request ever fail when used with MSG_WAITALL (assuming the socket is not closed, etc)?

It can, and it can also be a short send, even though it's not very likely. That's mostly because we can't guess where and how lower layers might fail, and we also can't do anything if allocations on the io_uring side fail.

> Without MSG_WAITALL, I assume I will get a short write if the entire message cannot be flushed to the socket's send buffer.

right