axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Using selected buffers feature in io_uring_prep_write(3) #1126

Closed gxuu closed 2 months ago

gxuu commented 2 months ago

Two questions: a) Is it possible to use the selected buffer feature in a write request? b) Is it possible to provide a buffer to another ring, if this buffer was used by a different buffer ring (and has now been returned to the user)?

With respect to (a):

/* Suppose initialization and appropriate setup have been done on this io_uring struct */
int ret;
int o_bgid = 2; /* out buffer group id */
struct io_uring_buf_ring *buf_ring = io_uring_setup_buf_ring(ring, entries, o_bgid, 0, &ret);
/* Here, buffer_id is a unique buffer id within o_bgid, and size_mask is the result of io_uring_buf_ring_mask(3) */
/* Also suppose that buffer stores the content I would like to write */
io_uring_buf_ring_add(buf_ring, buffer, 4096, buffer_id, size_mask, 0);
io_uring_buf_ring_advance(buf_ring, 1);
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
/* Should I use it like this? The proxy example reads something like this, but with io_uring_prep_send */
io_uring_prep_write(sqe, fd, NULL, 0, 0);
/* It seems like calling prep_write clears out sqe->flags, so set these after the prep_write call */
sqe->buf_group = o_bgid;
sqe->flags |= IOSQE_BUFFER_SELECT;

/* Later, this write will be submitted and waited on, of course. */

And with respect to (b): same code example as (a), but suppose buffer is a selected buffer from another buffer ring, one that has been used and returned to the user. I assume the buffer can be reused, because the kernel doesn't really care. If it can indeed be reused, I am wondering when it can be reused?

Thanks gxu

gxuu commented 2 months ago

Plus, can I write to a buffer that is already provided to a buffer ring as a selected buffer?

axboe commented 2 months ago

a) writes don't currently support provided buffers. The reason behind provided buffers is that for reads/receives, there can be a big gap between issuing the read and when the buffer actually gets picked. If you have N reads inflight for such a setup, you don't necessarily want N buffers pinned for the duration either. Maybe some of these reads never trigger. For writes, this generally isn't a concern, as the write/send side completes quickly.

That said, I did recently implement provided buffer support for sends, as it provides a way to serialize sends (and other things, like making sends more efficient as you can pack multiple buffers into a single send). But for write itself, not sure I see the use case. It'd obviously be trivial to add.

b) Once a buffer is returned to the application via a CQE with IORING_CQE_F_BUFFER set, the application owns it and the kernel has no knowledge of it. You are free to do with it what you want - you can free it, provide it back to the same ring, or provide it to another ring. That's completely up to you.
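As a minimal sketch of that recycling (bufs[], BUF_SIZE, RING_ENTRIES, and other_br are hypothetical application-side names, not anything liburing defines):

/* on completion, check whether the CQE carries a provided buffer */
if (cqe->flags & IORING_CQE_F_BUFFER) {
        unsigned short bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

        /* hand the buffer to another ring's group; it could just as well go
         * back to the ring it originally came from */
        io_uring_buf_ring_add(other_br, bufs[bid], BUF_SIZE, bid,
                              io_uring_buf_ring_mask(RING_ENTRIES), 0);
        io_uring_buf_ring_advance(other_br, 1);
}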

Plus, can I write to a buffer that is already provided to a buffer ring as a selected buffer?

You may get to keep both pieces if you do that, as you may not know when the kernel picks that buffer and does something with it. So no, that'd be like doing a write(2) with a buffer and then from another thread changing the contents. It may work, but it depends entirely on timing which you don't necessarily control.

gxuu commented 2 months ago

Thanks for the answers.

I asked (a) because I want to use read/write on a socket. Is it better, when talking to a socket, to use prep_send and prep_recv?

For the use case, imagine an echo server that not only echoes back, but also writes the content from remote ends to a local file. If we had provided buffer support on write, then we could do prep_write no matter what the fd is. Makes the code simpler, IMHO.

axboe commented 2 months ago

I'd use recv/send on the socket and read/write on a regular file. There are various optimizations that can be done when we know we're dealing with a socket, the read/write path is very generic in that it needs to handle any kind of file type.

gxuu commented 2 months ago

Okay. And to use send with provided buffers, do we set the same things we would set for recv multishot?

Namely like this,

sqe->buf_group = buffer_group_id;
sqe->flags |= IOSQE_BUFFER_SELECT;

I see one can also set flags in, say, io_uring_prep_send. Is the last parameter of io_uring_prep_send and io_uring_prep_recv_multishot the flags for the request, or the flags for the corresponding syscalls?

axboe commented 2 months ago

Yeah you'd use it the same way for provided buffers, but note that (as mentioned) this isn't currently supported in any released kernel. The kernel patches for it are here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-recvsend-bundle

The flags for those two prep helpers are the MSG_* flags.
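As a rough sketch (sockfd and buffer_group_id are placeholder names), a multishot receive backed by a provided buffer group would be prepared like this, with the last argument carrying the MSG_* flags:

struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
/* addr/len are NULL/0 because the buffer is picked from the group at receive time */
io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
sqe->buf_group = buffer_group_id;
sqe->flags |= IOSQE_BUFFER_SELECT;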

gxuu commented 2 months ago

So I cannot use send with provided buffers yet? I thought I could, and saw weird things, alas (I assumed the examples could run on the newest kernel available, I use Arch :).

When I try to use io_uring_prep_recv_multishot, it seems like one shot was triggered even though I hadn't sent any message to the application. Is this normal?

Also, there was another problem when I tried io_uring. When I set IORING_FEAT_FAST_POLL or IORING_DEFER_TASK_RUN when initializing with io_uring_queue_init_params, I get -EINVAL. Though I am sure I should be able to use them, as the documentation suggests a kernel version of 5.x and I am using 6.x.

gxuu commented 2 months ago

I think I should be able to use it in the next release of kernel, right?

And many thanks for clarifications.

gxuu commented 2 months ago

Oh, my bad. I was checking IORING_CQE_F_MORE with the logic inverted.

gxuu commented 2 months ago

I tried to use io_uring_prep_recv_multishot with the last three parameters NULL, 0, and 0, and got EINVAL. This certainly is not an error from recv, since I did not set MSG_OOB in flags.

I have set sqe->flags and sqe->buf_group correctly. Also, I have added the buffer ring correctly. When I changed the call to io_uring_prep_read_multishot, it worked.

Any tips for how to look into the problem?

isilence commented 2 months ago

Also there was another problem when I try with io_uring. When I set IORING_FEAT_FAST_POLL or IORING_DEFER_TASK_RUN when initializing with io_uring_queue_init_params, I get -EINVAL.

IORING_SETUP_DEFER_TASKRUN has to go with IORING_SETUP_SINGLE_ISSUER, I'd bet that's the reason
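For example, something along these lines should initialize cleanly (a minimal sketch, the queue depth is arbitrary):

struct io_uring ring;
struct io_uring_params p = { 0 };

p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
int ret = io_uring_queue_init_params(256, &ring, &p);
/* ret < 0 is -errno, e.g. -EINVAL on kernels lacking either flag */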

isilence commented 2 months ago

So I cannot use send with provided buffers yet? I thought I could, and saw weird things, alas (I assumed the examples could run on the newest kernel available, I use Arch :).

Have you thought through how you're going to use it? Consider that you will need a separate io_uring buffer pool (i.e. group) per socket, otherwise you wouldn't know which socket sends what and where. Another difference compared with recv is that with receives you give a buffer, but the data comes asynchronously from the kernel. For sends, you pass a buffer that already has the data, so the data is available at the time you do the io_uring_prep_send*() call, and with the same effect you could just queue a normal send request.

In other words, I'd encourage you to pull the branch, build the kernel, write an app using it, and see whether it's what you actually want. That way you'd also be able to participate in designing / validating the interface before it hits upstream and can't be changed.

axboe commented 2 months ago

IORING_SETUP_DEFER_TASKRUN have to go with IORING_SETUP_SINGLE_ISSUER, I'd bet that's the reason

Or trying to pass in IORING_FEAT_FAST_POLL to begin with, which is an out flag that just reports whether the feature is available or not; it's not a setup flag.
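That is, the features are read back after init rather than requested, roughly:

struct io_uring ring;
struct io_uring_params p = { 0 };

if (io_uring_queue_init_params(256, &ring, &p) == 0 &&
    (p.features & IORING_FEAT_FAST_POLL)) {
        /* the kernel filled in p.features; fast poll is available */
}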

gxuu commented 2 months ago

Wasn't monitoring this issue.

Or trying to pass in IORING_FEAT_FAST_POLL to begin with, which is an out flag that just reports whether the feature is available or not; it's not a setup flag.

IORING_SETUP_DEFER_TASKRUN has to go with IORING_SETUP_SINGLE_ISSUER, I'd bet that's the reason

Indeed, after experiments I found that when setting flags to IORING_SETUP_CLAMP | IORING_SETUP_CQSIZE | IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER, the initialization is successful. I was careless reading the man pages and didn't notice that IORING_FEAT_FAST_POLL is an out flag.

Though, when OR-ing the above flags together with IORING_SETUP_SQPOLL, initialization failed. My suspicion is that SQPOLL cannot be set together with DEFER_TASKRUN. The proxy.c example also suggests this. Perhaps add documentation to the man page?

Also, is it possible to not set SINGLE_ISSUER if I change the design? I would like to use this feature and have multiple threads submitting requests. Consider a multi-threaded network server. Each client talks to the server in its own thread. I would design it such that each thread (aka each client-server channel) has its own ring to submit requests, so that when one fd gets stuck, other fds won't be affected. But with SINGLE_ISSUER set I cannot do that, is that correct?

Have you thought through how you're going to use it? Consider that you will need a separate io_uring buffer pool (i.e. group) per socket, otherwise you wouldn't know which socket sends what and where. Another difference compared with recv is that with receives you give a buffer, but the data comes asynchronously from the kernel. For sends, you pass a buffer that already has the data, so the data is available at the time you do the io_uring_prep_send*() call, and with the same effect you could just queue a normal send request.

Yea. I don't want to manage a send queue myself. This is the same reason proxy.c was implemented with the send_ring option. However, I am not so sure about this. Here's my understanding:

If a program submits (non-multishot) sends to io_uring, then it is NOT guaranteed that io_uring will process those requests in the given order (from the earliest prep to the latest prep). However, when submitting via a buf_ring, io_uring processes the requests in the order they were added to the buf_ring with io_uring_buf_ring_add (as proxy.c suggests).

Therefore, if I don't use a buffer ring to submit requests, I will have to maintain a queue and submit them one by one. I don't think this is a good idea: suppose I have a, b, c in the queue, then I have to submit_and_wait on a, get the result when processing it, and then submit_and_wait on b, even though these three requests are on the same fd.

If my reasoning above is not correct, or is not optimal, is there any other way I can do this? The only reason I want multishot send is that I don't want to manage the send ordering myself.

In other words, I'd encourage you to pull the branch, build the kernel, write an app using it, and see whether it's what you actually want. That way you'd also be able to participate in designing / validating the interface before it hits upstream and can't be changed.

Is it merged yet? How do I see whether a patch is merged? I have tried doing this and didn't get anywhere. I'm a newbie and this seems to be beyond my ability. Building the kernel is probably too time-consuming as well, as my current Linux box is a very old laptop with an i5-6200u and 8 GB of RAM.

gxuu commented 2 months ago

It feels like I didn't explain my intent well enough in my last comment. Suppose a, b, c are send requests that I want to submit. Suppose I want the other end to receive these requests in the order a, b, c. Also suppose that these send requests are all on the same fd.

Then my understanding is this:

With non-multishot send, I must first prep request a and do submit_and_wait. When I know the request on a has completed, I prep request b and submit_and_wait. If I prep ALL three requests and then do a SINGLE submit, it is not guaranteed by io_uring that a will be processed before b.

With multishot send, I simply add these requests to a buffer ring and do ONE submit_and_wait.

gxu

axboe commented 2 months ago

It feels like I didn't explain my intent well enough in my last comment. Suppose a, b, c are send requests that I want to submit. Suppose I want the other end to receive these requests in the order a, b, c. Also suppose that these send requests are all on the same fd.

Then my understanding is this:

With non-multishot send, I must first prep request a and do submit_and_wait. When I know the request on a has completed, I prep request b and submit_and_wait. If I prep ALL three requests and then do a SINGLE submit, it is not guaranteed by io_uring that a will be processed before b.

With multishot send, I simply add these requests to a buffer ring and do ONE submit_and_wait.

With multishot OR bundle send, you'd just need a single send submitted and it'd do a, b, and c in order. Note that multishot send isn't really a thing, it's just bundles now, as multishot doesn't necessarily make a lot of sense for send since there's no way to trigger on newly added buffers. For bundles, if you have {a, b, c} buffers ready, it'll go out as a single send with those three buffers, rather than three separate completions for each buffer. I just posted v2 of the patchset:

https://lore.kernel.org/io-uring/20240420133233.500590-2-axboe@kernel.dk/

axboe commented 2 months ago

Also, is it possible to not set SINGLE_ISSUER if I change the design? I would like to use this feature and have multiple threads submitting requests. Consider a multi-threaded network server. Each client talks to the server in its own thread. I would design it such that each thread (aka each client-server channel) has its own ring to submit requests, so that when one fd gets stuck, other fds won't be affected. But with SINGLE_ISSUER set I cannot do that, is that correct?

If each thread has its own ring, then you can use SINGLE_ISSUER just fine. You just cannot have multiple threads doing submits (or waits, with DEFER_TASKRUN) on each other's rings.
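A rough per-thread shape under that model (a sketch; each worker thread owns its ring for its whole lifetime):

static void *worker(void *arg)
{
        struct io_uring ring;
        struct io_uring_params p = { 0 };

        p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
        if (io_uring_queue_init_params(256, &ring, &p) < 0)
                return NULL;

        /* only this thread ever submits to or waits on this ring */
        ...

        io_uring_queue_exit(&ring);
        return NULL;
}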

gxuu commented 2 months ago

Thanks for clarifying things. Some of this is still a mystery to me:

If each thread has its own ring, then you can use SINGLE_ISSUER just fine. You just cannot have multiple threads doing submits (or waits, with DEFER_TASKRUN) on each other's rings.

That's very good news.

gxuu commented 2 months ago

And two more things:

Many thanks;

isilence commented 2 months ago

It feels like I didn't explain my intent well enough in my last comment. Suppose a, b, c are send requests that I want to submit. Suppose I want the other end to receive these requests in the order a, b, c. Also suppose that these send requests are all on the same fd.

Then my understanding is this:

With non-multishot send, I must first prep request a and do submit_and_wait. When I know the request on a has completed, I prep request b and submit_and_wait. If I prep ALL three requests and then do a SINGLE submit, it is not guaranteed by io_uring that a will be processed before b.

With multishot send, I simply add these requests to a buffer ring and do ONE submit_and_wait.

You can send them all together, pseudo-coded:

struct iovec iov[] = { {req_a_buf, ...}, {req_b_buf, ...}, {req_c_buf, ...}, };
struct msghdr msg = {
    ...
    .msg_iov = iov,
    .msg_iovlen = ARRAY_SIZE(iov),
};
sendmsg(sockfd, &msg, 0);

That's if you have all 3 buffers available at the same moment. Otherwise you send whatever you have, let's say {a}. Then you collect buffers, and when "a" completes, you send the rest together, {b,c}. Then consider that you need a separate pbuf ring per socket, which means they should be shallow to be memory efficient, which means you can potentially overrun them, in which case any serious application would still need a fallback, e.g. keeping a backlog of sends.

If you consider that sends are usually completed immediately when submit() returns, and that when they're not, you'll be waiting for half a tx window until you can push more data, it's not too easy to craft a case where send rings would greatly outperform the good old sendmsg. And I say "greatly" because potentially there might be some wins from using rings, e.g. send vs sendmsg or other, but I don't have numbers for that.
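For completeness, the io_uring form of that sendmsg() sketch just wraps the same msghdr (fd and msg as in the snippet above):

struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
io_uring_prep_sendmsg(sqe, sockfd, &msg, 0);
io_uring_submit(ring);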

axboe commented 2 months ago

Send ring (snd_ring) and send bundles (snd_bundle) are two different things. One can choose to use only the send ring and not turn on send bundles. It is suggested to turn on send bundles however, since it groups multiple send requests into one and makes the operation faster. And to turn on send bundles, one sets sqe->ioprio |= IORING_RECVSEND_BUNDLE, plus setting what would be set if the call was a read_multishot. Is this correct?

It's really not, a bundle is just multiple buffers from the same group. You could say that send ring is a subset of bundles, where it just sends one at a time. And yeah, you'd set them up the same way, and then set IORING_RECVSEND_BUNDLE like that.
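So, as a sketch against the patched branch (this isn't in a released kernel yet; sockfd and send_bgid are placeholder names), a bundle send is prepared like a buffer-select send plus the bundle flag:

struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
/* addr/len are NULL/0, the data comes from the provided send buffer group */
io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
sqe->buf_group = send_bgid;
sqe->flags |= IOSQE_BUFFER_SELECT;
sqe->ioprio |= IORING_RECVSEND_BUNDLE;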

With only the send ring turned on, one doesn't benefit from the performance increase of bundles, but can benefit from the ordering a buffer ring provides. Is this correct?

That is correct.

You clarified send ordering for the buffered send situation. Was my reasoning about the non-buffered send correct? How would one process send requests efficiently if send ordering is desired? My solution would obviously be slow. Was my reasoning that "io_uring decides which request, among all submitted requests, gets processed first" correct?

Without provided buffer support for send, you can only have a single one inflight per socket at any point in time, if you care about ordering. You could of course use sendmsg() and append to a vector, like Pavel described. That'll still give you a single send inflight, it just has multiple vectors with data.
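Serializing without provided buffers then follows the pattern described above, one send in flight per socket (a sketch):

struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

io_uring_prep_send(sqe, sockfd, buf_a, len_a, 0);
io_uring_submit(ring);
io_uring_wait_cqe(ring, &cqe);      /* wait for 'a' to complete */
io_uring_cqe_seen(ring, cqe);
/* only now prep and submit 'b', then 'c', in the same way */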

axboe commented 2 months ago

Send bundle is a mystery to me. You mentioned in the patch that it groups sends together. However, I looked through the code here https://lore.kernel.org/io-uring/20240420133233.500590-6-axboe@kernel.dk/ but couldn't find anything related to testing against a socket. Does that mean I actually cannot put send requests for different sockfds into the same buffer ring (if we want to use the send bundle feature)? It wasn't mentioned in the doc whether we are allowed to do this.

The buffer ring just provides ordering of the data. You could certainly send to more than one socket from a single provided buffer group, though I'm not so sure that would make a lot of sense? It makes more sense to have a buffer group per socket where you need it.

axboe commented 2 months ago

Can we add buffers of different sizes to a buffer ring? I ask because consider an echo server that uses the send ring feature and multishot read. Each read may have a different size. Ideally, one would add the selected buffer returned by the read into a send ring. Moreover, the user would like to add only the part of the buffer that has content in it (otherwise not only is it a waste, it is also more complicated for the other end to decode). This is trivial with a normal io_uring_prep_send since the user can specify the size of the buffer, but there's nowhere to set this size argument if we are using a send ring.

Buffers in a buffer group don't have a fixed size, each buffer can be any size it wants to be. The buffer itself holds the size, not the group. Each buffer looks like this:

struct io_uring_buf {
        __u64   addr;
        __u32   len;
        __u16   bid;
        __u16   resv;
};

and has an addr/len/bufferid associated with it.

You could certainly have a receive buffer of size N where the first part is e.g. a header, then add N - hdr_size and the offset into the buffer to a buffer group, and have the send do just the part you want to send.
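A sketch of that, assuming a receive completed with cqe->res bytes into buf and the first hdr_size bytes are a header that shouldn't be sent on (send_br and SEND_RING_ENTRIES are placeholder names):

/* hand only the payload portion to the send-side buffer group */
io_uring_buf_ring_add(send_br, buf + hdr_size, cqe->res - hdr_size, bid,
                      io_uring_buf_ring_mask(SEND_RING_ENTRIES), 0);
io_uring_buf_ring_advance(send_br, 1);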

gxuu commented 2 months ago

Thanks! I will do some more experimenting with them. Have a good day.

axboe commented 2 months ago

Send ring won't greatly outperform sendmsg with appended vectors, but it is faster in my testing. Running a basic 100b packet test case, proxying the packets, this is what I see with sendmsg:

Bandwidth (threads=1): 23,494Mbit
Bandwidth (threads=1): 23,402Mbit
Bandwidth (threads=1): 23,338Mbit
Bandwidth (threads=1): 23,475Mbit

and here's what I see with send ring:

Bandwidth (threads=1): 24,321Mbit
Bandwidth (threads=1): 24,335Mbit
Bandwidth (threads=1): 24,354Mbit
Bandwidth (threads=1): 24,301Mbit

which is a few percent, with about 5% extra cycles spent on the basic append userspace side and kernel side iovec handling that isn't necessary with send ring, and about 2.5% extra cycles on provided buffer handling for that case. So a net win of ~2.5%, which isn't nothing.

gxuu commented 2 months ago

I hesitate to use sendmsg in my development because, although sendmsg together with recv_multishot currently seems the better option, when the new features get into the kernel, changing the code from send and recv_multishot to bundled send and bundled recv_multishot will be easier than changing it from sendmsg and recv_multishot. Now, since bundled recvmsg_multishot and bundled sendmsg aren't even implemented, I would rather put my bets on bundled send and bundled recv_multishot, which is indeed faster (according to your tweets on 100G TCP throughput :) than sendmsg + bundled recv.

BTW, in your graph (the 100G throughput one), does "send + recv bundles" refer to both being bundled, or to mundane send + bundled recv? Also, why is "send / recv multishot" much slower than "sendmsg / recvmsg multishot"? Is it because of the send ordering?

BTW again, we can indeed use multishot recv together with the bundle feature, right? And we won't have multishot send, only bundled send, is this correct?

Thanks

gxuu commented 2 months ago

Will we have bundled send together with the zero copy feature? We don't have that now and proxy.c disables it. What does zero copy actually mean? Copying from the kernel side?

axboe commented 2 months ago

BTW, in your graph (the 100G throughput one), does "send + recv bundles" refer to both being bundled, or to mundane send + bundled recv? Also, why is "send / recv multishot" much slower than "sendmsg / recvmsg multishot"? Is it because of the send ordering?

Because send without a send ring (or using the sendmsg append) needs to be serialized, that's why it's much slower.

axboe commented 2 months ago

BTW again, we can indeed use multishot recv together with the bundle feature, right? And we won't have multishot send, only bundled send, is this correct?

As mentioned several times, there is no way to support multishot send. The initial send bundle patch used that name, but that was a bad idea. Multishot recv makes sense as new data coming in will trigger a retry, but you have no such condition on the send side. That's why they are named bundles. They will send a bundle of what is there now, and then terminate after that.

The receive and send sides are totally separate, you can use whatever you want on the recv side independently of what you use on the send side. You can use recv multishot with send bundles, for example, or with just a send ring. It doesn't matter, they are two separate things and managed on a per-SQE basis.

axboe commented 2 months ago

Will we have bundled send together with the zero copy feature? We don't have that now and proxy.c disables it. What does zero copy actually mean? Copying from the kernel side?

Zero copy send means that rather than doing what a normal send does (copy data from the application into the socket buffer), it'll send directly from the application buffer. I did not implement bundle/ring support for send zerocopy, haven't looked into it yet (whether it makes sense, and to what degree).

Send zerocopy can work well with larger buffer sizes, which offset the cost of the mappings needed to do zero copy sends. 6.10 will have some improvements there, bringing the crossover point to something like 3-4k send packet sizes. Below that size, non-zerocopy send will most likely be faster; at or above that size, send zerocopy will be faster. This all depends on the system and setup, those are just rough guidelines.
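For reference, a plain (non-bundle) zero copy send today is prepared with io_uring_prep_send_zc() and posts two CQEs, the send result and a later notification that the buffer may be reused (a minimal sketch; sockfd/buf/len are placeholders):

struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
/* first CQE carries the send result with IORING_CQE_F_MORE set; a second
 * CQE with IORING_CQE_F_NOTIF follows once the buffer is safe to reuse */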

gxuu commented 2 months ago

Thanks. Perhaps I will stick to send and multishot recv then, and benefit from the performance gains of bundles in future kernel releases. This seems to be the easiest choice from the current point of view.

pyhd commented 2 months ago

@axboe

That said, I did recently implement provided buffer support for sends, as it provides a way to serialize sends (and other things, like making sends more efficient as you can pack multiple buffers into a single send). But for write itself, not sure I see the use case. It'd obviously be trivial to add.

I think serialization may still be useful for write(), which in theory could return EAGAIN. Provided the N write operations are executed in order, userspace could simply flag the first N-1 SQEs with IOSQE_CQE_SKIP_SUCCESS, which means only the last CQE of the batch is posted.
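As a sketch, one way to get that in-order execution would be to chain the writes with IOSQE_IO_LINK and flag all but the last with IOSQE_CQE_SKIP_SUCCESS (bufs/lens/offs are hypothetical application arrays):

for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_write(sqe, fd, bufs[i], lens[i], offs[i]);
        if (i != n - 1)
                sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
}
io_uring_submit(ring);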