Would this be more useful as a setup flag? I would imagine it's mostly a per-app kind of thing, if you prefer to have everything submitted rather than stopping on submit errors. It could be IORING_SETUP_ENTER_ALL or something like that. That would also make it easier to retrofit into applications and liburing, as we don't need to change anything but the setup call.
Yeah a setup flag would work great for this.
Becomes really trivial then:
```diff
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8a53af8e5fe2..b92fbc1dc932 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6458,6 +6458,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 fail_req:
 			io_put_req(req);
 			io_req_complete(req, err);
+			if (ctx->flags & IORING_SETUP_ENTER_ALL)
+				continue;
 			break;
 		}
@@ -8613,7 +8615,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
 			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
-			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ))
+			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
+			IORING_SETUP_ENTER_ALL))
 		return -EINVAL;
 	return io_uring_create(entries, &p, params);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d65fde732518..9e830b760ffd 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -95,6 +95,7 @@ enum {
 #define IORING_SETUP_CQSIZE	(1U << 3)	/* app defines CQ size */
 #define IORING_SETUP_CLAMP	(1U << 4)	/* clamp SQ/CQ ring sizes */
 #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
+#define IORING_SETUP_ENTER_ALL	(1U << 6)	/* don't stop submit on error */

 enum {
 	IORING_OP_NOP,
```
I'll post the patch for 5.10 - I generally like to provide attribution for patches, whether for reporting bugs or for suggesting features. If you'd like me to include that, please tell me your real name and email and I'll add it.
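For illustration, here is a minimal application-side sketch of opting in at setup time. It assumes the proposed IORING_SETUP_ENTER_ALL flag from the patch above were already available in the uapi and liburing headers, which it is not yet; everything else uses liburing's existing io_uring_queue_init_params() API.

```c
/* Sketch only: IORING_SETUP_ENTER_ALL is the flag proposed in this thread,
 * not something shipped in the headers at the time of writing. */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { 0 };
	int ret;

	/* Proposed behavior: keep submitting remaining SQEs even if one fails. */
	p.flags = IORING_SETUP_ENTER_ALL;

	ret = io_uring_queue_init_params(64, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/* ... prepare SQEs and submit as usual ... */

	io_uring_queue_exit(&ring);
	return 0;
}
```

On a kernel without the flag, io_uring_queue_init_params() would simply fail with -EINVAL, matching the flag check in the setup path above.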
@hedgefund, what is your use case? Requests are usually expected not to fail, or to fail only rarely. Do you do a bunch of recvmsg()'s or something like that?
Totally agree on that - you're generally doing something wrong if random errors are expected. That said, it does mean you don't have to worry about submit looping if you are expecting rare errors, so it simplifies the user space code. And I'm always a big fan of that.
@isilence I erroneously had a bunch of recv() requests queued for the same TCP socket, with the intent of having the kernel fill my buffers while I'm processing already-received data. I had assumed it works like IOCP, where in a single-threaded situation all the queued recv() requests for the same socket complete in the order the data arrived on the socket (so the 5th recv() request could complete before the 1st in the CQ, but that's OK because all that matters is the data ordering for processing). But it turns out that with io_uring the requests complete out of order even with respect to the data ordering on the socket itself, so you can only ever have one outstanding recv() per socket if you want to process incoming data in order, and I have to revise my plan.
BTW, is there a way to achieve that? That is, having recv() requests always pending, so that while I'm processing previously received data the kernel is already filling my next buffer. On Windows with IOCP you simply queue a bunch of WSARecv() calls on the same socket and, assuming you are only reaping completions from a single thread, they complete in the order in which the data arrived.
I'm aware of the IOSQE_IO_LINK flag, but in this situation it's not a good fit: while I could use it on the initial submission of buffers right after the socket is connected (say, 8 recv() submissions all linked together), and that would achieve what I want for that specific batch, the problem is that I'm always re-posting buffers that completed while data is being received, and those would be out of order with respect to the initial batch of posted buffers.
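For readers unfamiliar with linked requests, a rough liburing-based sketch of the "initial linked batch" described above follows; the socket fd name, buffer size, and the batch size of 8 are illustrative assumptions. The point is that buffers re-posted later cannot join this chain.

```c
/* Sketch, assuming an already-connected TCP socket `sockfd` and liburing.
 * Links 8 recv() SQEs so they complete in submission order; buffers that
 * are re-posted after completions would fall outside this chain. */
#include <liburing.h>
#include <stdint.h>

#define NBUFS	8
#define BUFSZ	4096

static char bufs[NBUFS][BUFSZ];

static int post_linked_recvs(struct io_uring *ring, int sockfd)
{
	for (int i = 0; i < NBUFS; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -1;	/* SQ ring full */
		io_uring_prep_recv(sqe, sockfd, bufs[i], BUFSZ, 0);
		/* remember which buffer this completion belongs to */
		io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i);
		if (i != NBUFS - 1)
			sqe->flags |= IOSQE_IO_LINK;	/* chain to the next recv */
	}
	return io_uring_submit(ring);
}
```

Keep in mind that if a request in a link chain fails, the rest of the chain is cancelled, which is another reason linking is an awkward fit for continuously re-posted buffers.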
> @isilence I erroneously had a bunch of recv() requests queued for the same TCP socket, with the intent of having the kernel fill my buffers while I'm processing already-received data. I had assumed it works like IOCP, where in a single-threaded situation all the queued recv() requests for the same socket complete in the order the data arrived on the socket (so the 5th recv() request could complete before the 1st in the CQ, but that's OK because all that matters is the data ordering for processing). But it turns out that with io_uring the requests complete out of order even with respect to the data ordering on the socket itself, so you can only ever have one outstanding recv() per socket if you want to process incoming data in order, and I have to revise my plan.
I see. Do we need the feature then or not?
> BTW, is there a way to achieve that? That is, having recv() requests always pending, so that while I'm processing previously received data the kernel is already filling my next buffer. On Windows with IOCP you simply queue a bunch of WSARecv() calls on the same socket and, assuming you are only reaping completions from a single thread, they complete in the order in which the data arrived.
Not currently with io_uring. Even if there is some magic in the TCP stack for that, it'd be hard to use because io_uring may reorder requests. Sounds like a feature request. Would you mind sending a proposal to io-uring at vger.kernel.org?
> I'm aware of the IOSQE_IO_LINK flag, but in this situation it's not a good fit: while I could use it on the initial submission of buffers right after the socket is connected (say, 8 recv() submissions all linked together), and that would achieve what I want for that specific batch, the problem is that I'm always re-posting buffers that completed while data is being received, and those would be out of order with respect to the initial batch of posted buffers.
> I see. Do we need the feature then or not?
I think this feature is useful regardless of my own issues; it simplifies the client API, and it makes more sense IMO in a large, fully-async application. Usually in these situations the place that called io_uring_enter can't do anything with the information that only some of the SQEs were submitted because of an error; it will just call io_uring_enter again until they are all submitted, and let the submitters of the SQEs handle the errors as they see fit.

Concrete example from my experiments with io_uring: when you submit a request to the I/O abstraction layer you provide a callback function/lambda, the io_uring abstraction puts the callback pointer inside user_data, and when it reaps completions it simply calls the callback stored in user_data for each CQE, with the res of that completion as an argument. Inside the I/O layer that wraps io_uring you have no idea who submitted something, what kind of request they submitted, or how they plan to handle their errors. This is a popular way to design async I/O systems, and in that situation it makes no sense to stop submissions because of errors.
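A minimal sketch of that callback-in-user_data pattern, assuming liburing; the struct and function names below are illustrative, not taken from the poster's code.

```c
/* Sketch of a completion-dispatch loop that stores a callback pointer in
 * user_data and invokes it with the CQE result. Names are illustrative. */
#include <liburing.h>

struct io_op {
	void (*on_complete)(struct io_op *op, int res);
	/* per-request state lives here */
};

static void queue_read(struct io_uring *ring, struct io_op *op,
		       int fd, void *buf, unsigned len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return;			/* SQ ring full; real code would handle this */
	io_uring_prep_read(sqe, fd, buf, len, 0);
	io_uring_sqe_set_data(sqe, op);	/* callback travels in user_data */
}

static void reap_completions(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;

	io_uring_for_each_cqe(ring, head, cqe) {
		struct io_op *op = io_uring_cqe_get_data(cqe);

		/* The dispatcher doesn't know or care what the request was;
		 * the submitter's callback handles success or error. */
		op->on_complete(op, cqe->res);
		seen++;
	}
	io_uring_cq_advance(ring, seen);
}
```

The dispatch loop stays completely generic: only the per-request callback knows what a negative res means for that particular request.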
> Not currently with io_uring. Even if there is some magic in the TCP stack for that, it'd be hard to use because io_uring may reorder requests. Sounds like a feature request. Would you mind sending a proposal to io-uring at vger.kernel.org?
Can you give me some guidance on what I need to include/detail in such a proposal? I also had one more feature request that has to do with registering/unregistering buffers - in short, something like IORING_OP_FILES_UPDATE, but for registered buffers.
@hedgefund, I'm a bit late, my apologies.
Nothing formal, of course - it's just to gather a wider audience, and that's where kernel development happens. Just copy-paste what you already have here and add [RFC] to the subject. If the proposals are unrelated, it'd be great if you could send them separately. All ideas, suggestions, proposals and so on are very welcome!
> I'll post the patch for 5.10
Did this make it in, perhaps with a different flag name?
I'd certainly use this feature if it does get added. My application is a language runtime, so it's exactly the use case that the original reporter had: central dispatch in the runtime system, with I/O errors handled upon I/O completion by the individual application-level use sites.
Same here - it would have avoided #309
The purpose of this flag is to submit all the provided SQEs without stopping on the first SQE-related error when you call io_uring_enter. The per-SQE error-reporting mechanism is asynchronous anyway, so there is no point in stopping submission: in a large async application, the place to handle an error for an SQE is not where io_uring_enter is called; the user of the ring which queued the SQE will handle the error, and io_uring_enter is usually called in a central place that is basically the I/O port/ring/loop abstraction.

Right now, what you have to do is repeatedly call io_uring_enter in a loop until it submits all your SQEs, every time it stops because of an SQE error, because you want to handle the errors at the point where you queued the SQE, which is usually inside the code of one of the users of the ring. This flag would basically compress all those wasted syscalls into one.
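For context, a rough sketch of that resubmission loop, assuming liburing; without the proposed flag, a per-SQE failure can make the kernel stop consuming the SQ early, so the application keeps submitting until the ring is drained. The helper name is made up for illustration.

```c
/* Sketch (hypothetical helper): without the proposed flag, a failed SQE can
 * stop submission early, so keep submitting until the SQ ring is drained. */
#include <liburing.h>

static int submit_all(struct io_uring *ring)
{
	int total = 0;

	while (io_uring_sq_ready(ring)) {
		int ret = io_uring_submit(ring);

		if (ret < 0)
			return ret;	/* ring-level failure, not a per-SQE error */
		if (ret == 0)
			break;		/* nothing consumed; avoid spinning */
		total += ret;
	}
	return total;	/* per-SQE errors arrive asynchronously as CQEs */
}
```

With the proposed flag, this loop would collapse into a single io_uring_submit() call, and every per-SQE error would still show up as a CQE.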