Closed lewissbaker closed 3 weeks ago
Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?
That's a bad option
While the documentation does not specify whether the time is relative or not, looking at the implementation, the io_cqring_wait() function seems to be adding the current kernel time to the value passed.
right, it's relative
Would it be possible to add support for passing an absolute timeout time to the io_uring_enter2() syscall or to the io_uring_wait_cqe_timeout() or io_uring_submit_and_wait_timeout() functions?
I can take a look, it's easy to add a flag telling whether it's relative or not
io_uring_wait_cqe_timeout() or io_uring_submit_and_wait_timeout() functions? Ideally, with the ability to specify which clock to use (e.g. CLOCK_BOOTTIME or CLOCK_MONOTONIC).
This one might be more complicated to fit in. Maybe it should be a ring-global option set separately via the io_uring register syscall, i.e. if you request the waiting syscall timeouts to be in abs mode, then we'll use that registered-beforehand value to decide what clock mode it should use. I can't imagine that an app would be switching between abs modes at runtime.
Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?
That's a bad option
Actually, after re-reading your use case I take it back, that's what OP_TIMEOUT is there for. You also have multishot timeouts if that works for you: you queue just one request and it'll produce a CQE each time the required interval passes.
The wait argument might be faster though in some cases, so the question is what the performance looks like in your app when comparing the two options (while simulating absolute timeouts through relative ones)?
Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?
That's a bad option
Actually, after re-reading your use case I take it back, that's what OP_TIMEOUT is there for.
If I queue an OP_TIMEOUT with a count of 1 and an absolute due-time and then call io_uring_enter(), am I guaranteed that when the io_uring_enter() call returns the CQE for the OP_TIMEOUT operation will be present in the completion-queue?
You also have multishot timeouts if that works with you, you queue just one request and it'll produce a CQE each time the required interval passes.
A multishot timeout doesn't work for my use-case. I have a whole bunch of single-shot tasks that need to be executed at specific times. I compute the earliest such time and when I don't have anything else to do I want to wait for events until that time arrives. Once the task(s) scheduled for that time have been executed I look at the next earliest time in the queue and, when idle, I want to wait for events until that next earliest time arrives. The times are not necessarily periodic.
The wait argument might be faster though in some cases, so the question is what the performance looks like in your app comparing two options (while simulating abs through relative modes)?
I'll try to get some measurements for you.
Actually, after re-reading your use case I take it back, that's what OP_TIMEOUT is there for.
If I queue an OP_TIMEOUT with a count of 1 and an absolute due-time and then call io_uring_enter(), am I guaranteed that when the io_uring_enter() call returns the CQE for the OP_TIMEOUT operation will be present in the completion-queue?
Please don't, count is unofficially deprecated, there is no way to use it reliably, and there are all sorts of problems with it.
With that said, I don't see what you want to achieve by using it. OP_TIMEOUT is a normal request; once completed it'll post a CQE. If you wait(nr=1), once the timeout completes you have enough CQEs to satisfy the wait condition, which will force the syscall to return back to user space.
IIRC there was a hack breaking the waiting loop if there is at least one timeout completed regardless of the nr you pass to waiting, but I need to double check and it's probably unreliable.
With that said, I don't see what you want to achieve by using it. OP_TIMEOUT is a normal request, once completed it'll post a CQE.
One case I have in mind is where I currently have an earliest scheduled time that is, say, 5s in the future (T+5) and I don't have anything else to do until either an I/O completes or that time 5s in the future arrives, so I am blocked in io_uring_enter() waiting for at least one CQE.
Then an I/O completion-event arrives, say at T+0.5, and io_uring_enter() returns. I process the completion-event and it maybe issues some more I/O requests but also schedules a new task to run at T+2. This means the new earliest time is now earlier than the previous OP_TIMEOUT request. So I need to cancel the old OP_TIMEOUT by submitting a new OP_TIMEOUT_REMOVE request and then issue a new OP_TIMEOUT request with the new time, presumably with the IOSQE_IO_HARDLINK flag so that I can reuse the same user_data value.
But since the OP_TIMEOUT_REMOVE request is likely (guaranteed?) to complete synchronously, the next io_uring_enter() will immediately put a CQE for the OP_TIMEOUT_REMOVE op and the existing OP_TIMEOUT request into the completion-queue and return. I then need to process these CQEs and call io_uring_enter() again to actually start waiting on the new OP_TIMEOUT time.

So this ends up needing two io_uring_enter() syscalls to set a new due time and wait.
The goal of setting the count to be 1 on the OP_TIMEOUT operation was to have that OP_TIMEOUT implicitly cancelled and completed when io_uring_enter() returns early because a completion-event arrives. This would allow me to issue a new OP_TIMEOUT with a new due-time on the next call to io_uring_enter() and then wait for that new timeout with a single syscall.
I only just noticed that OP_TIMEOUT_REMOVE has an optional IORING_TIMEOUT_UPDATE flag that can be used to update the existing OP_TIMEOUT rather than having to cancel the existing OP_TIMEOUT and issue a new one. I will look at trying to use this in conjunction with IOSQE_CQE_SKIP_SUCCESS to avoid immediately posting a CQE whenever the timeout is updated.
Regardless, I think having the ability to pass an absolute time into io_uring_getevents_arg would be a better fit for this usage pattern. It would avoid the complexities needed to manage submitting, updating, and cancelling OP_TIMEOUT operations, with the inherent races involved (e.g. an update to a later time might fail with -EBUSY, requiring issuing a new OP_TIMEOUT) and the care needed to manage lifetimes of the __kernel_timespec objects needed to issue successive updates in the presence of IORING_SETUP_SQPOLL, where submission is asynchronous (I can't reuse the same __kernel_timespec object to issue a subsequent update until the SQE for the last OP_TIMEOUT[_REMOVE] has been consumed).
For an example of some of the challenges using OP_TIMEOUT, see #1164
The goal of setting the count to be 1 on the OP_TIMEOUT operation was to have that OP_TIMEOUT implicitly cancelled and completed when io_uring_enter() returns early because a completion-event arrives. This would allow me to issue a new OP_TIMEOUT with a new due-time on the next call to io_uring_enter() and then wait for that new timeout with a single syscall.
That's an interesting use case, but still please avoid using that count feature. Unfortunately, it's racy, and the timeout might get stuck.
I only just noticed that OP_TIMEOUT_REMOVE has an optional IORING_TIMEOUT_UPDATE flag that can be used to update the existing OP_TIMEOUT rather than having to cancel the existing OP_TIMEOUT and issue a new one. I will look at trying to use this in conjunction with IOSQE_CQE_SKIP_SUCCESS to avoid immediately posting a CQE whenever the timeout is updated.
Not sure how SKIP_SUCCESS fits here, the updated request is not going to post a completion until the new timeout expires, assuming the update went well. The update/remove request will post a CQE though, when it's either completed updating or failed to find the timeout request.
Regardless, I think having the ability to pass an absolute time into io_uring_getevents_arg would be a better fit for this usage pattern. It would avoid the complexities needed to manage submitting, updating, and cancelling OP_TIMEOUT operations, with the inherent races involved (e.g. an update to a later time might fail with -EBUSY, requiring issuing a new OP_TIMEOUT) and the care needed to manage lifetimes of the __kernel_timespec objects needed to issue successive updates in the presence of IORING_SETUP_SQPOLL, where submission is asynchronous (I can't reuse the same __kernel_timespec object to issue a subsequent update until the SQE for the last OP_TIMEOUT[_REMOVE] has been consumed).
Fair enough, I'll take a look
I only just noticed that OP_TIMEOUT_REMOVE has an optional IORING_TIMEOUT_UPDATE flag that can be used to update the existing OP_TIMEOUT rather than having to cancel the existing OP_TIMEOUT and issue a new one. I will look at trying to use this in conjunction with IOSQE_CQE_SKIP_SUCCESS to avoid immediately posting a CQE whenever the timeout is updated.

Not sure how SKIP_SUCCESS fits here, the updated request is not going to post a completion until the new timeout expires, assuming the update went well. The update/remove request will post a CQE though, when it's either completed updating or failed to find the timeout request.
The idea was to apply the SKIP_SUCCESS flag to the update request so that it would not immediately post a CQE, which would cause the io_uring_enter() call to return immediately to process that CQE and force a second io_uring_enter() call to actually put the thread to sleep until the due time.
As you say, it would still post a CQE in the case that the update failed, but in this case you probably wanted to wake up due to the timer elapsing anyway.
Merged, should be in 6.12 when it comes out
https://lore.kernel.org/io-uring/cover.1723039801.git.asml.silence@gmail.com/
I have been working on implementing an io_uring-based execution context with support for timers where I manage a priority-queue of user-provided timers and compute the earliest due time at which I have work scheduled to run.

Ideally, I would like to be able to call io_uring_enter2() and have that block until either I have a completion-event to process or the earliest due time has elapsed.

Currently, the io_uring_getevents_arg structure seems to require passing a relative time. While the documentation does not specify whether the time is relative or not, looking at the implementation, the io_cqring_wait() function seems to be adding the current kernel time to the value passed.

While I can convert the absolute time I have to a relative time by calling clock_gettime() just before calling io_uring_enter2(), this approach has a couple of limitations. If there is a delay between the calls to clock_gettime() and io_uring_enter2() then the computed relative timeout can be an over-estimate and can result in additional delay to the io_uring_enter2() call returning.

Would it be possible to add support for passing an absolute timeout time to the io_uring_enter2() syscall or to the io_uring_wait_cqe_timeout() or io_uring_submit_and_wait_timeout() functions? Ideally, with the ability to specify which clock to use (e.g. CLOCK_BOOTTIME or CLOCK_MONOTONIC).

Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?
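For completeness, the workaround mentioned above can be sketched in plain C (ns_until and rel_timeout are illustrative helper names): sample the clock immediately before waiting and convert the stored absolute due-time into the relative value the current interface expects.

```c
#include <stdint.h>
#include <time.h>

/* Nanoseconds from *now until *deadline (negative if already passed). */
static int64_t ns_until(const struct timespec *deadline,
                        const struct timespec *now)
{
    return (int64_t)(deadline->tv_sec - now->tv_sec) * 1000000000
         + (deadline->tv_nsec - now->tv_nsec);
}

/* Relative timeout for the wait call, clamped to zero. Any delay between
 * sampling the clock and entering the kernel makes this an over-estimate,
 * which is the limitation described above. */
static struct timespec rel_timeout(const struct timespec *deadline,
                                   const struct timespec *now)
{
    struct timespec rel = { 0, 0 };
    int64_t ns = ns_until(deadline, now);
    if (ns > 0) {
        rel.tv_sec = (time_t)(ns / 1000000000);
        rel.tv_nsec = ns % 1000000000;
    }
    return rel;
}
```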