axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Add support for absolute timeouts to io_uring_getevents_arg-based io_uring_enter calls #1162

Closed: lewissbaker closed this issue 3 weeks ago

lewissbaker commented 4 months ago

I have been working on implementing an io_uring-based execution context with support for timers where I manage a priority-queue of user-provided timers and compute the earliest due time at which I have work scheduled to run.

Ideally, I would like to be able to call io_uring_enter2() and have that block until either I have a completion-event to process or the earliest due time has elapsed.

Currently, the io_uring_getevents_arg structure seems to require passing a relative time.

While the documentation does not specify whether the time is relative or not, looking at the implementation, the io_cqring_wait() function seems to be adding the current kernel time to the value passed.

While I can convert the absolute time I have to a relative time by calling clock_gettime() just before calling io_uring_enter2(), this approach has a couple of limitations.

  1. It seems like additional overhead - both user-space and the kernel need to ask for the current time to convert to/from a relative time.
  2. It can also be less accurate than passing an absolute time - if the thread's time-slice ends between the calls to clock_gettime() and io_uring_enter2() then the computed relative timeout can be an over-estimate and can result in additional delay to the io_uring_enter2() call returning.

Would it be possible to add support for passing an absolute timeout time to the io_uring_enter2() syscall or to the io_uring_wait_cqe_timeout() or io_uring_submit_and_wait_timeout() functions? Ideally, with the ability to specify which clock to use (e.g. CLOCK_BOOTTIME or CLOCK_MONOTONIC).

Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?

isilence commented 4 months ago

Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?

That's a bad option

While the documentation does not specify whether the time is relative or not, looking at the implementation, the io_cqring_wait() function seems to be adding the current kernel time to the value passed.

right, it's relative

Would it be possible to add support for passing an absolute timeout time to the io_uring_enter2() syscall or to the io_uring_wait_cqe_timeout() or io_uring_submit_and_wait_timeout() functions?

I can take a look, it's easy to add a flag telling whether it's relative or not

io_uring_wait_cqe_timeout() or io_uring_submit_and_wait_timeout() functions? Ideally, with the ability to specify which clock to use (e.g. CLOCK_BOOTTIME or CLOCK_MONOTONIC).

This one might be more complicated to fit in. Maybe it should be a ring-global option set separately via the io_uring register syscall, i.e. if you request the waiting syscall's timeouts to be in abs mode, then we'll use the value registered beforehand to decide what clock mode it should use. I can't imagine that an app would be switching b/w abs modes at runtime.

isilence commented 4 months ago

Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?

That's a bad option

Actually, after re-reading your use case I take it back, that's what OP_TIMEOUT is there for. You also have multishot timeouts if that works for you: you queue just one request and it'll produce a CQE each time the required interval passes.

The wait argument might be faster in some cases, though, so the question is what the performance looks like in your app when comparing the two options (while simulating abs mode through relative timeouts)?

lewissbaker commented 4 months ago

Or am I better off trying to use the IORING_OP_TIMEOUT op-code for this use-case?

That's a bad option

Actually, after re reading your use case I take it back, that's what OP_TIMEOUT is there for.

If I queue an OP_TIMEOUT with a count of 1 and an absolute due-time and then call io_uring_enter(), am I guaranteed that when the io_uring_enter() call returns that the CQE for the OP_TIMEOUT operation will be present in the completion-queue?

You also have multishot timeouts if that works with you, you queue just one request and it'll produce a CQE each time the required interval passes.

A multishot timeout doesn't work for my use-case. I have a whole bunch of single-shot tasks that need to be executed at specific times. I compute the earliest such time and when I don't have anything else to do I want to wait for events until that time arrives. Once the task(s) scheduled for that time have been executed I look at the next earliest time in the queue and, when idle, I want to wait for events until that next earliest time arrives. The times are not necessarily periodic.

The wait argument might be faster though in some cases, so the question is what the performance looks like in your app comparing two options (while simulating abs through relative modes)?

I'll try to get some measurements for you.

isilence commented 4 months ago

Actually, after re reading your use case I take it back, that's what OP_TIMEOUT is there for.

If I queue an OP_TIMEOUT with a count of 1 and an absolute due-time and then call io_uring_enter(), am I guaranteed that when the io_uring_enter() call returns that the CQE for the OP_TIMEOUT operation will be present in the completion-queue?

Please don't, count is unofficially deprecated, there is no way to use it reliably and there are all sorts of problems with it.

With that said, I don't see what you want to achieve by using it. OP_TIMEOUT is a normal request, once completed it'll post a CQE. If you wait(nr=1), once the timeout completes you have enough CQEs to satisfy the wait condition, which will force the syscall to return back to user space.

IIRC there was a hack breaking the waiting loop if there is at least one timeout completed regardless of the nr you pass to waiting, but I need to double check and it's probably unreliable.

lewissbaker commented 4 months ago

With that said, I don't see what you want to achieve by using it. OP_TIMEOUT is a normal request, once completed it'll post a CQE.

One case I have in mind is where I currently have an earliest scheduled time that is, say, 5s in the future (T+5) and I don't have anything else to do until either an I/O completes or that time 5s in the future arrives, so I am blocked in io_uring_enter() waiting for at least one CQE.

Then an I/O completion-event arrives, say at T+0.5, and io_uring_enter() returns. I process the completion-event, which may issue some more I/O requests but also schedules a new task to run at T+2. This means the new earliest time is now earlier than the time of the previous OP_TIMEOUT request. So I need to cancel the old OP_TIMEOUT by submitting a new OP_TIMEOUT_REMOVE request and then issue a new OP_TIMEOUT request with the new time, presumably with the IOSQE_IO_HARDLINK flag so that I can reuse the same user_data value.

But since the OP_TIMEOUT_REMOVE request is likely (guaranteed?) to complete synchronously, the next io_uring_enter() will immediately put CQEs for the OP_TIMEOUT_REMOVE op and the existing OP_TIMEOUT request into the completion-queue and return. I then need to process these CQEs and call io_uring_enter() again to actually start waiting on the new OP_TIMEOUT time.

So this ends up needing two io_uring_enter() syscalls to set a new due time and wait.

The goal of setting the count to be 1 on the OP_TIMEOUT operation was to have that OP_TIMEOUT implicitly cancelled and completed when io_uring_enter() returns early because a completion-event arrives. This would allow me to issue a new OP_TIMEOUT with a new due-time on the next call to io_uring_enter and then wait for that new timeout with a single syscall.

I only just noticed that OP_TIMEOUT_REMOVE has an optional IORING_TIMEOUT_UPDATE flag that can be used to update the existing OP_TIMEOUT rather than having to cancel the existing OP_TIMEOUT and issue a new one. I will look at trying to use this in conjunction with IOSQE_CQE_SKIP_SUCCESS to avoid immediately posting a CQE whenever the timeout is updated.

Regardless, I think having the ability to pass an absolute time into io_uring_getevents_arg would be a better fit for this usage pattern. It would avoid the complexity of submitting, updating, and cancelling OP_TIMEOUT operations, with the inherent races involved (e.g. an update to a later time might fail with -EBUSY, requiring a new OP_TIMEOUT to be issued), and the care needed to manage the lifetimes of the __kernel_timespec objects used for successive updates under IORING_SETUP_SQPOLL, where submission is asynchronous (I can't reuse the same __kernel_timespec object for a subsequent update until the SQE for the last OP_TIMEOUT[_REMOVE] has been consumed).

lewissbaker commented 4 months ago

For an example of some of the challenges using OP_TIMEOUT, see #1164

isilence commented 4 months ago

The goal of setting the count to be 1 on the OP_TIMEOUT operation was to have that OP_TIMEOUT implicitly cancelled and completed when io_uring_enter() returns early because a completion-event arrives. This would allow me to issue a new OP_TIMEOUT with a new due-time on the next call to io_uring_enter and then wait for that new timeout with a single syscall.

That's an interesting use case, but still please avoid using that count feature. Unfortunately, it's racy, and the timeout might get stuck.

I only just noticed that OP_TIMEOUT_REMOVE has an optional IORING_TIMEOUT_UPDATE flag that can be used to update the existing OP_TIMEOUT rather than having to cancel the existing OP_TIMEOUT and issue a new one. I will look at trying to use this in conjunction with IOSQE_CQE_SKIP_SUCCESS to avoid immediately posting a CQE whenever the timeout is updated.

Not sure how SKIP_SUCCESS fits here; the updated request is not going to post a completion until the new timeout expires, assuming the update went well. The update/remove request will post a CQE though, when it either completes the update or fails to find the timeout request.

Regardless, I think having the ability to pass an absolute time into io_uring_getevents_arg would be a better fit for this usage pattern. It would avoid the complexity of submitting, updating, and cancelling OP_TIMEOUT operations, with the inherent races involved (e.g. an update to a later time might fail with -EBUSY, requiring a new OP_TIMEOUT to be issued), and the care needed to manage the lifetimes of the __kernel_timespec objects used for successive updates under IORING_SETUP_SQPOLL, where submission is asynchronous (I can't reuse the same __kernel_timespec object for a subsequent update until the SQE for the last OP_TIMEOUT[_REMOVE] has been consumed).

Fair enough, I'll take a look

lewissbaker commented 3 months ago

I only just noticed that OP_TIMEOUT_REMOVE has an optional IORING_TIMEOUT_UPDATE flag that can be used to update the existing OP_TIMEOUT rather than having to cancel the existing OP_TIMEOUT and issue a new one. I will look at trying to use this in conjunction with IOSQE_CQE_SKIP_SUCCESS to avoid immediately posting a CQE whenever the timeout is updated.

Not sure how SKIP_SUCCESS fits here; the updated request is not going to post a completion until the new timeout expires, assuming the update went well. The update/remove request will post a CQE though, when it either completes the update or fails to find the timeout request.

The idea was to apply the SKIP_SUCCESS flag to the update request so that it would not post a CQE, which would cause the io_uring_enter() call to return straight away to process that CQE and then require a second io_uring_enter() call to actually put the thread to sleep until the due time.

As you say, it would still post a CQE in the case that the update failed, but in this case you probably wanted to wake up due to the timer elapsing anyway.

isilence commented 3 weeks ago

Merged, should be in 6.12 when it comes out

https://lore.kernel.org/io-uring/cover.1723039801.git.asml.silence@gmail.com/