axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

[Question] Polling latency of IORING_OP_POLL_ADD vs epoll_wait. #1105

Closed vsolontsov-ll closed 4 months ago

vsolontsov-ll commented 4 months ago

Hi,

Edit: An accurate test showed no issues with polling a file descriptor via io_uring vs epoll. The only difference is in busy-poll mode (io_uring_peek_cqe() or epoll_wait(0) in a tight loop), where epoll gains a few ticks.

Original text: We are at the very first and quite typical phase of migrating a system from epoll to io_uring: I just converted our reactor component to use io_uring with IORING_OP_POLL_ADD. Surprisingly, I found that the time from writing to a pipe (I tested with an eventfd too -- no difference) until waking up from io_uring_wait_cqes() is much higher than the equivalent for epoll_wait().

In other words, the producing thread takes a timestamp t0 right before writing to an fd. The consuming thread blocks on io_uring_wait_cqes(..., 1 sec) with a multishot POLL armed (or on epoll_wait()). Once it gets a CQE with EPOLLIN, it takes a timestamp t1 and reads from the fd. The (t1 - t0) deltas are collected.
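As a rough illustration (a sketch of the setup just described, not the actual benchmark code; names and error handling are simplified), the consumer side looks roughly like this:

```c
#include <liburing.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Sketch of the consumer: one multishot poll armed on the read end of the
 * pipe, then each wakeup from io_uring_wait_cqes() is timestamped and
 * compared against the producer's t0 read from the fd. */
static void consume(struct io_uring *ring, int rfd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct __kernel_timespec ts = { .tv_sec = 1 };

    io_uring_prep_poll_multishot(sqe, rfd, POLLIN);
    io_uring_submit(ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        struct timespec now;
        long t0;

        if (io_uring_wait_cqes(ring, &cqe, 1, &ts, NULL))
            continue;                       /* timed out, wait again */
        clock_gettime(CLOCK_MONOTONIC, &now);

        /* the producer wrote its own clock_gettime() value (nsec) as t0 */
        if (read(rfd, &t0, sizeof(t0)) == sizeof(t0)) {
            long delta = now.tv_sec * 1000000000L + now.tv_nsec - t0;
            (void)delta;                    /* collect the (t1 - t0) delta here */
        }
        io_uring_cqe_seen(ring, cqe);
    }
}
```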

Registering the file descriptor didn't help.

The average values in nsec are

So I'm wondering whether this is a bad scenario for io_uring by design, or whether it can be improved somehow. I'd be happy to provide a clean benchmark if needed.

NB: I always set IORING_SETUP_SINGLE_ISSUER and experimented with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN. IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG improves the busy-wait scenario while making the timed-wait one slightly worse.
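For reference, the ring setup looks roughly like this (a sketch; the exact queue depth and flag combination varied between runs):

```c
#include <liburing.h>

/* Sketch of the ring setup used for the experiments; the queue depth
 * (256 here) and the flag mix varied between runs. */
static int setup_ring(struct io_uring *ring)
{
    struct io_uring_params p = { 0 };

    p.flags = IORING_SETUP_SINGLE_ISSUER |
              IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
              /* optionally also IORING_SETUP_DEFER_TASKRUN */

    return io_uring_queue_init_params(256, ring, &p);
}
```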

axboe commented 4 months ago

Interesting! Should be easy to write a reproducer for this. Do you have one? Would save me the time. There's absolutely no reason why those two should differ, and 1700-2100 usec sounds pretty out there in terms of timing.

What kernel are you on?

vsolontsov-ll commented 4 months ago

> Interesting! Should be easy to write a reproducer for this. Do you have one? Would save me the time. There's absolutely no reason why those two should differ, and 1700-2100 usec sounds pretty out there in terms of timing.

I don't have a clean one which I can provide. I will prepare one on the weekend.

> What kernel are you on?

It's 6.1.

axboe commented 4 months ago

I just wrote my own, it was a 2 min job. From initial results, I see ~100 usec, which seems a lot more reasonable. That's with a thread writing a timestamp every 100 msec, and the main thread having a multishot poll armed and then doing a read(2) based on that trigger, dumping how long it took from when the thread wrote the timeval until we read it. So those 100 usec include the pipe write time and the pipe read time as well.
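In rough code, the writer side of that test is along these lines (just a sketch, not the actual test program):

```c
#include <sys/time.h>
#include <unistd.h>

/* Sketch of the writer thread: stamp the current time and write it into the
 * pipe roughly every 100 msec; the reader computes the delta on wakeup. */
static void *writer_thread(void *arg)
{
    int wfd = *(int *)arg;

    for (;;) {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        if (write(wfd, &tv, sizeof(tv)) != sizeof(tv))
            break;
        usleep(100000);     /* ~100 msec between samples */
    }
    return NULL;
}
```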

Since yours is an order of magnitude higher than that, that seems odd. I'll poke a bit... Maybe it's the wait_cqes part. I'll need to see your actual code as it seems fine here.

axboe commented 4 months ago

Same results using io_uring_wait_cqes(ring, &cqe, 1, &ts, NULL) with ts having tv_sec = 1.

axboe commented 4 months ago

I couldn't reproduce it -- as you can read above, the results don't show any difference and are much more in line with what I'd expect. Yours seem really slow. So a reproducer would be nice, but mostly because I suspect you're doing something wrong.

vsolontsov-ll commented 4 months ago

My bad -- my figures are in nanoseconds. I originally wrote the benchmark using the TSC but then moved to clock_gettime() and nanos. Sorry for the confusion, I can imagine your thoughts about 1 msec latency.

axboe commented 4 months ago

Ahhh nanoseconds makes a lot more sense! Here's my test case:

axboe@intel ~> ./pipe-lat-bench 1 100
Using epoll, wait usec 100
Avg wakeup time: 6
axboe@intel ~> ./pipe-lat-bench 0 100
Using io_uring, wait usec 100
Avg wakeup time: 6

As mentioned, my numbers include the pipe write time and pipe read time.

This is really a good use case for read multishot. Using io_uring as an epoll replacement, purely as a notification mechanism, isn't going to bring you any performance improvements -- it'll be about the same performance as epoll for that.
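For reference, here's a rough sketch of what read multishot could look like on the pipe fd (not from this thread; it assumes a kernel and liburing recent enough for IORING_OP_READ_MULTISHOT and provided buffer rings, i.e. newer than the 6.1 kernel mentioned above):

```c
#include <liburing.h>

#define BGID      1
#define BUF_COUNT 8
#define BUF_SIZE  4096

static char bufs[BUF_COUNT][BUF_SIZE];

static int read_multishot_loop(struct io_uring *ring, int fd)
{
    struct io_uring_buf_ring *br;
    struct io_uring_sqe *sqe;
    int i, ret;

    /* multishot read picks buffers from this provided-buffer group */
    br = io_uring_setup_buf_ring(ring, BUF_COUNT, BGID, 0, &ret);
    if (!br)
        return ret;
    for (i = 0; i < BUF_COUNT; i++)
        io_uring_buf_ring_add(br, bufs[i], BUF_SIZE, i,
                              io_uring_buf_ring_mask(BUF_COUNT), i);
    io_uring_buf_ring_advance(br, BUF_COUNT);

    /* one SQE -> a CQE for every chunk of data that arrives */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read_multishot(sqe, fd, 0, 0, BGID);
    io_uring_submit(ring);

    for (;;) {
        struct io_uring_cqe *cqe;

        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        if (cqe->res < 0)
            return cqe->res;

        int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        /* cqe->res bytes of data are in bufs[bid]; consume them here,
         * then hand the buffer back to the ring */
        io_uring_buf_ring_add(br, bufs[bid], BUF_SIZE, bid,
                              io_uring_buf_ring_mask(BUF_COUNT), 0);
        io_uring_buf_ring_advance(br, 1);

        if (!(cqe->flags & IORING_CQE_F_MORE)) {
            /* multishot stopped (e.g. out of buffers); re-arm if needed */
        }
        io_uring_cqe_seen(ring, cqe);
    }
}
```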

axboe commented 4 months ago

In any case, once you make a clean reproducer over the weekend, send it my way and I'll take a look and see what's going on.

vsolontsov-ll commented 4 months ago

Beg your pardon for the false alarm -- it looked obvious on my end.

So... The clean reproducer showed that the only potentially meaningful difference between io_uring and epoll latency, for basic polling of an FD, is in busy-poll mode. I think it's explainable -- to run the task work, io_uring needs either an interrupt or a syscall, and that's probably why the IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG flags look attractive in this mode. (io_uring in busy-poll is a great fit when the primary communication channels are shm rings but you still need to poll a number of FDs.)
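For reference, the two busy-poll loops being compared are essentially the following (a sketch; no blocking waits):

```c
#include <liburing.h>
#include <sys/epoll.h>

/* io_uring side: spin on the CQ ring. Completions only show up once the
 * pending task work has run, which normally takes an interrupt or a syscall. */
static struct io_uring_cqe *busy_poll_uring(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_peek_cqe(ring, &cqe) != 0)
        ;   /* spin */
    return cqe;
}

/* epoll side: spin with a zero timeout */
static void busy_poll_epoll(int epfd, struct epoll_event *ev)
{
    while (epoll_wait(epfd, ev, 1, 0) == 0)
        ;   /* spin */
}
```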

On my end that was:

Anyway, here is the benchmark: poll-bench-mt.c.gz

My local results (i9-12900F) are:

vsol@vsol-deb:~/repo/liburing/examples$ ./poll-bench-mt
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   4058    3773    4876    7336
io_uring busy-poll latency.         results. avg/med/%90/%99:   1988    1936    2110    2827
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   4282    3993    5324    7280
io_uring busy-poll latency.         results. avg/med/%90/%99:   1505    1479    1671    2446
===
epoll poll latency. timeout=1s.     results. avg/med/%90/%99:   4331    4035    5448    7575
epoll busy-poll latency.            results. avg/med/%90/%99:   1316    1294    1499    1986

Edit: And just for the record -- the same benchmark with the producer and consumer pinned to isolated "tickless" cores (i9-12900F, hyperthreading siblings turned off):

vsol@vsol-deb:~/repo/liburing/examples$ CORE_READER=4 CORE_WRITER=6 ./poll-bench-mt
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   1803    1768    1938    2501
io_uring busy-poll latency.         results. avg/med/%90/%99:   1050    1014    1197    1630
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   1940    1872    2233    2846
io_uring busy-poll latency.         results. avg/med/%90/%99:   814 782 968 1375
===
epoll poll latency. timeout=1s.     results. avg/med/%90/%99:   1689    1644    1828    2412
epoll busy-poll latency.            results. avg/med/%90/%99:   675 647 799 1248

vsolontsov-ll commented 4 months ago

I assume it can be closed now

axboe commented 4 months ago

Yep all good, thanks for providing the test, I'll try and run it here when time permits. One note - your DEFER_TASKRUN results aren't valid, it doesn't make sense to setup with COOP_TASKRUN and DEFER_TASKRUN. You really want the 2nd test to be COOP_TASKRUN | TASKRUN_FLAG, and the 3rd addition then to be DEFER_TASKRUN | SINGLE_ISSUER. One more note on the latter, it won't work straight up with just peek, as you need to process events. The busy poll loop would need an io_uring_get_events() call as well for DEFER_TASKRUN.
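In sketch form, the DEFER_TASKRUN busy-poll loop would need something like the following (handle() is just a placeholder name for the application's completion handling):

```c
#include <liburing.h>

void handle(struct io_uring_cqe *cqe);   /* placeholder */

static void busy_poll_defer_taskrun(struct io_uring *ring)
{
    for (;;) {
        struct io_uring_cqe *cqe;

        /* DEFER_TASKRUN: task work never runs implicitly, so flush it
         * explicitly or peeking will never observe new completions */
        io_uring_get_events(ring);
        if (io_uring_peek_cqe(ring, &cqe) == 0) {
            handle(cqe);
            io_uring_cqe_seen(ring, cqe);
        }
    }
}
```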

axboe commented 4 months ago

With those tweaks, here's what my AMD test box says (7763):

axboe@r7525 ~> ./poll-bench-mt                                                 0.166s
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   33452   33803   37420   39985
io_uring busy-poll latency.         results. avg/med/%90/%99:   1706    1493    2093    5891
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   32776   32672   37230   38823
io_uring busy-poll latency.         results. avg/med/%90/%99:   3348    1253    18441   89365
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   31663   32190   35527   40446
io_uring busy-poll latency.         results. avg/med/%90/%99:   1388    1222    1693    4689
===
epoll poll latency. timeout=1s.     results. avg/med/%90/%99:   32138   32131   35857   66975
epoll busy-poll latency.            results. avg/med/%90/%99:   1386    1243    1603    4899
axboe@r7525 ~> CORE_READER=0 CORE_WRITER=0 ./poll-bench-mt                    40.529s
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   9258    9277    10249   16090
io_uring busy-poll latency.         results. avg/med/%90/%99:   2301    1432    3015    57619
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   10734   9728    10329   66485
io_uring busy-poll latency.         results. avg/med/%90/%99:   2386    1393    2334    71283
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   9770    9708    10239   30928
io_uring busy-poll latency.         results. avg/med/%90/%99:   2277    1413    2274    64171
===
epoll poll latency. timeout=1s.     results. avg/med/%90/%99:   9956    9718    11081   34785
epoll busy-poll latency.            results. avg/med/%90/%99:   2241    1422    2275    61896

and here's what my intel (8458P) does:

axboe@intel ~> ./poll-bench-mt
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   92842   102937  107092  152077
io_uring busy-poll latency.         results. avg/med/%90/%99:   2527    2572    2643    3435
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   91478   101717  106511  132096
io_uring busy-poll latency.         results. avg/med/%90/%99:   2064    2040    2130    2921
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   98114   102265  107650  163254
io_uring busy-poll latency.         results. avg/med/%90/%99:   2060    2057    2120    2278
===
epoll poll latency. timeout=1s.     results. avg/med/%90/%99:   92528   101507  108178  251974
epoll busy-poll latency.            results. avg/med/%90/%99:   1951    1952    2006    2114
axboe@intel ~> CORE_READER=0 CORE_WRITER=0 ./poll-bench-mt
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   9486    9437    9811    10366
io_uring busy-poll latency.         results. avg/med/%90/%99:   1041    959     1055    7530
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   10094   9705    9948    40330
io_uring busy-poll latency.         results. avg/med/%90/%99:   1019    939     1047    6876
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s.  results. avg/med/%90/%99:   10088   9719    9872    36190
io_uring busy-poll latency.         results. avg/med/%90/%99:   1009    944     1033    6515
===
epoll poll latency. timeout=1s.     results. avg/med/%90/%99:   9578    9443    9670    26129
epoll busy-poll latency.            results. avg/med/%90/%99:   1108    1052    1137    5969

which is pretty close. Pipes aren't generally a good benchmark, unless you really are using pipes of course, as the wake side doesn't really mimic network workloads in general due to how wakeups are done. This is apparent in how just pinning both consumer and producer on the same CPU is obviously vastly faster.

Worth noting that even with pinning, the results don't seem that stable. So take the above with a grain of salt!

vsolontsov-ll commented 4 months ago

> it doesn't make sense to setup with COOP_TASKRUN and DEFER_TASKRUN

Ah, thanks, good to know! Just to clarify, does COOP_TASKRUN | TASKRUN_FLAG disable IPIs while still allowing the default processing of the task work in all syscalls? Is there a combination which lets me get the TASKRUN flag, so that I know when I need to do an io_uring_enter(IORING_ENTER_GETEVENTS), while keeping other syscalls clean of the extra work?

axboe commented 4 months ago

Yeah, you can add TASKRUN_FLAG with SINGLE_ISSUER|DEFER_TASKRUN and get that.
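A sketch of what that looks like (reading IORING_SQ_TASKRUN from the SQ ring flags directly; taskrun_pending()/reap_if_needed() are just illustrative names):

```c
#include <liburing.h>

/* With IORING_SETUP_TASKRUN_FLAG the kernel sets IORING_SQ_TASKRUN in the
 * SQ ring flags when task work is pending, so the app can check it cheaply
 * and only then pay for a syscall to process events. */
static inline int taskrun_pending(struct io_uring *ring)
{
    return IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_TASKRUN;
}

static void reap_if_needed(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    if (!taskrun_pending(ring))
        return;                  /* nothing pending: no syscall made */

    io_uring_get_events(ring);   /* io_uring_enter(GETEVENTS) runs the task work */
    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        /* process cqe ... */
        io_uring_cqe_seen(ring, cqe);
    }
}
```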

axboe commented 4 months ago

> Just to clarify, does COOP_TASKRUN | TASKRUN_FLAG disable IPIs while still allowing the default processing of the task work in all syscalls?

Yep

vsolontsov-ll commented 4 months ago

> Yeah, you can add TASKRUN_FLAG with SINGLE_ISSUER|DEFER_TASKRUN and get that.

and there will be no interrupts?

axboe commented 4 months ago

Just to clarify, in order of signaling:

vsolontsov-ll commented 4 months ago

> Pipes aren't generally a good benchmark, unless you really are using pipes of course, as the wake side doesn't really mimic network workloads in general due to how wakeups are done.

Thank you very much for your help. I'll find some time to explore the differences between descriptor types from this perspective. We need an IPC mechanism supported by io_uring which allows waking up a process from multiple others (a pipe is not an option anyway -- it was just convenient for a basic benchmark).

I know you added FUTEX_WAIT in 6.7, but our client won't get it in prod any time soon.