Closed vsolontsov-ll closed 4 months ago
Interesting! Should be easy to write a reproducer for this. Do you have one? Would save me the time. There's absolutely no reason why those two should differ, and 1700-2100 usec sounds pretty out there in terms of timing.
What kernel are you on?
Interesting! Should be easy to write a reproducer for this. Do you have one? Would save me the time. There's absolutely no reason why those two should differ, and 1700-2100 usec sounds pretty out there in terms of timing.
I don't have a clean one which I can provide. I will prepare one on the weekend.
What kernel are you on?
It's 6.1.
I just wrote my own, it was a 2 min job. From initial results, I see ~100 usec, which seems a lot more reasonable. That's with a thread writing a timestamp every 100 msec, and the main thread having a multishot poll and then doing a read(2) based on that trigger and dumping how long the time was since the thread wrote the timeval and until we read it. So that includes the pipe write time, and the pipe read time as well in those 100 usec.
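For reference, a minimal sketch of that kind of harness (not the actual test used above; the file name, iteration count and interval are just illustrative), assuming liburing with multishot poll support:

```c
/* pipe-lat-sketch.c: a producer thread writes a timestamp into a pipe every
 * 100 msec; the main thread arms a multishot poll, reads on every completion
 * and measures the delay from write to read. No error handling. */
#include <liburing.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int pfd[2];

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *producer(void *arg)
{
	for (int i = 0; i < 100; i++) {
		long long t0 = now_ns();

		write(pfd[1], &t0, sizeof(t0));
		usleep(100000);		/* 100 msec between wakeups */
	}
	return NULL;
}

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	long long t0, total = 0;
	pthread_t thread;

	pipe(pfd);
	io_uring_queue_init(8, &ring, 0);

	/* one multishot poll posts a CQE for every POLLIN wakeup */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_multishot(sqe, pfd[0], POLLIN);
	io_uring_submit(&ring);

	pthread_create(&thread, NULL, producer, NULL);

	for (int i = 0; i < 100; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		read(pfd[0], &t0, sizeof(t0));
		total += now_ns() - t0;
		io_uring_cqe_seen(&ring, cqe);
	}
	pthread_join(thread, NULL);
	printf("avg wakeup latency: %lld nsec\n", total / 100);
	io_uring_queue_exit(&ring);
	return 0;
}
```

Build with something like gcc -O2 pipe-lat-sketch.c -luring -lpthread.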
Since yours is an order of magnitude higher than that, that seems odd. I'll poke a bit... Maybe it's the wait_cqes part. I'll need to see your actual code as it seems fine here.
Same results using io_uring_wait_cqes(ring, &cqe, 1, &ts, NULL) with ts having tv_sec = 1.
I didn't reproduce it; as you can read above, I'm not seeing any difference, and the results are much more in line with what I'd expect. Yours seem really slow. So a reproducer would be nice, but mostly because I suspect you're doing something wrong.
My bad -- my figures are in nanoseconds. I originally wrote the benchmark using the TSC but then moved to clock_gettime() and nanos. Sorry for the confusion, I can imagine your thoughts about 1 msec latency.
Ahhh nanoseconds makes a lot more sense! Here's my test case:
axboe@intel ~> ./pipe-lat-bench 1 100
Using epoll, wait usec 100
Avg wakeup time: 6
axboe@intel ~> ./pipe-lat-bench 0 100
Using io_uring, wait usec 100
Avg wakeup time: 6
As mentioned, my numbers include the pipe write time and pipe read time.
This is really a good use case for read multishot; using io_uring as an epoll replacement, purely as a notification mechanism, isn't going to bring you any performance improvements -- it'll be about the same performance as epoll for that.
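A rough sketch of what the read-multishot variant could look like (assuming liburing 2.6+ and a 6.7+ kernel for IORING_OP_READ_MULTISHOT; the buffer group id, counts and sizes are arbitrary choices for the example):

```c
/* Sketch: replace "poll, then read(2)" with a single multishot read on the
 * pipe. Every write to the pipe produces a CQE whose data is already copied
 * into one of the provided buffers, so no separate read syscall is needed. */
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BGID     7
#define NBUFS    8
#define BUF_SIZE 64

int main(void)
{
	struct io_uring ring;
	struct io_uring_buf_ring *br;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char *bufs;
	int fds[2], ret;

	pipe(fds);	/* writer side (fds[1]) would be driven elsewhere */
	io_uring_queue_init(8, &ring, 0);

	/* register a ring of provided buffers the kernel picks from */
	br = io_uring_setup_buf_ring(&ring, NBUFS, BGID, 0, &ret);
	bufs = malloc(NBUFS * BUF_SIZE);
	for (int i = 0; i < NBUFS; i++)
		io_uring_buf_ring_add(br, bufs + i * BUF_SIZE, BUF_SIZE, i,
				      io_uring_buf_ring_mask(NBUFS), i);
	io_uring_buf_ring_advance(br, NBUFS);

	/* one SQE, many completions; length 0 = use the selected buffer size */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_multishot(sqe, fds[0], 0, 0, BGID);
	io_uring_submit(&ring);

	for (;;) {
		io_uring_wait_cqe(&ring, &cqe);
		int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

		printf("got %d bytes in buffer %d\n", cqe->res, bid);
		/* hand the buffer back to the kernel for reuse */
		io_uring_buf_ring_add(br, bufs + bid * BUF_SIZE, BUF_SIZE, bid,
				      io_uring_buf_ring_mask(NBUFS), 0);
		io_uring_buf_ring_advance(br, 1);
		io_uring_cqe_seen(&ring, cqe);
	}
}
```

The point is that a single SQE keeps delivering completions with the data already in a provided buffer, so the poll-then-read round trip disappears.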
In any case, once you make a clean reproducer over the weekend, send it my way and I'll take a look and see what's going on.
Beg your pardon for the false alarm -- it looked obvious on my end ((.
So... The clean reproducer showed that the only potentially meaningful difference between io_uring and epoll latency for basic polling of an FD is in busy-poll mode.
I think it's explainable -- to run the taskwork it needs either an interrupt or a syscall, and that's probably why the flags IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG look attractive in this mode.
(io_uring in busy-poll is perfect when the primary communication channels are shm rings but you still need to poll a number of FDs.)
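For illustration, a minimal sketch of that setup (the IORING_SETUP_* names are the real flags; the helper is made up):

```c
#include <liburing.h>

/* Sketch: ring setup for the busy-poll case above. COOP_TASKRUN avoids the
 * IPI for task work, and TASKRUN_FLAG makes the kernel publish
 * IORING_SQ_TASKRUN in the SQ ring flags, so a busy-polling thread can tell
 * when it has to enter the kernel to process pending completions. */
static int setup_ring(struct io_uring *ring)
{
	struct io_uring_params p = { };

	p.flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
	return io_uring_queue_init_params(256, ring, &p);
}
```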
On my end, the "epoll with 1 second timeout" case in my original benchmark was just wrong: because of a recent bug in the reactor, epoll was permanently in busy-wait mode. Anyway, here is the benchmark: poll-bench-mt.c.gz
My local results are (i9-12900F):
vsol@vsol-deb:~/repo/liburing/examples$ ./poll-bench-mt
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 4058 3773 4876 7336
io_uring busy-poll latency. results. avg/med/%90/%99: 1988 1936 2110 2827
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 4282 3993 5324 7280
io_uring busy-poll latency. results. avg/med/%90/%99: 1505 1479 1671 2446
===
epoll poll latency. timeout=1s. results. avg/med/%90/%99: 4331 4035 5448 7575
epoll busy-poll latency. results. avg/med/%90/%99: 1316 1294 1499 1986
Edit: And just for the record -- the same benchmark with the producer and consumer pinned to isolated "tickless" cores (i9-12900F, hyperthreading siblings off):
vsol@vsol-deb:~/repo/liburing/examples$ CORE_READER=4 CORE_WRITER=6 ./poll-bench-mt
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 1803 1768 1938 2501
io_uring busy-poll latency. results. avg/med/%90/%99: 1050 1014 1197 1630
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 1940 1872 2233 2846
io_uring busy-poll latency. results. avg/med/%90/%99: 814 782 968 1375
===
epoll poll latency. timeout=1s. results. avg/med/%90/%99: 1689 1644 1828 2412
epoll busy-poll latency. results. avg/med/%90/%99: 675 647 799 1248
I assume it can be closed now
Yep, all good, thanks for providing the test, I'll try and run it here when time permits. One note -- your DEFER_TASKRUN results aren't valid; it doesn't make sense to set up with COOP_TASKRUN and DEFER_TASKRUN. You really want the 2nd test to be COOP_TASKRUN | TASKRUN_FLAG, and the 3rd addition then to be DEFER_TASKRUN | SINGLE_ISSUER. One more note on the latter: it won't work straight up with just peek, as you need to process events. The busy poll loop would need an io_uring_get_events() call as well for DEFER_TASKRUN.
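A sketch of what that busy-poll loop could look like for the DEFER_TASKRUN case (assuming liburing 2.3+ for io_uring_get_events(); the helper name is made up):

```c
/* Busy-poll sketch for a DEFER_TASKRUN ring: io_uring_peek_cqe() only looks
 * at the CQ ring, so the loop also calls io_uring_get_events() to run the
 * deferred task work that actually posts the completions. */
static struct io_uring_cqe *busy_poll_wait(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	for (;;) {
		if (!io_uring_peek_cqe(ring, &cqe))
			return cqe;
		/* nothing posted yet: process pending task work, then retry */
		io_uring_get_events(ring);
	}
}
```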
With those tweaks, here's what my AMD test box says (7763):
axboe@r7525 ~> ./poll-bench-mt 0.166s
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 33452 33803 37420 39985
io_uring busy-poll latency. results. avg/med/%90/%99: 1706 1493 2093 5891
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 32776 32672 37230 38823
io_uring busy-poll latency. results. avg/med/%90/%99: 3348 1253 1844 189365
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 31663 32190 35527 40446
io_uring busy-poll latency. results. avg/med/%90/%99: 1388 1222 1693 4689
===
epoll poll latency. timeout=1s. results. avg/med/%90/%99: 32138 32131 35857 66975
epoll busy-poll latency. results. avg/med/%90/%99: 1386 1243 1603 4899
axboe@r7525 ~> CORE_READER=0 CORE_WRITER=0 ./poll-bench-mt 40.529s
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 9258 9277 10249 16090
io_uring busy-poll latency. results. avg/med/%90/%99: 2301 1432 3015 57619
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 10734 9728 10329 66485
io_uring busy-poll latency. results. avg/med/%90/%99: 2386 1393 2334 71283
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 9770 9708 10239 30928
io_uring busy-poll latency. results. avg/med/%90/%99: 2277 1413 2274 64171
===
epoll poll latency. timeout=1s. results. avg/med/%90/%99: 9956 9718 11081 34785
epoll busy-poll latency. results. avg/med/%90/%99: 2241 1422 2275 61896
and here's what my intel (8458P) does:
axboe@intel ~> ./poll-bench-mt
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 92842 102937 107092 152077
io_uring busy-poll latency. results. avg/med/%90/%99: 2527 2572 2643 3435
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 91478 101717 106511 132096
io_uring busy-poll latency. results. avg/med/%90/%99: 2064 2040 2130 2921
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 98114 102265 107650 163254
io_uring busy-poll latency. results. avg/med/%90/%99: 2060 2057 2120 2278
===
epoll poll latency. timeout=1s. results. avg/med/%90/%99: 92528 101507 108178 251974
epoll busy-poll latency. results. avg/med/%90/%99: 1951 1952 2006 2114
axboe@intel ~> CORE_READER=0 CORE_WRITER=0 ./poll-bench-mt
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 9486 9437 9811 10366
io_uring busy-poll latency. results. avg/med/%90/%99: 1041 959 1055 7530
===
io_uring with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 10094 9705 9948 40330
io_uring busy-poll latency. results. avg/med/%90/%99: 1019 939 1047 6876
===
io_uring with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER
io_uring poll latency. timeout=1s. results. avg/med/%90/%99: 10088 9719 9872 36190
io_uring busy-poll latency. results. avg/med/%90/%99: 1009 944 1033 6515
===
epoll poll latency. timeout=1s. results. avg/med/%90/%99: 9578 9443 9670 26129
epoll busy-poll latency. results. avg/med/%90/%99: 1108 1052 1137 5969
which is pretty close. Pipes aren't generally a good benchmark, unless you really are using pipes of course, as the wake side doesn't really mimic network workloads in general due to how wakeups are done. This is apparent in how just pinning both consumer and producer on the same CPU is obviously vastly faster.
Worth noting that even with pinning, the results don't seem that stable. So take the above with a grain of salt!
it doesn't make sense to setup with COOP_TASKRUN and DEFER_TASKRUN
Ah, thanks, good to know! Just to clarify, does the COOP_TASKRUN | TASKRUN_FLAG
disable IPIs and still allow the default processing of the taskwork in all syscalls?
Is there a combination which allows me to get a TASKRUN flag, so I know that I have to do io_uring_enter(IORING_ENTER_GETEVENTS),
while keeping other syscalls clean from the extra work?
Yeah, you can add TASKRUN_FLAG with SINGLE_ISSUER|DEFER_TASKRUN and get that.
Just to clarify, does the COOP_TASKRUN | TASKRUN_FLAG disable IPIs and still allow the default processing of the taskwork in all syscalls?
Yep
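For example, a sketch of that combination (the helper names are made up; the loop reads the SQ ring flags through liburing's ring->sq.kflags pointer and only enters the kernel when IORING_SQ_TASKRUN is set):

```c
#include <liburing.h>

/* Sketch: DEFER_TASKRUN + SINGLE_ISSUER + TASKRUN_FLAG. Task work only runs
 * when this thread explicitly asks for it, and IORING_SQ_TASKRUN tells us
 * when that is needed -- other syscalls stay free of the extra work. */
static int setup_deferred_ring(struct io_uring *ring)
{
	struct io_uring_params p = { };

	p.flags = IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER |
		  IORING_SETUP_TASKRUN_FLAG;
	return io_uring_queue_init_params(256, ring, &p);
}

static void busy_poll_step(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	/* only enter the kernel when there is deferred task work to run */
	if (IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_TASKRUN)
		io_uring_get_events(ring);

	while (!io_uring_peek_cqe(ring, &cqe)) {
		/* handle cqe ... */
		io_uring_cqe_seen(ring, cqe);
	}
}
```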
Yeah, you can add TASKRUN_FLAG with SINGLE_ISSUER|DEFER_TASKRUN and get that.
and there will be no interrupts?
Just to clarify, in order of signaling:
Pipes aren't generally a good benchmark, unless you really are using pipes of course, as the wake side doesn't really mimic network workloads in general due to how wakeups are done.
Thank you very much for your help. I'll find some time to explore the difference between descriptor types from this perspective. We need an IPC mechanism supported by io_uring
which would allow waking up a process from multiple others (a pipe is not an option anyway -- it was just convenient for a basic benchmark).
I know you added FUTEX_WAIT in 6.7, but our client won't get it in prod anytime soon.
Hi,
Edit: An accurate test showed no issues with polling a file descriptor via io_uring vs epoll. The only difference is in busy-poll mode (io_uring_peek_cqe() or epoll_wait(0) in a tight loop), where epoll gains some ticks.
Original text: It seems we are at the very first and quite usual phase of migrating a system from epoll to io_uring: I just transformed our reactor component to use io_uring with IORING_OP_POLL_ADD. Surprisingly, I found that the time from writing to a pipe (I tested with eventfd too -- no difference) till waking up from io_uring_wait_cqes() is way higher than the same for epoll_wait().
In other words, the producing thread takes a timestamp t0 right before writing to an fd. The consuming thread hangs on io_uring_wait_cqes(..., 1sec) with MULTISHOT POLL armed (or on epoll_wait()). Once it gets a cqe with EPOLLIN, it takes timestamp t1 and reads from the fd. The (t1-t0) deltas are collected.
Registering the file descriptor didn't help.
The average values in nsec are 1700-2100 for uring, noticeably higher than for epoll. Busy-waiting (epoll_wait(0) and no syscalls for uring) looks better: uring goes down to 900-1100.
So I'm wondering if this is a bad scenario for io_uring by design or whether it can be improved somehow. I'll be happy to provide a clean benchmark if needed.
NB: I constantly set IORING_SETUP_SINGLE_ISSUER and played with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN. IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG improves the busy-wait scenario while making the timed-wait one a bit worse.