cmazakas closed this issue 9 months ago
I was always arguing that linked timeouts are expensive, and it should be more efficient to implement them in userspace as some sort of bucketed time wheel with `IORING_OP_TIMEOUT` and `IORING_OP_ASYNC_CANCEL`. That's assuming that the timeout requirements are not too strict and most requests complete before it fires.
To your problem specifically: there was an optimisation bypassing most of the timeout overhead if a request succeeds inline (in the submit syscall), which would usually be the case for sends unless the send queue is full. Is the benchmark pushing data without any pacing, by any chance? It sounds like it's either that, or the optimisation doesn't work, or the difference is from an allocation we do for timeouts. Can you get some profiles?
```shell
perf record -g -- ./app && perf report > profile.txt
```
> That's assuming that the timeout requirements are not too strict and most requests complete before it fires.
They aren't and they do! In my case, timeouts can be considered a rare event.
> Is the benchmark pushing data without any pacing, by any chance?
Yup, I call send() basically as soon as I physically can. Should I be doing something different here?
For what it's worth, I decouple creation of the SQEs from the submit calls. So I'll run through everything, generate a whole bunch of SQEs and then call `io_uring_submit_and_wait(ring, 1);` in the main loop just once, and from there I'll reap whatever CQEs are available and execute the suspended tasks.
> Can you get some profiles?
Going to try and work on this.
Alright, this is a reduced sample of my benchmark. Hopefully it can help shed some light. Let me know if there's anything else you'd need from me!
For the record, I'm okay with linked timeouts having a cost. I'm just trying to ascertain whether I'm simply using the library wrong or there's something you guys (the io_uring experts) should be aware of.
This one might be a bit more realistic.
Seems like the timeout stuff definitely doesn't come into play here but, man, almost all the time is spent in the TCP send part!
> > That's assuming that the timeout requirements are not too strict and most requests complete before it fires.
>
> They aren't and they do! In my case, timeouts can be considered a rare event.
Yeah, that's what it usually is...
> > Is the benchmark pushing data without any pacing, by any chance?
>
> Yup, I call send() basically as soon as I physically can. Should I be doing something different here?
Pacing might be useful depending on the application, but I was only asking to understand the timeout part
> For what it's worth, I decouple creation of the SQEs from the submit calls. So I'll run through everything, generate a whole bunch of SQEs and then call `io_uring_submit_and_wait(ring, 1);` in the main loop just once, and from there I'll reap whatever CQEs are available and execute the suspended tasks.
Sounds good
> This one might be a bit more realistic. Seems like the timeout stuff definitely doesn't come into play here but, man, almost all the time is spent in the TCP send part!
The kernel net stack is heavy, and it only makes sense for io_uring to be thin in general. And frankly I'd prefer it taking less than in your profiles. Roughly: `__do_sys_io_uring_enter - sock_sendmsg - sock_recvmsg = 22%`.
From the profile:
```
--3.32%--io_link_timeout_prep
           |
            --3.21%--__io_timeout_prep
                      |
                      |--1.13%--io_alloc_async_data
                      |          |
                      |           --1.02%--__kmalloc
                      |
                       --1.13%--get_timespec64
                                  |
                                   --0.95%--_copy_from_user
```
3.21% is spent initialising linked timeout requests, and that's a lot... ~1% of which is the allocations I mentioned.
```
--1.51%--io_clean_op
           |
            --1.41%--kfree
```
And it's the matching `kfree`.
```
4.15%--__io_arm_ltimeout
```
There is also this one, which should only happen if a request truly goes async. Are there linked timeouts for receives as well?
```
--1.31%--hrtimer_try_to_cancel
           |
            --1.13%--hrtimer_try_to_cancel.part.0
                      |
                       --0.59%--__remove_hrtimer
```
And some additional overhead for removing timeouts.
The total damage from timeouts is at least 10.29% here. The allocations can be relatively easily optimised by caching (2.5% -> 0.5%?). Not conclusive why `__io_arm_ltimeout` is there, read+timeout? If so, can you test if there is any difference when timeouts are only used for sends?

Btw, if you would be replacing it with `OP_TIMEOUT`, note that it has a multishot mode, i.e. you queue it up once and it fires CQEs every N ns.
> There is also this one, which should only happen if a request truly goes async. Are there linked timeouts for receives as well?
There are! I've updated the code to eschew timeouts for multishot recv().
> The total damage from timeouts is at least 10.29% here. The allocations can be relatively easily optimised by caching (2.5% -> 0.5%?). Not conclusive why `__io_arm_ltimeout` is there, read+timeout? If so can you test if there is any difference when timeouts are only used for sends?
There is a read+timeout. I have a multishot recv() that's linked against a timeout. I found this to be a relatively easy/efficient way of adding timeouts to such an op. I then have internal logic that detects cancellation by the linked timeout and, if needed, reschedules everything.
I actually tried removing the timeouts from the multishot recv()s and it seems like my results are the same, i.e. 19.5s when timed out send() calls are used.
I also regenerated the profile as you asked, with
> The kernel net stack is heavy, and it only makes sense io_uring being thin in general. And frankly I'd prefer it taking less than in your profiles. Roughly `__do_sys_io_uring_enter - sock_sendmsg - sock_recvmsg = 22%`
```
--89.85%--__do_sys_io_uring_enter
--54.35%--sock_sendmsg
--11.13%--sock_recvmsg
```
=> 89.85 - 54.35 - 11.13 = 24.37%
Hopefully this gives you more information. If you need anything else on my end, I can keep refactoring my code and changing it.
Looking at these timings, it definitely seems like multishot recv() is doing its job! I knew refactoring my runtime to use it would be worth it!
> I actually tried removing the timeouts from the multishot recv()s and it seems like my results are the same, i.e. 19.5s when timed out send() calls are used.
That sounds suspicious, as I don't see send requests polling sockets. That's good for the app, but I need to double-check if the timeout optimisation actually works.
> > The kernel net stack is heavy, and it only makes sense io_uring being thin in general. And frankly I'd prefer it taking less than in your profiles. Roughly `__do_sys_io_uring_enter - sock_sendmsg - sock_recvmsg = 22%`
>
> `--89.85%--__do_sys_io_uring_enter --54.35%--sock_sendmsg --11.13%--sock_recvmsg` => 89.85 - 54.35 - 11.13 = 24.37%
Yeah, a lot. A side note: sends look so heavy here because it's a loopback, which means the send call chain also includes handling most of the receive part of the other end.
> Hopefully this gives you more information. If you need anything else on my end, I can keep refactoring my code and changing it.
Pretty helpful, thanks! One more question: what kernel version are you using?
> Looking at these timings, it definitely seems like multishot recv() is doing its job! I knew refactoring my runtime to use it would be worth it!
Great to hear that!
> That sounds suspicious as I don't see send requests polling sockets. That's good for the app, but I need to double check if the timeout optimisation actually works
Hmm, does it matter then that I'm adding this to the multishot recv?
```cpp
auto flags = IOSQE_FIXED_FILE | IOSQE_IO_LINK | IOSQE_BUFFER_SELECT;

auto sqe = io_uring_get_sqe( ring );
io_uring_prep_recv_multishot( sqe, fd, nullptr, 0, 0 );
io_uring_sqe_set_data( sqe, this );
io_uring_sqe_set_flags( sqe, flags );
sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
sqe->buf_group = buffer_group_id_;
```
> Pretty helpful, thanks! One more question: what kernel version are you using?
I actually can't believe I forgot this. A relatively recent one:

```
exbigboss@exbigboss-ubuntu:~$ uname -a
Linux exbigboss-ubuntu 6.5.0-13-generic #13-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 3 12:16:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```
> Hmm, does it matter then that I'm adding this to the multishot recv?
Numbers would be a better judge -- what's the performance diff between "no timeouts" vs "linked timeouts only for recv"? Sounds like they're roughly the same, which would mean that the timeouts are long enough, each multishot request receives enough data / posts enough CQEs before being cancelled, and so the overhead is amortised well. If so, it sounds fine to me, at least from the performance perspective, even though it can be subject to workload.
Btw, what was the ultimate verdict here? Is io_uring not applying the optimization? I also realized, I'm linking timeouts for connect() as well, so that may've skewed some things. I can regenerate the profile, of course. But even then, linked timeouts for connect() is a fixed one-time cost so it shouldn't be overbearingly large here.
My benchmark sets up 5,000 clients in one thread talking to a server in another thread. So in total, there's 10,000 TCP sockets and I have the benchmark send 1,000 echoes. So there's 1,000 sends and recvs on both sides here.
connect() should in theory only be a small part of this in comparison to the recv/send operations.
I also have a prototype based on your advice: set up a multishot timer. I set up a multishot timeout operation per socket which does activity checking for the `tcp::stream`/`tcp::client` objects I have. The results are actually quite nice and gravitate towards net zero overhead. So overall, I think I'm going to go with this approach for handling timeouts.
I appreciate all the effort you've put into helping me, Pavel.
> Btw, what was the ultimate verdict here? Is io_uring not applying the optimization?
Yes, that's what happened. I just sent out a patch:
https://lore.kernel.org/all/cover.1701390926.git.asml.silence@gmail.com/
Can you remind your email? I'll stick it onto the patch as a Reported-by tag with your name on it, if you're up to it.
> I also have a prototype based on your advice: set up a multishot timer. I set up a multishot timeout operation per socket which does activity checking for the `tcp::stream`/`tcp::client` objects I have. The results are actually quite nice and gravitate towards net zero overhead. So overall, I think I'm going to go with this approach for handling timeouts.
That sounds great, and it's always good to have an example and a precedent for an approach we can point to! The patch will remove ~5.5% according to the profile, but even with that linked timeouts are not free.
> I appreciate all the effort you've put into helping me, Pavel.
And thank you for helping get to the bottom of it! In the end it's indeed a silly regression we didn't notice.
> Can you remind your email? I'll stick it onto the patch as a Reported-by tag with your name on it, if you're up to it.
Sure, it's christian.mazakas@gmail.com
Sounds like everything's been handled and I'll be closing the issue. Thank you so much!
I'm in the middle of writing some benchmarks for my io_uring runtime and I'm noticing that if I link a timeout request to a TCP send(), I get somewhat large perf regressions.
My benchmark lives here.
The code in question is here.
Inlined:
If I update this code to no longer link a timeout, I'll see my run time drop from 19s to 16s! This is actually pretty large considering how small the code change is.
Am I using io_uring's timeouts wrong or is this much of a perf regression expected?