axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Large perf regressions with linked timeouts for TCP send() operations #1005

Closed cmazakas closed 9 months ago

cmazakas commented 9 months ago

I'm in the middle of writing some benchmarks for my io_uring runtime and I'm noticing that if I link a timeout request to a TCP send(), I get somewhat large perf regressions.

My benchmark lives here.

The code in question is here.

Inlined:

  {
    /* Note that using IOSQE_IO_LINK with this request type requires the setting
     * of MSG_WAITALL in the flags argument, as a short send isn't considered
     * an error condition without that being set.
     */
    auto sqe = io_uring_get_sqe( ring );
    io_uring_prep_send( sqe, fd, buf_.data(), buf_.size(), MSG_WAITALL );
    io_uring_sqe_set_data( sqe, &sf );
    io_uring_sqe_set_flags( sqe, IOSQE_IO_LINK | IOSQE_FIXED_FILE );
  }

  {
    auto ts = &pstream_->ts_;
    auto sqe = io_uring_get_sqe( ring );
    io_uring_prep_link_timeout( sqe, ts, 0 );
    io_uring_sqe_set_data( sqe, nullptr );
  }

If I update this code to no longer link a timeout, I'll see my run time drop from 19s to 16s! This is actually pretty large considering how small the code change is.

Am I using io_uring's timeouts wrong or is this much of a perf regression expected?

isilence commented 9 months ago

I was always arguing that linked timeouts are expensive, and it should be more efficient to implement it in userspace as some sort of bucketed time wheel with IORING_OP_TIMEOUT and IORING_OP_ASYNC_CANCEL. That's assuming that the timeout requirements are not too strict and most requests complete before it fires.
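
For illustration, that userspace scheme could be sketched roughly like this (entirely hypothetical; the wheel layout, bucket count and helper names are my own, not anything liburing provides): a coarse IORING_OP_TIMEOUT "tick" advances a bucketed wheel, and anything still pending in the expiring bucket gets an IORING_OP_ASYNC_CANCEL submitted against its user data. Requests that complete normally are removed from their bucket first, or the cancel simply fails with -ENOENT.

#include <liburing.h>
#include <array>
#include <vector>

constexpr unsigned num_buckets = 64;                // wheel granularity, illustrative
std::array<std::vector<void*>, num_buckets> wheel;  // user_data of in-flight ops, grouped by deadline
unsigned cursor = 0;

void arm_tick( io_uring* ring, __kernel_timespec* tick )
{
  auto sqe = io_uring_get_sqe( ring );
  io_uring_prep_timeout( sqe, tick, 0, 0 );         // one-shot tick, re-armed from its own CQE
  io_uring_sqe_set_data( sqe, nullptr );
}

void on_tick( io_uring* ring, __kernel_timespec* tick )
{
  cursor = ( cursor + 1 ) % num_buckets;
  for ( auto user_data : wheel[cursor] ) {
    auto sqe = io_uring_get_sqe( ring );
    io_uring_prep_cancel( sqe, user_data, 0 );      // best-effort IORING_OP_ASYNC_CANCEL
    io_uring_sqe_set_data( sqe, nullptr );
  }
  wheel[cursor].clear();
  arm_tick( ring, tick );                           // keep the wheel ticking
}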

To your problem specifically: there was an optimisation bypassing most of the timeout overhead if a request succeeds inline (in the submit syscall), which would usually be the case for sends unless the send queue is full. Is the benchmark pushing data without any pacing, by any chance? It sounds like it's either that, or the optimisation doesn't work, or the difference is from an allocation we do for timeouts. Can you get some profiles?

perf record -g -- ./app && perf report > profile.txt

cmazakas commented 9 months ago

That's assuming that the timeout requirements are not too strict and most requests complete before it fires.

They aren't and they do! In my case, timeouts can be considered a rare event.

Is the benchmark pushing data without any pacing, by any chance?

Yup, I call send() basically as soon as I physically can. Should I be doing something different here?

For what it's worth, I decouple creation of the SQEs from the submit calls. So I'll run through everything, generate a whole bunch of SQEs and then call io_uring_submit_and_wait(ring, 1); in the main loop just once, and then from there I'll reap whatever CQEs are available and execute the suspended tasks.
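
A minimal sketch of that loop, with handle_cqe standing in for whatever resumes the suspended task attached via io_uring_sqe_set_data() (the names here are illustrative, not the actual benchmark code):

#include <liburing.h>

void handle_cqe( io_uring_cqe* cqe )
{
  // Placeholder: the real dispatch resumes the task stored in the CQE's user data.
  (void)cqe;
}

void run( io_uring* ring )
{
  for ( ;; ) {
    // SQEs were queued earlier by the tasks; submit them all and block
    // until at least one completion is available.
    io_uring_submit_and_wait( ring, 1 );

    unsigned head = 0;
    unsigned seen = 0;
    io_uring_cqe* cqe = nullptr;
    io_uring_for_each_cqe( ring, head, cqe ) {
      handle_cqe( cqe );
      ++seen;
    }
    io_uring_cq_advance( ring, seen );  // mark the reaped CQEs as consumed
  }
}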

Can you get some profiles?

Going to try and work on this.

cmazakas commented 9 months ago

profile.txt

Alright, this is a reduced sample of my benchmark. Hopefully it can help shed some light. Let me know if there's anything else you'd need from me!

For the record, I'm okay with linked timeouts having a cost. I'm just trying to ascertain whether or not I'm simply using the library wrong or if there's something you guys (the io_uring experts) should be aware of.

cmazakas commented 9 months ago

profile.txt

This one might be a bit more realistic.

Seems like the timeout stuff doesn't really come into play here but, man, almost all the time is spent in the TCP send part!

isilence commented 9 months ago

That's assuming that the timeout requirements are not too strict and most requests complete before it fires.

They aren't and they do! In my case, timeouts can be considered a rare event.

Yeah, that's what it usually is...

Does the benchmark pushing data without any pacing by any chance?

Yup, I call send() basically as soon as I physically can. Should I be doing something different here?

Pacing might be useful depending on the application, but I was only asking to understand the timeout part

For what it's worth, I decouple creation of the SQEs from the submit calls. So I'll run through everything, generate a whole bunch of SQEs and then call io_uring_submit_and_wait(ring, 1); in the main loop just once, and then from there I'll reap whatever CQEs are available and execute the suspended tasks.

Sounds good

This one might be a bit more realistic. Seems like the timeout stuff doesn't really come into play here but, man, almost all the time is spent in the TCP send part!

The kernel net stack is heavy, and it only makes sense for io_uring to be thin in general. And frankly I'd prefer it to take less than it does in your profiles. Roughly __do_sys_io_uring_enter - sock_sendmsg - sock_recvmsg = 22%

From the profile:

--3.32%--io_link_timeout_prep
          |          
           --3.21%--__io_timeout_prep
                     |          
                     |--1.13%--io_alloc_async_data
                     |          |          
                     |           --1.02%--__kmalloc
                     |          
                      --1.13%--get_timespec64
                                |          
                                 --0.95%--_copy_from_user

3.21% is spent initialising linked timeout requests, which is a lot... ~1% of which is the allocations I mentioned.

--1.51%--io_clean_op
          |          
           --1.41%--kfree

And this is the matching kfree.

4.15%--__io_arm_ltimeout

There is also this one, which should only happen if a request truly goes async. Are there linked timeouts for receives as well?

--1.31%--hrtimer_try_to_cancel
          |          
           --1.13%--hrtimer_try_to_cancel.part.0
                     |          
                      --0.59%--__remove_hrtimer

And some additional overhead for removing timeouts.

The total damage from timeouts is at least 10.29% here. The allocations can be relatively easily optimised by caching (2.5% -> 0.5%?). It's not conclusive why __io_arm_ltimeout is there; read+timeout? If so, can you test whether there is any difference when timeouts are only used for sends?

isilence commented 9 months ago

Btw, if you end up replacing it with IORING_OP_TIMEOUT, note that it has a multishot mode, i.e. you queue it up once and it fires CQEs every N ns.
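
A small sketch of arming one, assuming a kernel and liburing new enough to expose IORING_TIMEOUT_MULTISHOT (the function name is just for illustration):

#include <liburing.h>

void arm_periodic_timeout( io_uring* ring, __kernel_timespec* interval )
{
  auto sqe = io_uring_get_sqe( ring );
  // count == 0 plus the multishot flag: post a CQE every `interval` until cancelled.
  io_uring_prep_timeout( sqe, interval, 0, IORING_TIMEOUT_MULTISHOT );
  io_uring_sqe_set_data( sqe, nullptr );
}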

cmazakas commented 9 months ago

There is also this one, which should only happen if a request truly goes async. Are there linked timeouts for receives as well?

There are! I've updated the code to eschew timeouts for multishot recv().

The total damage from timeouts is at least 10.29% here. The allocations can be relatively easily optimised by caching (2.5% -> 0.5%?). It's not conclusive why __io_arm_ltimeout is there; read+timeout? If so, can you test whether there is any difference when timeouts are only used for sends?

There is a read+timeout. I have a multishot recv() that's linked against a timeout. I found this to be a relatively easy/efficient way of adding timeouts to such an op. I then have internal logic that detects cancellation by the linked timeout and if needed, reschedules everything.
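
For illustration, that detection boils down to something like this (task, reschedule and complete are stand-ins for my runtime's internals, not liburing API):

#include <liburing.h>
#include <cerrno>

struct task;                        // the suspended operation, stored via io_uring_sqe_set_data()
void reschedule( task* t );         // re-arms the op (stand-in)
void complete( task* t, int res );  // normal completion path (stand-in)

void on_cqe( io_uring_cqe* cqe )
{
  auto t = static_cast<task*>( io_uring_cqe_get_data( cqe ) );
  if ( !t )
    return;                         // CQE for the linked timeout itself (its user data is nullptr)

  if ( cqe->res == -ECANCELED ) {
    // The linked timeout fired first and cancelled the op; reschedule if appropriate.
    reschedule( t );
    return;
  }

  complete( t, cqe->res );
}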

I actually tried removing the timeouts from the multishot recv()s and it seems like my results are the same, i.e. 19.5s when timed out send() calls are used.

I also regenerated the profile as you asked. Regarding this:

The kernel net stack is heavy, and it only makes sense for io_uring to be thin in general. And frankly I'd prefer it to take less than it does in your profiles. Roughly __do_sys_io_uring_enter - sock_sendmsg - sock_recvmsg = 22%

profile.txt

--89.85%--__do_sys_io_uring_enter
--54.35%--sock_sendmsg
--11.13%--sock_recvmsg
=> 89.85 - 54.35 - 11.13 = 24.37%

Hopefully this gives you more information. If you need anything else on my end, I can keep refactoring my code and changing it.

Looking at these timings, it definitely seems like multishot recv() is doing its job! I knew refactoring my runtime to use it would be worth it!

isilence commented 9 months ago

I actually tried removing the timeouts from the multishot recv()s and it seems like my results are the same, i.e. 19.5s when timed out send() calls are used.

That sounds suspicious, as I don't see send requests polling sockets. That's good for the app, but I need to double-check whether the timeout optimisation actually works.

The kernel net stack is heavy, and it only makes sense for io_uring to be thin in general. And frankly I'd prefer it to take less than it does in your profiles. Roughly __do_sys_io_uring_enter - sock_sendmsg - sock_recvmsg = 22%

profile.txt

--89.85%--__do_sys_io_uring_enter
--54.35%--sock_sendmsg
--11.13%--sock_recvmsg
=> 89.85 - 54.35 - 11.13 = 24.37%

Yeah, a lot. As a side note, sends look so heavy here because it's a loopback, which means the send call chain also includes handling most of the receive part for the other end.

Hopefully this gives you more information. If you need anything else on my end, I can keep refactoring my code and changing it.

Pretty helpful, thanks! One more question: what kernel version are you using?

Looking at these timings, it definitely seems like multishot recv() is doing its job! I knew refactoring my runtime to use it would be worth it!

Great to hear that!

cmazakas commented 9 months ago

That sounds suspicious, as I don't see send requests polling sockets. That's good for the app, but I need to double-check whether the timeout optimisation actually works.

Hmm, does it matter then that I'm adding this to the multishot recv?

auto flags = IOSQE_FIXED_FILE | IOSQE_IO_LINK | IOSQE_BUFFER_SELECT;
auto sqe = io_uring_get_sqe( ring );
io_uring_prep_recv_multishot( sqe, fd, nullptr, 0, 0 );
io_uring_sqe_set_data( sqe, this );
io_uring_sqe_set_flags( sqe, flags );
sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
sqe->buf_group = buffer_group_id_;

Pretty helpful, thanks! One more question: what kernel version are you using?

I actually can't believe I forgot this. A relatively recent one:

exbigboss@exbigboss-ubuntu:~$ uname -a
Linux exbigboss-ubuntu 6.5.0-13-generic #13-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov  3 12:16:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

isilence commented 9 months ago

Hmm, does it matter then that I'm adding this to the multishot recv?

Numbers would be a better judge: what's the performance diff between "no timeouts" vs "linked timeouts only for recv"? Sounds like they are roughly the same, which would mean that timeouts are long enough, each multishot request receives enough data / posts enough CQEs before being cancelled, and so the overhead is amortized well. If so, it sounds fine to me, at least from the performance perspective, even though it can be subject to workload.

cmazakas commented 9 months ago

Btw, what was the ultimate verdict here? Is io_uring not applying the optimization? I also realized I'm linking timeouts for connect() as well, so that may've skewed some things. I can regenerate the profile, of course. But even then, linked timeouts for connect() are a fixed one-time cost, so they shouldn't be overly large here.

My benchmark sets up 5,000 clients in one thread connecting to a server in another thread. So in total, there are 10,000 TCP sockets, and I have the benchmark send 1,000 echos. So there are 1,000 sends and recvs on both sides here.

connect() should in theory only be a small part of this in comparison to the recv/send operations.

I also have a prototype based on your advice: set up a multishot timer. I set up a multishot timeout operation per socket, which does activity checking for the tcp::stream/tcp::client objects I have. The results are actually quite nice and gravitate towards net-zero overhead. So overall, I think I'm going to go with this approach for handling timeouts.
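
Roughly, the per-socket activity check looks something like this (stream, pending_recv and the idle limit are illustrative stand-ins for my own types):

#include <liburing.h>
#include <chrono>

struct stream {
  void* pending_recv = nullptr;                         // user_data of the outstanding multishot recv
  std::chrono::steady_clock::time_point last_activity;  // bumped whenever this socket sees a CQE
};

void on_activity_timeout( io_uring* ring, stream* s, std::chrono::seconds idle_limit )
{
  if ( std::chrono::steady_clock::now() - s->last_activity < idle_limit )
    return;                                             // saw traffic recently, nothing to do

  // Idle for too long: cancel the outstanding op on this stream and let the
  // resulting -ECANCELED completion drive the teardown/reschedule logic.
  auto sqe = io_uring_get_sqe( ring );
  io_uring_prep_cancel( sqe, s->pending_recv, 0 );
  io_uring_sqe_set_data( sqe, nullptr );
}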

I appreciate all the effort you've put into helping me, Pavel.

isilence commented 9 months ago

Btw, what was the ultimate verdict here? Is io_uring not applying the optimization?

Yes, that's what happened. I just sent out a patch:

https://lore.kernel.org/all/cover.1701390926.git.asml.silence@gmail.com/

Can you remind me of your email? I'll stick it onto the patch as a Reported-by tag with your name on it, if you're up to it.

I also have a prototype based on your advice: set up a multishot timer. I set up a multishot timeout operation per socket, which does activity checking for the tcp::stream/tcp::client objects I have. The results are actually quite nice and gravitate towards net-zero overhead. So overall, I think I'm going to go with this approach for handling timeouts.

That sounds great, and it's always good to have an example and a precedent for an approach we can point to! The patch will remove ~5.5% according to the profile, but even with that linked timeouts are not free.

I appreciate all the effort you've put into helping me, Pavel.

And thank you for helping get to the bottom of it! In the end, it's indeed a silly regression we didn't notice.

cmazakas commented 9 months ago

Can you remind me of your email? I'll stick it onto the patch as a Reported-by tag with your name on it, if you're up to it.

Sure, it's christian.mazakas@gmail.com

Sounds like everything's been handled and I'll be closing the issue. Thank you so much!