axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Mute CQEs of send/write to reduce wakeups #1264

Open pyhd opened 1 month ago

pyhd commented 1 month ago

wait_timeout(nr) is generally a good way to reduce wakeups from the kernel, but CQEs from send/write add unnecessary "noise", especially with many zero-copy sends. It is hard to predict when a send/write will complete, yet its CQE is generally not latency sensitive. So I think a possible solution is a MUTE_SUCCESS flag on the SQE, so that its CQE does not count toward waking the waiter.

if (sq_ready) {
    submit_and_wait_timeout(nr, 1ms);
} else {
    if (inflight_sends)
        wait_timeout(1, 100ms);  // even with no wakeup CQEs, muted CQEs are still reaped on each timeout, poll-style
    else
        wait(1); // if no pending send/write CQEs
}
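The branching above can be sketched as a pure decision function, which makes the three cases easy to check in isolation. This is only an illustration: `wait_kind`, `wait_plan`, and `plan_wait` are hypothetical names, not part of liburing, and the liburing calls named in the comments are just the ones each branch would most plausibly map to.

```c
#include <stdbool.h>

/* Hypothetical: which kind of wait the event loop issues next. */
enum wait_kind {
    SUBMIT_AND_WAIT_TIMEOUT, /* would map to io_uring_submit_and_wait_timeout() */
    WAIT_TIMEOUT,            /* would map to io_uring_wait_cqe_timeout() */
    WAIT_BLOCKING            /* would map to io_uring_wait_cqe() */
};

/* Hypothetical: the parameters of that wait. */
struct wait_plan {
    enum wait_kind kind;
    unsigned wait_nr;  /* number of CQEs to wait for */
    long timeout_ms;   /* -1 means no timeout */
};

/* Mirrors the three branches of the pseudocode above. */
struct wait_plan plan_wait(bool sq_ready, unsigned inflight_sends, unsigned nr)
{
    if (sq_ready) {
        /* SQEs are queued: submit them and wait briefly for nr CQEs. */
        return (struct wait_plan){ SUBMIT_AND_WAIT_TIMEOUT, nr, 1 };
    }
    if (inflight_sends > 0) {
        /* Muted send CQEs may be pending: wake periodically to reap them. */
        return (struct wait_plan){ WAIT_TIMEOUT, 1, 100 };
    }
    /* Nothing muted is outstanding: block until any CQE arrives. */
    return (struct wait_plan){ WAIT_BLOCKING, 1, -1 };
}
```

For example, `plan_wait(false, 2, 8)` selects the 100 ms timed wait, since two sends are still in flight and nothing is queued for submission.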

axboe commented 1 month ago

Yep, this is not a bad idea; we've bounced around ideas for this very thing in the past as well. Send is a good example: sends generally complete inline (i.e. immediately), but that's not guaranteed. And while you don't need an immediate notification for them, you generally do want to see one so that you know the buffer it sent can be reused. Hence IOSQE_CQE_SKIP_SUCCESS isn't really useful for this case.

I think what we'd need is something like a low priority completion, in the sense that it doesn't need to wake up the waiting task, but it should be included in the "I'm waiting for this number of events" accounting.

A quick work-around with the existing code may be to just discount the write/send in the wait_nr.
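One possible reading of this work-around, sketched below: since every CQE counts toward wait_nr today, the application can add the send/write completions it expects to finish inline to the number of events it actually cares about, so that those completions alone do not satisfy the wait. The helper name `discounted_wait_nr` and the clamp to the CQ ring size are assumptions made for illustration, not something the thread specifies.

```c
/*
 * Hypothetical helper: compute the wait_nr to pass to a submit-and-wait
 * call so that expected inline send/write CQEs do not end the wait early.
 *
 *   wanted                 CQEs the application really wants (e.g. recvs)
 *   expected_inline_sends  sends/writes expected to complete inline
 *   cq_entries             CQ ring size (assumed upper bound for wait_nr)
 */
unsigned discounted_wait_nr(unsigned wanted,
                            unsigned expected_inline_sends,
                            unsigned cq_entries)
{
    unsigned nr = wanted + expected_inline_sends;

    /* Assumption: never ask for more CQEs than the ring can hold. */
    if (nr > cq_entries)
        nr = cq_entries;
    return nr;
}
```

The estimate is only as good as the guess about inline completions: a send that goes async instead would leave the waiter short by one CQE until the timeout fires.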

axboe commented 1 month ago

https://lore.kernel.org/io-uring/20241014205416.456078-1-axboe@kernel.dk/T/#m19db4fd576c4cf3c5a5ef3ea0b71e175a3574e15

Tossed out a suggestion for handling something like this.

redbaron commented 1 month ago

What if the CQ is overflowing with now-ignored CQEs and no wakeup-worthy CQE has arrived?

axboe commented 1 month ago

There are several conditions that would still cause it to wake, like a short send/write (or an error), and overflow would be another one. Didn't cover the overflow case, but that will be done too. Anything but a fully successful send with a normal CQE posting would wake things up, naturally.
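The wake conditions listed here can be written down as a small predicate. This is a user-space model of the rule as described in this comment, not kernel code; `send_cqe_wakes` and its parameters are hypothetical names.

```c
#include <stdbool.h>

/*
 * Hypothetical model: does a send/write completion wake the waiter?
 *
 *   res            CQE result: bytes transferred, or a negative errno
 *   requested      bytes the SQE asked to transfer
 *   cq_overflowed  true if the CQE could not be posted normally
 */
bool send_cqe_wakes(int res, unsigned requested, bool cq_overflowed)
{
    if (res < 0)
        return true;              /* error */
    if ((unsigned)res < requested)
        return true;              /* short send/write */
    if (cq_overflowed)
        return true;              /* overflow instead of a normal posting */
    return false;                 /* fully successful, normally posted */
}
```

Only the last case, a full-length success posted normally to the CQ ring, stays silent.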

pyhd commented 1 month ago

@axboe

> I think what we'd need is something like a low priority completion, in the sense that it doesn't need to wake up the waiting task, but it should be included in the "I'm waiting for this number of events" accounting.

I suppose you want to put a backlog limit on ignorable events, but that would add a new parameter to all existing wait_cqe variants, which might be a little confusing.

> https://lore.kernel.org/io-uring/20241014205416.456078-1-axboe@kernel.dk/T/#m19db4fd576c4cf3c5a5ef3ea0b71e175a3574e15
>
> Tossed out a suggestion for handling something like this.

I am afraid that ignoring inline completions is not enough, because the number of inline completions is the more predictable part. Async successes and zero-copy (zc) notifications, on the other hand, are largely out of our control, especially when in-flight CQEs vastly outnumber potential read/recv CQEs. So even if inline successes can be ignored, the CQ ring may still be flooded by in-flight CQEs from previous rounds.

However, MUTE_SUCCESS would probably be less confusing. E.g., in a submit_wait_timeout(nr) call, the developer can explicitly expect nr incoming requests or errors, while any muted CQEs are just byproducts.
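To make the proposed accounting concrete, here is a small user-space simulation of it. MUTE_SUCCESS does not exist in io_uring; `sim_cqe` and `cqes_posted_at_wakeup` are hypothetical names, and the rule modeled is only the one stated above: a muted CQE counts toward nr only if it carries an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a posted CQE. */
struct sim_cqe {
    int res;     /* result: >= 0 on success, negative errno on failure */
    bool muted;  /* its SQE carried the proposed MUTE_SUCCESS flag */
};

/*
 * Returns how many CQEs have been posted at the moment a waiter asking
 * for nr wakeable CQEs would return, or SIZE_MAX if that count is never
 * reached (the wait would run into its timeout).
 */
size_t cqes_posted_at_wakeup(const struct sim_cqe *cqes, size_t n, unsigned nr)
{
    unsigned wakeable = 0;

    if (nr == 0)
        return 0; /* nothing to wait for */

    for (size_t i = 0; i < n; i++) {
        /* Muted successes are byproducts; everything else counts. */
        if (!cqes[i].muted || cqes[i].res < 0)
            wakeable++;
        if (wakeable >= nr)
            return i + 1;
    }
    return SIZE_MAX;
}
```

With two muted successful sends followed by one recv, a waiter with nr = 1 returns only once the recv arrives, at which point all three CQEs can be reaped together.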