Open pyhd opened 1 month ago
Yep this is not a bad idea, we've bounced around ideas for this very thing in the past as well. Send is a good example: sends generally complete inline (i.e. immediately), but that's not guaranteed. And while you don't need an immediate notification for them, you generally do want to see one so that you know the buffer that was sent can be reused. Hence IOSQE_CQE_SKIP_SUCCESS isn't really useful for this case.
I think what we'd need is something like a low priority completion, in the sense that it doesn't need to wakeup the task waiting, but it should be included in the "I'm waiting for this number of events" accounting.
A quick work-around with the existing code may be to just discount the write/send in the wait_nr.
Tossed out a suggestion for handling something like this.
What if the CQ is overflowing with now-ignored CQEs and no wakeup-worthy CQE has arrived?
There are several conditions that would still cause it to wake, like a short send/write (or an error), and overflow would be another one. Didn't cover the overflow case, but that will be done too. Anything but a fully successful send with a normal CQE posting would wake things up, naturally.
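The wake conditions listed above can be sketched as a small predicate. This is only a model of the policy as described in this thread, not kernel code: `struct model_cqe`, its fields, and `cqe_wakes_waiter()` are hypothetical names made up for illustration, and the real implementation may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a posted completion; NOT a kernel or liburing type. */
struct model_cqe {
    int    res;        /* bytes transferred, or a negative errno value */
    size_t requested;  /* bytes the send/write asked to transfer */
    bool   low_prio;   /* request was marked as a "low priority completion" */
};

/*
 * Returns true if posting this completion should wake a waiting task.
 * Only a fully successful low priority send/write with a normal CQE
 * posting stays quiet; everything else wakes the waiter.
 */
static bool cqe_wakes_waiter(const struct model_cqe *cqe, bool cq_overflowed)
{
    assert(cqe != NULL);

    if (cq_overflowed)
        return true;                        /* overflow always wakes */
    if (!cqe->low_prio)
        return true;                        /* ordinary completion */
    if (cqe->res < 0)
        return true;                        /* error */
    if ((size_t)cqe->res < cqe->requested)
        return true;                        /* short send/write */
    return false;                           /* full success: no wakeup needed */
}
```

In this model a quiet completion would still be posted to the CQ ring and still count toward the "waiting for this number of events" accounting; the predicate only decides whether it triggers a wakeup on its own.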
@axboe
I think what we'd need is something like a low priority completion, in the sense that it doesn't need to wakeup the task waiting, but it should be included in the "I'm waiting for this number of events" accounting.
I suppose you want to put a backlog limit on ignorable events, but that would add a new parameter to all existing wait_cqe variants, which might be a little confusing.
Tossed out a suggestion for handling something like this.
I am afraid handling inline completions is not enough, because the number of inline completions is the more predictable part. Async successes and zc notifications, on the other hand, are largely out of our control, especially when inflight CQEs vastly outnumber the potential read/recv CQEs. Therefore, even if inline successes can be ignored, the CQ ring may still be flooded by inflight CQEs from previous rounds.
However, MUTE_SUCCESS could probably be less confusing. E.g. in a submit_wait_timeout(nr) syscall, the developer can explicitly expect nr incoming requests or errors, while any muted CQEs are just byproducts.
wait_timeout(nr) is generally a good way to reduce wakeups from the kernel, while the CQEs of send/write can bring unnecessary "noise", especially with plenty of zero-copy sends. In essence, it is difficult to estimate when a send/write will complete, yet their CQEs are generally not latency sensitive. So I think a possible solution is to flag MUTE_SUCCESS in the SQE, so that its CQE will not be counted as wakeable.
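To make the proposed accounting concrete, here is a minimal user-space simulation of how a waiter for nr events might behave with and without muting. MUTE_SUCCESS does not exist in io_uring; `struct sim_cqe` and `cqes_posted_at_wakeup()` are hypothetical names, and the sketch assumes that only successful muted completions are skipped in the count, so errors still count toward nr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simulated completion; NOT a kernel or liburing type. */
struct sim_cqe {
    int  res;    /* result: >= 0 on success, negative errno on failure */
    bool muted;  /* the SQE carried the proposed (non-existent) MUTE_SUCCESS */
};

/*
 * Walks CQEs in posting order and returns how many had been posted when a
 * waiter for `nr` events would wake, or 0 if it would still be waiting.
 * With honor_mute set, successful muted CQEs are posted but not counted.
 */
static size_t cqes_posted_at_wakeup(const struct sim_cqe *cqes, size_t n,
                                    unsigned nr, bool honor_mute)
{
    unsigned counted = 0;

    assert(nr > 0);
    for (size_t i = 0; i < n; i++) {
        bool ignorable = honor_mute && cqes[i].muted && cqes[i].res >= 0;

        if (!ignorable)
            counted++;
        if (counted >= nr)
            return i + 1;
    }
    return 0;
}
```

For a posting order of send, send, recv, send, recv with all sends muted and nr = 2, the simulated waiter wakes after the second CQE when muting is ignored, but only after the fifth CQE (the second recv) when muting is honored.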