axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

RCU calls made by the NAPI busy poll code generate context switches occupying 50% of the time of the CPU hosting the sqp thread #1190

Closed lano1106 closed 2 months ago

lano1106 commented 2 months ago

When SQPOLL busy polls, it completely starves any attached rings. If those other rings are also configured for busy polling, then by not servicing them for some time their IRQs get re-enabled (if you play with the napi_defer_hard_irqs and gro_flush_timeout settings), which pumps in new events that SQPOLL notices and that interrupt the busy poll.

However, if there is only 1 NAPI device shared by all the io_uring rings, by doing something like: ethtool -L enp39s0 combined 1

then polling the device for one io_uring for a very long time will also inhibit the device from generating the interrupts that would normally break the busy loop so that sqpoll processes the other rings...

My current workaround is to specify a very small busy poll interval... I currently have no better idea, but the ideal solution would be to loop through all the attached busy-looping io rings in the same way that all the NAPI devices of the context are iterated, to ensure fairness...
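For reference, the workaround amounts to registering a deliberately small busy poll timeout on the ring. A minimal liburing sketch (the 10 usec value is only an illustration of "very small", not a recommendation):

#include <liburing.h>

/* register a deliberately short NAPI busy poll timeout on the ring */
static int register_short_napi_poll(struct io_uring *ring)
{
    struct io_uring_napi napi = {
        .busy_poll_to = 10,     /* microseconds; kept very small on purpose */
        .prefer_busy_poll = 1,
    };

    return io_uring_register_napi(ring, &napi);
}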

lano1106 commented 2 months ago

I did an experiment... I disabled NAPI busy polling on my rings... The attached ring is still starving and is not serviced fairly... setup: I configure the NIC to have a single NAPI queue with ethtool -L enp39s0 combined 1. My program starts at 2024-07-26 14:20:37. The ring 1 thread opens its sockets and connects them to the server; the attached ring 2 thread opens its sockets and connects them to the server at the same time.

before the current second at 2024-07-26 14:20:37 ends, ring 1 sockets are all opened, connected, and have completed their HTTP/WebSocket handshakes.

attached ring 2 socket connections are only reported at 2024-07-26 14:21:16, and the server times out and closes its side of the connection. IOW, that setup is inoperable...

I am playing around without fully understanding what I am doing... I have found a setting that does what I want:

result: zero IRQs generated by the NIC device, no local timer interrupts on 2 of my isolated nohz_full cpus (the only one still having local timer interrupts is the SQPOLL CPU)

lano1106 commented 2 months ago

the cpu3 io ring does receive its events but they seem very delayed, which makes operation impossible...

a gro_flush_timeout value of 100000 is still too high. I have reduced it to 10000; this did seem to help, but not enough... I guess that I could keep reducing it until I get good responsiveness...

for the moment, I have put napi_defer_hard_irqs back to 0. NAPI interrupts are generated on all CPUs but at least this is a workable setup...

isilence commented 2 months ago

So, first, I hear that SQPOLL napi busy polling is not the reason for the problem, since you tried without it.

Also, looking at what you were trying, it sounds like you suspect that the problem might be coming from the driver's side, but it could also be an io_uring issue.

Let's try to check the second one. SQPOLL just cycles through all attached rings in the order they were added and tries to execute requests. What we want to check is whether the SQPOLL thread ever gets to the second ring execution in a reasonable amount of time and how much it spins in a single loop for each ring. What tools do we have? bpftrace? Can you try a debug kernel?

Do you have much traffic to keep it busy? Does the SQPOLL thread use 100% of CPU? Are softirqs handled by the same CPU on which SQPOLL runs, or is it randomised?

lano1106 commented 2 months ago

NAPI busy poll is definitely out of the suspect list...

The SQPOLL thread is very close to 100% (99.3-99.7). Ring 1 sockets have a lot of traffic, at least a hundred packets per second. Ring 2 sockets are quieter, but the WebSocket protocol sends a ping once every second when inactive.

The only setting that works is: 1 NAPI device per CPU and napi_defer_hard_irqs set to 0.

As soon as I set napi_defer_hard_irqs high enough to never let the device issue interrupts (ie: 1000), the second attached ring malfunctions.

I have tried napi_defer_hard_irqs 0 and 1 NAPI device bound to CPU0, which would have been acceptable. The goal that I am trying to achieve is to avoid having interrupts on CPU1 and 3.

What is the name for softirq in /proc/interrupts?

as I am writing this, in the current setup that works (napi_defer_hard_irqs:0, 4 NAPI devs),

* Local timer interrupts: active on all cpus, I have no idea why. CPU1,2,3 are isolated, ie: the only task assigned to CPU2 is the SQPOLL thread
* IRQ work interrupts: active on CPU1,2,3 but not 0.

I am not familiar with bpftrace... I will need to look into it... Besides that, the kernel in use has been compiled by me... It is very easy for me to add traces and recompile if you can think of something that would be helpful to trace...

lano1106 commented 2 months ago

Another setup that I want to check whether it works is

this is among the simplest possible setups. I have tried several combinations... I am not sure if I tried this one

this setup does not work at all

about the IRQs with this setup:
* NAPI irq on CPU0
* some "Function call interrupts" on CPU1 (working ring1 CPU)
* On SQPOLL CPU (2):

lano1106 commented 2 months ago

with the setup that works:

* All 4 CPUs received NAPI irqs
* All 4 CPUs have "Local timer interrupts"
* CPU 1,2,3 (not 0) have "IRQ work interrupts"
* SQPOLL CPU (2) has a few "Rescheduling interrupts"

my next experiment will be:

It makes NAPI totally stop generating IRQs as expected, but the attached CPU3 ring2 sockets malfunction... I'll redo it to carefully document the whole interrupt pattern...

lano1106 commented 2 months ago

bpftrace

So, first, I hear that SQPOLL napi busy polling is not the reason for the problem, since you tried without it.

Also, looking at what you were trying, it sounds like you suspect that the problem might be coming from the driver's side, but it could also be an io_uring issue.

Let's try to check the second one. SQPOLL just cycles through all attached rings in the order they were added and tries to execute requests. What we want to check is whether the SQPOLL thread ever gets to the second ring execution in a reasonable amount of time and how much it spins in a single loop for each ring. What tools do we have? bpftrace? Can you try a debug kernel?

Do you have much traffic to keep it busy? Does the SQPOLL thread use 100% of CPU? Are softirqs handled by the same CPU on which SQPOLL runs, or is it randomised?

I am currently looking at bpftrace... wow... this is a tool that I was completely unaware existed. It looks almost like magic... but is it applicable to what we would like to probe?

do you have one-liner examples that you have used to debug io_uring issues?

If I look in io_uring/sqpoll.c, most functions are static... Are these functions probeable with bpftrace?

lano1106 commented 2 months ago

I'll need some help... here are the probes that show up:

$ sudo bpftrace -l | grep io_sq_thread
kfunc:vmlinux:io_sq_thread
kfunc:vmlinux:io_sq_thread_finish
kfunc:vmlinux:io_sq_thread_park
kfunc:vmlinux:io_sq_thread_stop
kfunc:vmlinux:io_sq_thread_unpark
kprobe:io_sq_thread
kprobe:io_sq_thread_finish
kprobe:io_sq_thread_park
kprobe:io_sq_thread_stop
kprobe:io_sq_thread_unpark

can we add more?

if __io_sq_thread() were probeable, a small histogram based on the ctx input param would be very insightful...

lano1106 commented 2 months ago

In https://www.kernel.org/doc/html/latest/bpf/kfuncs.html

it is written:

There are two ways to expose a kernel function to BPF programs, either make an existing function in the kernel visible, or add a new wrapper for BPF.

does that mean that if I would like __io_sq_thread to be probeable, all that would be needed is to remove the function's static attribute and recompile the kernel?

why is the static io_sq_thread exposed? Are there some io_uring bpf wrappers defined somewhere?

isilence commented 2 months ago

In https://www.kernel.org/doc/html/latest/bpf/kfuncs.html

it is written:

There are two ways to expose a kernel function to BPF programs, either make an existing function in the kernel visible, or add a new wrapper for BPF.

does that mean that if I would like __io_sq_thread to be probeable, all that would be needed is to remove the function's static attribute and recompile the kernel?

Yes, that should do the trick if it's not available. In any case I'll write a script for you when I get back to it.

Another question, do you use any multishot requests?

lano1106 commented 2 months ago

It never ceases to amaze me what is possible with software...

yes I am a big multishot user. I use them massively with io_uring_prep_recvmsg_multishot().

on both rings.

kernel v6.9.10 is currently recompiling...

Another idea that popped into my head is to attach ring2 to piggyback on the ring1 sqpoll and benefit from async io done in the background, but it is very easy to run ring2 detached to see if the result is different.

isilence commented 2 months ago

Another idea that popped into my head is to attach ring2 to piggyback on the ring1 sqpoll and benefit from async io done in the background, but it is very easy to run ring2 detached to see if the result is different.

Yeah, that's a good idea

lano1106 commented 2 months ago

I have the function in my probes now:

kfunc:vmlinux:__io_sq_thread
kfunc:vmlinux:io_sq_thread
kfunc:vmlinux:io_sq_thread_finish
kfunc:vmlinux:io_sq_thread_park
kfunc:vmlinux:io_sq_thread_stop
kfunc:vmlinux:io_sq_thread_unpark
kprobe:__io_sq_thread
kprobe:io_sq_thread
kprobe:io_sq_thread_finish
kprobe:io_sq_thread_park
kprobe:io_sq_thread_stop
kprobe:io_sq_thread_unpark

isilence commented 2 months ago

why is the static io_sq_thread exposed? Are there some io_uring bpf wrappers defined somewhere?

It depends on whether the compiler decided to inline the function and compile the symbol out of the binary or not; kprobes don't need any special wrappers. Removing "static" usually works, even better if you stick noinline on the function.
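In other words, something along these lines in io_uring/sqpoll.c (just a sketch; the exact signature depends on the kernel version):

-static int __io_sq_thread(struct io_uring_ctx *ctx, bool cap_entries)
+noinline int __io_sq_thread(struct io_uring_ctx *ctx, bool cap_entries)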

lano1106 commented 2 months ago

I have tried CPU3 ring2 detached from the ring 1 SQPOLL, it seems to be a little bit better but not by much...

I am trying out so many combinations that I almost lose track of what I have just tried. So here is the full setup:

io completions are reported several seconds after they happen on ring2. It is not limited to TCP socket events. As part of the init process, some test communication is performed between the CPU1 thread and the CPU3 thread, in the form of placing a test request in a thread-safe queue and notifying the cpu3 thread by signaling a libev ev_async watcher (this is an eventfd). Even this event, which is unrelated to the NIC, takes several seconds before the CPU3 thread is notified of it...

I am doing a second test with

same bad result... it takes several seconds before the CPU3 thread is notified of the io event.

Here is another detail that I omitted to report: the cpu3 thread, SQPOLL attached or not, is also polling the io_uring. It is passing a timeout value of 0.

isilence commented 2 months ago

I have tried CPU3 ring2 detached from the ring 1 SQPOLL, it seems to be a little bit better but not by much...

Confusing, maybe it's not an io_uring issue. I assume you have enough extra CPUs/cores to run that extra SQPOLL thread, right? Do you have a reproducer?

lano1106 commented 2 months ago

yeah... idk... I have updated the comment that you replied to... io_uring is somehow involved... there are some weird interactions happening... the problem is not limited to network events...

this might have something to do with my nohz_full setting... I have a hard time getting rid of interrupts... despite all my efforts, some interrupts manage to get through... maybe the difference between the cpu1 thread, which works, and the cpu3 thread, which does not, is how successfully the interrupts have been removed on each CPU...

lano1106 commented 2 months ago

I have tried CPU3 ring2 detached from the ring 1 SQPOLL, it seems to be a little bit better but not by much...

Confusing, maybe it's not an io_uring issue. I assume you have enough extra CPUs/cores to run that extra SQPOLL thread, right? Do you have a reproducer?

I am a bit short of cores... I have disabled hyperthreading for the improved cache performance this provides.

I have 3 cores isolated, each with a single task assigned to it that is expected to run close to 100%... and I have CPU0 in charge of managing the rest of the system... IRQs, all the other processes, and the main program threads...

I do not have a reproducer written... I guess that the solution may come down to writing one...

the more I think about the problem, the more I think this might be related to interrupts and how my nohz_full settings interact...

isilence commented 2 months ago

I have 3 cores isolated to have a single task assigned to them that is expected to run close to 100%

Like in 3 threads taking all 3 cores? Or 1 thread/task taking just one CPU/core?

the more I think about the problem, the more I think this might be related to interrupts and how my nohz_full settings interact...

If it's really CPU bound (excluding polling times), then there might be enough problems with busy polling / sqpoll, especially if there are multiple SQPOLL threads trying to consume a single CPU.

One option is to try normal rings without napi polling and see how that behaves. And try to run a bpftrace script [1] with SQPOLL, let's see if it'd tell us anything interesting.

[1] https://gist.github.com/isilence/22a44b18c431b300ff0a39cb417c44e7

lano1106 commented 2 months ago

I have 3 cores isolated to have a single task assigned to them that is expected to run close to 100%

Like in 3 threads taking all 3 cores? Or 1 thread/task taking just one CPU/core?

yes. 3 threads. It is not CPU bound. It is latency bound. To simplify things, my program is a giant set of busy loops. Putting a task to sleep and waking it up when something happens is too slow.

the more I think about the problem, the more I think this might be related to interrupts and how my nohz_full settings interact...

If it's really CPU bound (excluding polling times), then there might be enough problems with busy polling / sqpoll, especially if there are multiple SQPOLL threads trying to consume a single CPU.

One option is to try normal rings without napi polling and see how that behaves. And try to run a bpftrace script [1] with SQPOLL, let's see if it'd tell us anything interesting.

[1] https://gist.github.com/isilence/22a44b18c431b300ff0a39cb417c44e7

I followed a lead that went nowhere. I mentioned the eventfd not reported by io_uring. The situation is more complex than that. libev optimizes away the write to the eventfd if it knows that the loop thread is not waiting for events in the kernel; in that case, it uses an atomic_flag. It came to my mind that perhaps I had a simple memory ordering issue. I checked this hypothesis but it turned out to be invalid. I had forgotten the details of that part of my system since it has been working perfectly for so long... The libev async watcher is not signaled as long as the cpu3 thread's TCP connection is not up... So in that regard, everything works as designed.

Of course, a lot of things can be simplified. CPU3 ring2, can be detached, with no NAPI polling. I can even make my thread block and wait for events inside the kernel.

my usage of io_uring is a two-stage setup. In the first stage, io_uring is used as a select() to perform an SSL handshake. Once the SSL connection is established, the second stage kicks in, where the io_uring usage morphs into multishot recvmsg async io.
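In liburing terms, the two stages look roughly like this (a simplified sketch of the idea rather than my actual code; the helper names are made up, error handling is omitted, and stage 2 assumes a provided buffer ring was registered beforehand for BUF_GROUP):

#include <liburing.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>

#define BUF_GROUP 0     /* illustrative provided-buffer group id */

/* stage 1: use the ring as a select() while the SSL handshake is in progress */
static void arm_handshake_poll(struct io_uring *ring, int fd, int want_write)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_poll_add(sqe, fd, want_write ? POLLOUT : POLLIN);
    io_uring_sqe_set_data64(sqe, (uint64_t)fd);
    io_uring_submit(ring);
}

/* stage 2: once the SSL connection is established, switch to multishot recvmsg */
static void arm_multishot_recvmsg(struct io_uring *ring, int fd, struct msghdr *msg)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_recvmsg_multishot(sqe, fd, msg, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;  /* completions pick buffers from the group */
    sqe->buf_group = BUF_GROUP;
    io_uring_sqe_set_data64(sqe, (uint64_t)fd);
    io_uring_submit(ring);
}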

So what I am witnessing is the following. A TCP connect() is initiated by my WS lib for openssl (outside io_uring). Next, the socket is registered with io_uring to be notified when the socket becomes ready for writing, and this notification takes an eternity (several seconds... or a connect timeout occurs)...

in terms of collecting traces it should be very simple and could be done without being flooded by a massive amount of traces...

the more I think about it... io_uring might have nothing to do with it... it might be the driver or my system config...

I am just a bit unsure how to debug this type of issue...

I will run your script and report back the result!

A new test idea popped into my mind. In my problematic setup of 1 NAPI device pinned on CPU0, would a simple blocking connect initiated on CPU3 work?

if it fails, that would be an irrefutable reproducer for the driver authors...

lano1106 commented 2 months ago

from my last idea, if it is inconclusive, a design for an io_uring reproducer is starting to take shape in my mind...

if my connect() succeeds, I would do the following:

have an io_uring ring on CPU1 consuming a TCP stream (do you know some publicly available service that could be used as a TCP stream source?) while a connect is initiated on CPU3...

lano1106 commented 2 months ago

speaking of interrupts, quick question for you Pavel... I have seen that io_uring_create() explicitly forbids using IORING_SETUP_SQPOLL with IORING_SETUP_COOP_TASKRUN.

there does not seem to be the same test with IORING_SETUP_ATTACH_WQ and IORING_SETUP_COOP_TASKRUN (or it is tested elsewhere)...

is it a valid flag combination? it seems like it is very similar to the first one...

lano1106 commented 2 months ago

Pavel, ok, TBH, because kernel dev is not my main dev activity, the small details of how everything works are somewhat far back in my memory...

I have decided to roll up my sleeves and reopen my 2 kernel reference books:

They both date back to the 2.6 era but they have withstood the test of time on the basics (minus maybe the SLAB chapter recently removed)

Something that I have omitted to mention is that my kernel is built with CONFIG_PREEMPT_NONE.

and I am starting to understand that without any interrupts on a CPU, the NET_TX_SOFTIRQ might have a hard time running. But this makes me wonder why the situation is different for the CPU1 thread.

does the SQPOLL thread code do something on behalf of its creator to invoke the softirqs that it does not do for attached rings?

here are the functions from which data is sent on the network:

static int
iouring_write(EV_P_ int fd, void *addr, int len)
{
  ev_io *w = (ev_io *)anfds [fd].head;
  struct io_uring_sqe *sqe;
  sqe = io_uring_get_sqe(&iouring_ring);
  io_uring_prep_send(sqe, fd, addr, len, 0);
  io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_WRITE, fd,
                                                     anfds [fd].egen, 0));
  io_uring_submit(&iouring_ring);
  return 0;
}

static int
iouring_write_msg(EV_P_ int fd, struct msghdr *msg)
{
  ev_io *w = (ev_io *)anfds [fd].head;
  struct io_uring_sqe *sqe;
  sqe = io_uring_get_sqe(&iouring_ring);
  io_uring_prep_sendmsg(sqe, fd, msg, 0);
  io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_SENDMSG, fd,
                                                     anfds [fd].egen, 0));
  io_uring_submit(&iouring_ring);
  return 0;
}

lano1106 commented 2 months ago

I think that I have found something...

imagine that SQPOLL thread is running on an isolated CPU having no hardware interrupt... When will invoke_softirq() be called on that CPU following a write in the network stack?

AFAIK, a pending NET_TX_SOFTIRQ could wait for a very long time before being serviced...

and I have just come to think about it... if it was possible for SQPOLL thread to explicitly awaken the softirq task, this might reduce io_uring tx latency even more...

even better, instead of waking up the ksoftirqd task, why not call do_softirq() directly from the sqpoll code?

I am starting to have the feeling that my NAPI busy poll setup will work soon...

lano1106 commented 2 months ago

Pavel,

here is the result of running your script... I am not sure what I am looking at...

@dt_to_exec[0xffff95b43444af80, 0xffff95b40c1bb800]: 
[2K, 4K)               7 |@@@@@@@@@@@                                         |
[4K, 8K)               8 |@@@@@@@@@@@@                                        |
[8K, 16K)             10 |@@@@@@@@@@@@@@@                                     |
[16K, 32K)            16 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[32K, 64K)             5 |@@@@@@@                                             |
[64K, 128K)            0 |                                                    |
[128K, 256K)           2 |@@@                                                 |
[256K, 512K)           0 |                                                    |
[512K, 1M)             3 |@@@@                                                |
[1M, 2M)              15 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[2M, 4M)               8 |@@@@@@@@@@@@                                        |
[4M, 8M)               9 |@@@@@@@@@@@@@@                                      |
[8M, 16M)             19 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[16M, 32M)            16 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[32M, 64M)            26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |
[64M, 128M)           33 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[128M, 256M)          30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[256M, 512M)          24 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[512M, 1G)            10 |@@@@@@@@@@@@@@@                                     |
[1G, 2G)               5 |@@@@@@@                                             |
[2G, 4G)              28 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[4G, 8G)               3 |@@@@                                                |
[8G, 16G)              0 |                                                    |
[16G, 32G)            30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |

@dt_to_exec[0xffff95b4344497c0, 0xffff95b40b115800]: 
[2K, 4K)              47 |@@@@@                                               |
[4K, 8K)             270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[8K, 16K)            180 |@@@@@@@@@@@@@@@@@@@                                 |
[16K, 32K)           473 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            79 |@@@@@@@@                                            |
[64K, 128K)            4 |                                                    |
[128K, 256K)           1 |                                                    |

@dt_to_exec[0xffff95b43444df00, 0xffff95b40fc98000]: 
[512, 1K)             29 |                                                    |
[1K, 2K)          564265 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)           55692 |@@@@@                                               |
[4K, 8K)           12233 |@                                                   |
[8K, 16K)             77 |                                                    |

fyi, I have written a patch...

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..2a5cac1957d8 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -319,6 +319,10 @@ static int io_sq_thread(void *data)
                        if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
                                sqt_spin = true;
                }
+               if (local_softirq_pending() & (NET_TX_SOFTIRQ|NET_RX_SOFTIRQ)) {
+                       do_softirq();
+                       sqt_spin = true;
+               }
                if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
                        sqt_spin = true;

my first impression was that it was a full success... I have modified my calls to connect() to have io_uring do them so that the softirqs on the sqpoll cpu get called.

CPU 3 TCP connections have completed super fast... but something is still not correct... I am supposed to see several requests per second going out... Instead, request sending appears to be stalled and requests appear to be sent out only every 3 seconds (I have a periodic timer to communicate with the CPU3 thread)

lano1106 commented 2 months ago

it is getting late now....

I think that I know what I need to do now... I must make sure that there is only 1 NAPI device, because the NAPI device is assigned at socket creation and not based on which CPU the io is performed from... So if I leave 4 NAPI devices, my CPU3 sockets might be assigned to NAPI device 3...

however, you cannot control which CPU a single NAPI device is going to be assigned to; it is going to be CPU0... Therefore, change number 2 in my setup will be to make the sqpoll cpu match the NAPI device's CPU. If they do not match... calling do_softirq() for another CPU won't help my case...

that being said... I am not sure if my patch will have a general appeal or if it will stay a hack for my particular case...

I would think that in more standard setups, it might significantly improve network latency as sqpoll would do the softirq instead of having to wait for the next hardware interrupt before the softirqs are serviced...

isilence commented 2 months ago

speaking of interrupts, quick question for you Pavel... I have seen that io_uring_create() explicitly forbids using IORING_SETUP_SQPOLL with IORING_SETUP_COOP_TASKRUN.

IORING_SETUP_COOP_TASKRUN doesn't force the thread into the kernel when user code is running, but will do its job with the next syscall. SQPOLL doesn't run any userspace code by definition, so the flag doesn't make sense. You can say that SQPOLL already has a similar optimisation inside.

there does not seem to be the same test with IORING_SETUP_ATTACH_WQ and IORING_SETUP_COOP_TASKRUN (or it is tested elsewhere)... is it a valid flag combination? it seems like it is very similar to the first one...

IORING_SETUP_ATTACH_WQ attempts to share the in-kernel thread pool (io-wq). If there is also the SQPOLL flag, it'll try to share the SQPOLL thread as well. IOW, without passing the SQPOLL flag, IORING_SETUP_ATTACH_WQ would not turn it into SQPOLL. I don't see any problem with that combination of flags.

isilence commented 2 months ago

Something that I have omitted to mention is that my kernel is built with CONFIG_PREEMPT_NONE.

Which might be a problem when tasks fight for CPU time. Not an explanation for the problem, but let's say there is just one CPU, SQPOLL and the userspace thread fight for it. SQPOLL executes some work and tries to notify the user task, but instead of yielding CPU it goes polling.

and I am starting to understand that without any interrupts on a CPU, the NET_TX_SOFTIRQ might have a hard time running. But this makes me wonder why the situation is different for the CPU1 thread.

CONFIG_PREEMPT_NONE is about preemption of tasks running in kernel, like executing a syscall or SQPOLL thread. It shouldn't affect softirqs, but I'm unsure how it interacts with threaded irqs.

does the SQPOLL thread code do something on behalf of its creator to invoke the softirqs that it does not do for attached rings?

No, nothing special. If there is some irq affinity at play though and irq comes back to the same CPU a request was issued from, then it'll come back to the SQPOLL's CPU. Worst case that CPU serves all irqs, but all depends on affinities, task affinities, whether they jump CPUs and such.

isilence commented 2 months ago

imagine that SQPOLL thread is running on an isolated CPU having no hardware interrupt... When will invoke_softirq() be called on that CPU following a write in the network stack?

When the NIC is done putting the packet onto the wire it'll fire a hardirq; you can assume that that's usually when softirq is run as well. If it's threaded though, the hardirq handler tells the scheduler and it's up to the scheduler to decide when to run softirq processing (but it should be of the highest priority).

AFAIK, a pending NET_TX_SOFTIRQ could wait for a very long time before being serviced...

fwiw, often drivers service the tx part in NET_RX_SOFTIRQ.

and I have just come to think about it... if it was possible for SQPOLL thread to explicitly awaken the softirq task, this might reduce io_uring tx latency even more...

even better, instead of waking up the ksoftirqd task, why not call do_softirq() directly from the sqpoll code?

That's what napi busy polling attempts to do, executing stuff that softirq would run otherwise.

isilence commented 2 months ago
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..2a5cac1957d8 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -319,6 +319,10 @@ static int io_sq_thread(void *data)
                        if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
                                sqt_spin = true;
                }
+               if (local_softirq_pending() & (NET_TX_SOFTIRQ|NET_RX_SOFTIRQ)) {
+                       do_softirq();
+                       sqt_spin = true;
+               }
                if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
                        sqt_spin = true;

my first impression was that it was a full success... I have modified my calls to connect() to have io_uring do it so that the softirqs on the sqpoll cpu gets called.

I wonder if that's it, we schedule ksoftirqd task for processing but it cannot get CPU time because SQPOLL is polling and it's not preemptible. Can you get a perf profile of the SQPOLL task and/or the CPU it's running on?

isilence commented 2 months ago

here is the result of running your script... I am not sure what I am looking at...

@dt_to_exec tells how much time (nanosec) passes between softirq processing a packet and io_uring going into the socket trying to get it and return something to the user space. That's it unless there is a race in the script.

edit: that's actually for all request types, I should've filtered out anything but rx, and also marked if it's an SQPOLL thread.

I'm curious why there is no @time_[in,out]_tw, a perf profile might shed some light...

lano1106 commented 2 months ago

I have performed the experiment that I described in the last entry... It seems like despite all my efforts to avoid having network softirqs raised on the isolated cpus, they are still raised, and without hardware interrupts on those CPUs, the networking io is having problems...

maybe I can confirm this by removing my nohz_full setting and seeing what it does...

the end goal is to remove those frequent 20-50 usec interruptions from the latency-sensitive threads...

lano1106 commented 2 months ago

Something that I have omitted to mention is that my kernel is built with CONFIG_PREEMPT_NONE.

Which might be a problem when tasks fight for CPU time. Not an explanation for the problem, but let's say there is just one CPU, SQPOLL and the userspace thread fight for it. SQPOLL executes some work and tries to notify the user task, but instead of yielding CPU it goes polling.

and I am starting to understand that without any interrupts on a CPU, the NET_TX_SOFTIRQ might have a hard time running. But this makes me wonder why the situation is different for the CPU1 thread.

CONFIG_PREEMPT_NONE is about preemption of tasks running in kernel, like executing a syscall or SQPOLL thread. It shouldn't affect softirqs, but I'm unsure how it interacts with threaded irqs.

does the SQPOLL thread code do something on behalf of its creator to invoke the softirqs that it does not do for attached rings?

No, nothing special. If there is some irq affinity at play though and irq comes back to the same CPU a request was issued from, then it'll come back to the SQPOLL's CPU. Worst case that CPU serves all irqs, but all depends on affinities, task affinities, whether they jump CPUs and such.

I get what you are saying, but with no preemption, if there is more than 1 runnable task on a cpu, the scheduler will do its job when the running task's slice ends... my understanding is that with preemption you reduce the scheduling latency when a task becomes runnable, but this comes at the cost of a non-negligible overhead.

what I am trying to achieve is to get rid of that overhead and compensate for it by manually assigning the resources to my threads.

my point is that if I reserve a whole NOHZ CPU for a single thread, there is no point in paying the preemption overhead...

also keep in mind that my SQPOLL thread is running on an isolated CPU, 100% dedicated for its own usage. All the userspace processes using it are located on other CPUs...

that being said, making a test with preemption to see if this makes any difference is probably a good idea.

lano1106 commented 2 months ago

speaking of interrupts, quick question for you Pavel... I have seen that io_uring_create() explicitly forbids using IORING_SETUP_SQPOLL with IORING_SETUP_COOP_TASKRUN.

IORING_SETUP_COOP_TASKRUN doesn't force the thread into the kernel when user code is running, but will do its job with the next syscall. SQPOLL doesn't run any userspace code by definition, so the flag doesn't make sense. You can say that SQPOLL already has a similar optimisation inside.

there does not seem to be the same test with IORING_SETUP_ATTACH_WQ and IORING_SETUP_COOP_TASKRUN (or it is tested elsewhere)... is it a valid flag combination? it seems like it is very similar to the first one...

IORING_SETUP_ATTACH_WQ attempts to share the in-kernel thread pool (io-wq). If there is also the SQPOLL flag, it'll try to share the SQPOLL thread as well. IOW, without passing the SQPOLL flag, IORING_SETUP_ATTACH_WQ would not turn it into SQPOLL. I don't see any problem with that combination of flags.

I am learning something new with your explanation. I was not aware that attaching a ring to one that does not have a SQPOLL thread was possible.

then, in that case, for the record, my problematic ring2 thread is configured with IORING_SETUP_ATTACH_WQ|IORING_SETUP_COOP_TASKRUN and it attaches to my ring1, which has SQPOLL.

what libev does with my ring2 thread is call the io_uring backend with a timeout value of 0 to achieve a busy loop.

so with my setting, idk if liburing will actually make a syscall or if it will skip it... next, I know that at each iteration libev calls gettimeofday() to update its current time... again, I am not sure if this is considered a syscall because of the vDSO...

so maybe this could be an interesting lead to investigate...
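For what it's worth, a zero-timeout check on an SQPOLL ring can be a pure userspace peek of the CQ ring, with no syscall at all. A rough liburing sketch of what such an iteration boils down to (illustrative, not libev's actual code; handle_cqe is a made-up handler):

#include <liburing.h>

void handle_cqe(struct io_uring_cqe *cqe);  /* hypothetical, defined elsewhere */

/* drain whatever completions are already in the CQ ring without entering the kernel */
static unsigned drain_cq(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;

    io_uring_for_each_cqe(ring, head, cqe) {
        handle_cqe(cqe);
        seen++;
    }
    io_uring_cq_advance(ring, seen);
    return seen;
}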

lano1106 commented 2 months ago

I'll need to reread your explanation or return to the source code with your explanation in mind... In my mind, IORING_SETUP_ATTACH_WQ was implying attaching to an SQPOLL ring...

I am not setting the IORING_SETUP_SQPOLL flag for ring2...

so everywhere I say that ring2 is attached to ring1's SQPOLL, maybe it is not... and this would be a simple explanation for why this thread is having problems with its io with no HW IRQ, while the ring1 thread is fine with the same CPU setup...

/*
 * getLoopBE()
 */
unsigned WS_Private::getLoopBE(struct io_uring_params *params,
                                      io_uring_probe *probe) const noexcept
{
    // Uncomment to have a SQPOLL thread for the Private WS connection.

    if (global_t::getInstance().usePubSqPoll()) {
/*
        params->flags         |= IORING_SETUP_SQPOLL;

        // Constant defined in ev_util.h
        params->sq_thread_idle = Kraken::SQ_THREAD_IDLE_VAL;

        params->flags |= IORING_SETUP_SQ_AFF;
        params->sq_thread_cpu = get_private_sq_thread_cpu();
*/
        // Uncomment to attach WQ to the master
        params->flags |= IORING_SETUP_ATTACH_WQ;
    }

    params->flags |= IORING_SETUP_COOP_TASKRUN;
    return Parent::getLoopBE(params, probe);
}

lano1106 commented 2 months ago

wow, you are right!

you need to specify IORING_SETUP_ATTACH_WQ|IORING_SETUP_SQPOLL...

I got my invalid understanding by browsing io_uring/sqpoll.c and seeing IORING_SETUP_ATTACH_WQ referred to in io_sq_offload_create() a long time ago... I never gave it a second thought since my config seemed to work perfectly.

that being said, I guess maybe the io_uring_setup man page could be clarified... without digging into the code, IMHO it is not obvious that you can combine IORING_SETUP_ATTACH_WQ|IORING_SETUP_SQPOLL together... or what the result will be if you do...

lano1106 commented 2 months ago

I have an excellent news!

with doing this: params->flags |= IORING_SETUP_ATTACH_WQ|IORING_SETUP_SQPOLL;

it did fully fix my problem. I got rid of every single interrupt on 2 out of 4 CPUs and networking works like a charm without those pesky 20-50uSec interruptions.
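For the record, the ring2 creation now boils down to something like this (simplified from my code; the idle value is just an example):

#include <liburing.h>

/* create ring2 so that it shares ring1's SQPOLL thread instead of spawning its own */
static int create_attached_ring(struct io_uring *ring2, struct io_uring *ring1,
                                unsigned entries)
{
    struct io_uring_params p = { 0 };

    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
    p.wq_fd = ring1->ring_fd;       /* the ring whose SQPOLL thread we piggyback on */
    p.sq_thread_idle = 1000;        /* milliseconds before the shared thread may sleep */

    return io_uring_queue_init_params(entries, ring2, &p);
}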

I am observing something strange however... now that the SQPOLL thread effectively manages 2 rings, its CPU usage dropped from 99.7% to about 56%... This means that it goes to sleep much more often... Any idea why having an extra ring to manage is causing this?

How do I fix this? by increasing sq_thread_idle?

I really need to forbid the SQPOLL thread from going to sleep...

the first interrupt-free results are in. The maximums are much worse... I guess it is because of the sqpoll thread going to sleep half of the time...

time in nanosecs

with interrupt:
avg:684.320, max:51441

without interrupt:
avg:651.340, max:316728
P     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
0    8968 lano1106  20   0 1318.1m 105.0m  15.1m R  55.3   0.7  41:55.18 iou-sqp-8932                                       
0       2 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 kthreadd                                           
0       3 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 pool_workqueue_release                             
0       4 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-rcu_gp                                   
0       5 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-sync_wq                                  
0       6 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-slub_flushwq                             
0       7 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-netns                                    
0       8 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:0-rcu_gp                                 
0       9 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri                        
0      10 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:1-rcu_gp                                 
0      12 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-mm_percpu_wq                             
0      14 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 rcu_tasks_rude_kthread                             
0      15 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 rcu_tasks_trace_kthread                            
0      16 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.48 ksoftirqd/0                                        
0      17 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.07 rcu_sched                                          
0      18 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:01.22 rcuog/0                                            
0      19 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/0                                            
0      20 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcu_exp_par_gp_kthread_worker/0                    
0      21 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcu_exp_gp_kthread_worker                          
0      22 root      rt   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 migration/0                                        
0      23 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 idle_inject/0                                      
0      24 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 cpuhp/0                                            
0      31 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/1                                            
0      38 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:01.23 rcuog/2                                            
0      39 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/2                                            
0      46 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/3                                            
0      59 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/9-acpi                                         
0      76 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/25-pciehp                                      
0      78 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/27-pciehp                                      
0      80 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/29-pciehp                                      
0      83 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/31-pciehp                                      
0      85 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/33-pciehp                                      
0      87 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/35-pciehp                                      
0      89 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/37-pciehp                                      
0      91 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/39-pciehp                                      
0      93 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/41-pciehp                                      
0      95 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/43-pciehp                                      
0      97 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/45-pciehp                                      
0      99 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/47-pciehp                                      
0     101 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/49-pciehp                                      
0     103 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/51-pciehp                                      
0     105 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/53-pciehp                                      
0     107 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/55-pciehp                                      
0     109 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/57-pciehp                                      
0     111 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/59-pciehp                                      
0     113 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/61-pciehp                                      
0     970 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:1H                                       

I know that ring1 is much busier than ring2.

ring2 not so much... and its low activity could make sqpoll become idle... but my expectation was that as long as there is at least one busy ring, a multi-ring sqpoll thread would continue to spin... not the other way around...

I think that I get it...

            int ret = __io_sq_thread(ctx, cap_entries);

            if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
                sqt_spin = true;

does exactly what I expect. It is just that maybe my sq_thread_idle is on the edge of timing out, but a new event always sneaks in before the timeout... Having to service a second, less busy ring adds enough delay to push the thread over the timeout edge... I'll try to increase my idle value... This should be an easy issue to address!

lano1106 commented 2 months ago

now that my setup is working, I can circle back to my initial concern...

I am not seeing io_napi_busy_loop_should_end() considering any new pending sqe entries for stopping the busy polling, and if SQPOLL is managing multiple rings, it might also need to look into all of them...

This concern of mine is making me use very small busy poll values, where I would feel comfortable being more generous if I knew that polling would stop as soon as there are new SQEs coming in on any managed ring...

lano1106 commented 2 months ago

imagine that SQPOLL thread is running on an isolated CPU having no hardware interrupt... When will invoke_softirq() be called on that CPU following a write in the network stack?

When the NIC is done putting the packet onto the wire it'll fire a hardirq; you can assume that that's usually when softirq is run as well. If it's threaded though, the hardirq handler tells the scheduler and it's up to the scheduler to decide when to run softirq processing (but it should be of the highest priority).

AFAIK, a pending NET_TX_SOFTIRQ could wait for a very long time before being serviced...

fwiw, often drivers service the tx part in NET_RX_SOFTIRQ.

and I have just come to think about it... if it was possible for SQPOLL thread to explicitly awaken the softirq task, this might reduce io_uring tx latency even more... even better, instead of waking up the ksoftirqd task, why not call do_softirq() directly from the sqpoll code?

That's what napi busy polling attempts to do, executing stuff that softirq would run otherwise.

you are 100% right, and this is what the ENA driver is doing, now that you mention it:

static int ena_io_poll(struct napi_struct *napi, int budget)
{
...
    tx_work_done = ena_clean_tx_irq(tx_ring, tx_budget);
    /* On netpoll the budget is zero and the handler should only clean the
     * tx completions.
     */
    if (likely(budget))
        rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);
...
}

I had a blind spot on that fact while searching for why my connect()s were not going through... I associated busy polling exclusively with RX...

so that means my patch attempt is useless...

lano1106 commented 2 months ago
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..2a5cac1957d8 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -319,6 +319,10 @@ static int io_sq_thread(void *data)
                        if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
                                sqt_spin = true;
                }
+               if (local_softirq_pending() & (NET_TX_SOFTIRQ|NET_RX_SOFTIRQ)) {
+                       do_softirq();
+                       sqt_spin = true;
+               }
                if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
                        sqt_spin = true;

my first impression was that it was a full success... I have modified my calls to connect() to have io_uring do it so that the softirqs on the sqpoll cpu gets called.

I wonder if that's it, we schedule ksoftirqd task for processing but it cannot get CPU time because SQPOLL is polling and it's not preemptible. Can you get a perf profile of the SQPOLL task and/or the CPU it's running on?

ok... I think that I get what you mean... are you saying that the scheduler will treat the io_uring kernel thread differently than a regular user process and it will not stop it unless the kernel is preemptible?

if it is the case, it is a valid concern... OTOH, do_softirq() has some sort of budget system... It will iterate through the softirqs and service them, but it will limit the number of iterations it does before waking ksoftirqd as they keep popping back (I think it is 8), so even if ksoftirqd was never scheduled, the softirqs would continue to be serviced...

that being said, with busy looping enabled, my patch idea turned out to be a useless idea, as was pointed out...

Now that my setup is running, I guess that getting a perf profile for sqpoll and/or its CPU is not needed anymore, and tbh this request is pushing me out of my comfort zone... I might try to produce it nonetheless for the possible insights it might provide and to learn a new tool/skill in my kernel mastery...

lano1106 commented 2 months ago

here is the result of running your script... I am not sure what I am looking at...

@dt_to_exec tells how much time (nanosec) passes between softirq processing a packet and io_uring going into the socket trying to get it and return something to the user space. That's it unless there is a race in the script.

edit: that's actually for all request types, I should've filtered out anything but rx, and also marked if it's an SQPOLL thread.

I'm curious why there is no @time_[in,out]_tw, a perf profile might shed some light...

ok thx for the explanation... it somehow makes sense of what I am looking at... and there is an output for every ring in use... I effectively have a 3rd ring in my system. I just omitted mentioning it because it is much less critical and is working just fine... I guess that the output showing the largest variance is the ring having issues...

I am not sure what you mean when you talk about @time_[in,out]_tw... Was it something that was supposed to be present in the output but wasn't there, or something that could be added to the script for more insight?

anyway, I guess this has become irrelevant now that the mystery is solved...

but I can tell you that your bpf mastery is impressive... to my non-initiated eye, unless I spend some time studying it, what it does is not obvious...

lano1106 commented 2 months ago

now that my setup is working, I can circle back to my initial concern...

I am not seeing io_napi_busy_loop_should_end() considering any new pending sqe entries for stopping the busy polling, and if SQPOLL is managing multiple rings, it might also need to look into all of them...

This concern of mine is making me use very small busy poll values, where I would feel comfortable being more generous if I knew that polling would stop as soon as there are new SQEs coming in on any managed ring...

what do you think of doing that?

diff --git a/io_uring/napi.c b/io_uring/napi.c
index 4fd6bb331e1e..2a29cec2d219 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -130,6 +130,8 @@ static bool io_napi_busy_loop_should_end(void *data,
                return true;
        if (io_should_wake(iowq) || io_has_work(iowq->ctx))
                return true;
+       if (io_sqring_entries(iowq->ctx))
+               return true;
        if (io_napi_busy_loop_timeout(net_to_ktime(start_time),
                                      iowq->napi_busy_poll_dt))
                return true;

this addresses the concern for a single-ring sqpoll...

lano1106 commented 2 months ago

forget the last comment. I have just understood that sqpoll is not using io_napi_busy_loop_should_end() and totally ignores the napi_busy_poll_dt setting besides testing that it is not zero...

it takes me a lot of time to figure things out, but I always get there in the end...

lano1106 commented 2 months ago

yet... something is not right... How could sqpoll stop spinning with NAPI busy poll enabled, with rings permanently having 20 and 2 sockets respectively on the same NAPI device?

quick question... does the current stale detection system take into account MULTISHOT requests?

in the initial design... this is something that Hao and I thought about... I am not sure if this was kept when Stefan took over the feature... I assumed it had been, but what I am seeing makes me doubt it... I'll double check this... I'll let you know if I find something...

I think recv MULTISHOT did not exist yet when I worked on the NAPI feature, but I did place 1 io_add_napi() call in io_poll_check_events() that is not there anymore... multishot poll was there back then.

It was there in v5: https://lore.kernel.org/netdev/20221121191437.996297-2-shr@devkernel.io/

it has been removed with no explanation in v6: https://lore.kernel.org/netdev/20230201222254.744422-2-shr@devkernel.io/

I am giving this a shot:

diff --git a/io_uring/poll.c b/io_uring/poll.c
index 0a8e02944689..1a0ba13bb7f4 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -327,6 +327,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
                                io_req_set_res(req, mask, 0);
                                return IOU_POLL_REMOVE_POLL_USE_RES;
                        }
+                       io_napi_add(req);
                } else {
                        int ret = io_poll_issue(req, ts);
                        if (ret == IOU_STOP_MULTISHOT)

I will report back how it goes

lano1106 commented 2 months ago

ok... I just restarted my VPS... I cannot say if the last patch is good or bad, but at best it is insufficient. The sqpoll thread is capped at 55%, while I would expect it to be all in with NAPI busy polling...

lano1106 commented 2 months ago

after looking at io_poll_check_events, I have figured that a better location to add io_napi_add() was:

diff --git a/io_uring/poll.c b/io_uring/poll.c
index 0a8e02944689..1f63b60e85e7 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -347,6 +347,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
                v &= IO_POLL_REF_MASK;
        } while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);

+       io_napi_add(req);
        return IOU_POLL_NO_ACTION;
 }

but I find this function very opaque to the non-initiated... I have figured out that the only cases where the code can reach the return statement at the end of the function, outside the loop, are either a spurious wakeup or multishot.

That being said, it did not change anything in my situation... sqpoll CPU usage remains stuck at 54%. So I kept looking around, and this resulted in many more questions than answers...

there seem to be 2 types of multishot:

* REQ_F_APOLL_MULTISHOT

* the other type.

I am not sure if this is documented somewhere, but it is not in the function. All that I have found was 'fast poll multishot mode' in io_uring_types.h... So what does the A mean? Why not FPOLL for Fast?

next, there is this while loop, and it is unclear why it is needed. It seems like some sort of lock-free protection against concurrent access, but it is unclear where this concurrency comes from...

            /*
             * We got woken with a mask, but someone else got to
             * it first. The above vfs_poll() doesn't add us back
             * to the waitqueue, so if we get nothing back, we
             * should be safe and attempt a reissue.
             */
            if (unlikely(!req->cqe.res)) {
                /* Multishot armed need not reissue */
                if (!(req->apoll_events & EPOLLONESHOT))
                    continue;
                return IOU_POLL_REISSUE;
            }

there is a comment explaining the situation, but I have a hard time convincing myself that this is not causing an infinite loop.

next, there is req->poll_refs

it is not clear at all what this is... from the name, I would assume that it is some sort of ref count... but it also looks like some sort of bitmap... the first thing done with it is to check whether it is equal to 1, but then afterwards some sort of bitmask operation is performed on it.

next, there is what the caller will do with the return value.

* IOU_POLL_NO_ACTION: the poll remains active and armed, and io_poll_task_func() can be called back for it
* if NOT IOU_POLL_REQUEUE: the poll is unarmed
* IOU_POLL_REQUEUE: I get the idea of fairness, but I really do not get how this is achieved and why it works. That definitely deserves a small explanatory comment to help the reader

the poll remains armed. the request is placed in a task_work list that will most likely be processed by the sqpoll thread...

which will call io_poll_task_func()?

So I miss the point of what REQUEUE means or what it achieves exactly...

I have figured out that IOU_POLL_REQUEUE is returned every MULTISHOT_MAX_RETRY (32) completions by io_recv_finish()... maybe because it is a relatively short period... if REQUEUE meant unarming/rearming the poll, this could have explained why the call to io_napi_add() in io_poll_check_events() was eventually dropped, but this does not appear to be the case... I have no clue what REQUEUE does...

I feel lost, confused and alone... I need help on this...

this made me wonder if my rings were actually configured for NAPI polling despite my strong belief that they are... there is no way to validate AFAIK... the NAPI info is not printed in the io_uring_show_fdinfo() function...

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index b1e0e0d85349..092cd3d1e1b3 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -221,7 +221,17 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
                           cqe->user_data, cqe->res, cqe->flags);

        }
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+       if (ctx->napi_enabled) {
+               seq_puts(m, "NAPI:\tenabled\n");
+               seq_printf(m, "napi_busy_poll_dt:\t%u\n", ctx->napi_busy_poll_dt);
+               if (ctx->napi_prefer_busy_poll)
+                       seq_puts(m, "napi_prefer_busy_poll:\ttrue\n");
+               else
+                       seq_puts(m, "napi_prefer_busy_poll:\tfalse\n");
+       } else
+               seq_puts(m, "NAPI:\tdisabled\n");
+#endif
        spin_unlock(&ctx->completion_lock);
 }
 #endif

isilence commented 2 months ago

ok thx for the explanation... it somehow makes sense of what I am looking at... and there is an output for every ring in use... I effectively have a 3rd ring in my system. I just omitted mentioning it because it is much less critical and is working just fine... I guess that the output showing the largest variance is the ring having issues...

I am not sure what you mean when you talk about @time_[in,out]_tw... Was it something that was supposed to be present in the output but wasn't there, or something that could be added to the script for more insight?

It should have printed some more info about task_work, which is how multishots are normally run. And it didn't, maybe because it has never hit the path, or maybe I just mixed up functions.

isilence commented 2 months ago

ok... I think that I get what you mean... are you saying that the scheduler will treat the io_uring kernel thread differently than a regular user process and it will not stop it unless the kernel is preemptible?

IIRC, PREEMPT_NONE doesn't change userspace preemption. Regardless, it's a bit different in the kernel, and with PREEMPT_NONE it can only be preempted at a limited number of points. Might not be a problem, we have some handling; I hope waking ksoftirqd will set TIF_NEED_RESCHED for the SQPOLL thread so we can see it and yield the CPU.

if it is the case, it is a valid concern... OTOH, do_softirq() has some sort of budget system... It will iterate through the softirqs and service them, but it will limit the number of iterations it does before waking ksoftirqd as they keep popping back (I think it is 8), so even if ksoftirqd was never scheduled, the softirqs would continue to be serviced...

AFAIR, you can also force all softirqs to ksoftirqd

Now that my setup is running, I guess that getting a perf profile for sqpoll and/or its CPU is not needed anymore, and tbh this request is pushing me out of my comfort zone... I might try to produce it nonetheless for the possible insights it might provide and to learn a new tool/skill in my kernel mastery...

Usually it's pretty easy

perf record --inherit -g -- <command to run your app> # or attach by pid -p <pid>
perf report > report.txt

isilence commented 2 months ago

quick question... does the current stale detection system take into account MULTISHOT requests?

There is, apparently. Need to look through, but it seems it removes the napi entry from polling 60s after it was added. And I don't think it's renewed for multishots, you're right on this one.

after looking at io_poll_check_events, I have figured that a better location to add io_napi_add() was:

diff --git a/io_uring/poll.c b/io_uring/poll.c
index 0a8e02944689..1f63b60e85e7 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -347,6 +347,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
                v &= IO_POLL_REF_MASK;
        } while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);

+       io_napi_add(req);
        return IOU_POLL_NO_ACTION;
 }

Right, there should be something like that.

but I find this function very opaque to the non-initiated... I have figured out that the only cases where the code can reach the return statement at the end of the function, outside the loop, are either a spurious wakeup or multishot.

That being said, it did not change anything in my situation... sqpoll CPU usage remains stuck at 54%. So I kept looking around, and this resulted in many more questions than answers...

there seem to be 2 types of multishot:

* REQ_F_APOLL_MULTISHOT

* the other type.

"the other type" is IORING_OP_POLL_ADD, REQ_F_APOLL_MULTISHOT is everything else like recv and send.

I am not sure if this is documented somewhere, but it is not in the function. All that I have found was 'fast poll multishot mode' in io_uring_types.h... So what does the A mean? Why not FPOLL for Fast?

Because sometimes I catch myself thinking that io_uring is the wild west of naming. "A" probably stands for "async". In practice it means that io_uring can go poll a request, like poll(2) but inside the kernel and then execute when there is data / etc.

...

So I miss the point of what REQUEUE means or what it achieves exactly...

That's throttling; it avoids actual infinite loops when the device is giving you packets faster than this loop can handle.

I have figured out that IOU_POLL_REQUEUE is returned every MULTISHOT_MAX_RETRY (32) completions by io_recv_finish()... maybe because it is a relatively short period... if REQUEUE meant unarming/rearming the poll, this could have explained why the call to io_napi_add() in io_poll_check_events() was eventually dropped, but this does not appear to be the case... I have no clue what REQUEUE does...

I feel lost, confused and alone... I need help on this...

this made me wonder if my rings were actually configured for NAPI polling despite my strong belief that they are... there is no way to validate AFAIK... the NAPI info is not printed in the io_uring_show_fdinfo() function...

We can add that. If you look at a perf profile you should also be able to find some traces of polling.

lano1106 commented 2 months ago

IIRC, PREEMPT_NONE doesn't change userspace preemption. Regardless, it's a bit different in the kernel, and with PREEMPT_NONE it can only be preempted at a limited number of points. Might not be a problem, we have some handling; I hope waking ksoftirqd will set TIF_NEED_RESCHED for the SQPOLL thread so we can see it and yield the CPU.

I backed off from the monkey business of wanting to call do_softirq(). The main problem was that the sqpoll thread was not doing the i/o for the ring that I thought was attached to it. It was because of a misconfiguration on my part.

but the whole exchange about the scheduler treating kernel threads differently than userspace processes made me doubt what I thought I knew... There must be a lot of subtleties that your greater kernel dev experience makes you see that I don't, but I did return to my 'Understanding the Linux Kernel' book and according to it, from the point of view of the scheduler, they are treated the same way. The 2 main differences between a user process and a kernel thread are:

  1. kernel thread never go in User mode
  2. they never access memory below PAGE_OFFSET