axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

RCU calls made by the NAPI busy poll code generate context switches occupying 50% of the time of the CPU hosting the sqp thread #1190

Closed lano1106 closed 2 months ago

lano1106 commented 2 months ago

When SQPOLL busy polls, it can totally starve any other attached rings. If those other rings are also configured for busy polling, then by not servicing them for some time the IRQs get re-enabled, which pumps in new events that SQPOLL will notice and which will interrupt the busy poll (if you play with the napi_defer_hard_irqs and gro_flush_timeout settings).

However, if there is only 1 NAPI device shared by all the io_uring instances (by doing something like: ethtool -L enp39s0 combined 1),

then polling that device for one io_uring for a very long time will also inhibit the device from generating the interrupts that would normally break the busy loop so that sqpoll services the other rings...

My current workaround is to specify a very small busy-poll interval... I currently have no better idea, but the ideal solution would be to loop through all the attached busy-polling io rings in the same way that all the NAPI devices of the context are iterated, to ensure fairness...
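For reference, the workaround boils down to registering a tiny busy-poll budget on the ring. A minimal sketch, assuming liburing's NAPI registration API (io_uring_register_napi() and struct io_uring_napi); the 1-microsecond value is just an example:

    #include <liburing.h>

    /* register a very small NAPI busy-poll interval so one device cannot
     * monopolize the sqpoll thread for long */
    static int set_short_napi_poll(struct io_uring *ring)
    {
        struct io_uring_napi napi = {
            .busy_poll_to = 1,      /* busy-poll budget, in microseconds */
            .prefer_busy_poll = 1,
        };

        /* returns 0 on success, -errno on failure */
        return io_uring_register_napi(ring, &napi);
    }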

lano1106 commented 2 months ago

thx a lot for all your answers and your help...

I am about to collect perf data...

lano1106 commented 2 months ago

I have just tested my small fdinfo patch and NAPI polling is correctly set:

ip-172-31-39-89 /proc/772/fdinfo # cat 14
pos:    0
flags:  02000002
mnt_id: 16
ino:    10243
SqMask: 0xff
SqHead: 633
SqTail: 633
CachedSqHead:   633
CqMask: 0x3fff
CqHead: 430250
CqTail: 430250
CachedCqTail:   430250
SQEs:   0
CQEs:   0
SqThread:   885
SqThreadCpu:    0
SqTotalTime:    52793826
SqWorkTime: 3590465
UserFiles:  0
UserBufs:   0
PollList:
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=6, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=6, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
  op=10, task_works=0
CqOverflowList:
NAPI:   enabled
napi_busy_poll_to:  1
napi_prefer_busy_poll:  true

the one thing that I find odd is the big difference between SqTotalTime and SqWorkTime...

does that look normal to you, Pavel?

lano1106 commented 2 months ago

is there any way to get symbol names from the perf output?

    17.97%     0.00%  iou-sqp-772   [unknown]              [k] 0000000000000000
            |
            ---0
               |
                --17.96%--0xffffffffbb001511
                          0xffffffffbb0a0d9d
                          |
                          |--16.47%--0xffffffffbb663e0c
                          |          |
                          |          |--12.37%--0xffffffffbb113398
                          |          |          |
                          |          |          |--11.57%--0xffffffffbb15872e
                          |          |          |          |
                          |          |          |          |--5.89%--0xffffffffbb15864d
                          |          |          |          |          |
                          |          |          |          |           --4.40%--0xffffffffbb158527
                          |          |          |          |                     |
                          |          |          |          |                      --4.34%--0xffffffffbb0a03a5
                          |          |          |          |                                |
                          |          |          |          |                                 --4.18%--0xffffffffbbd2503d
                          |          |          |          |                                           |
                          |          |          |          |                                            --3.36%--0xffffffffbbd2505d
                          |          |          |          |                                                      |
                          |          |          |          |                                                       --1.17%--0xffffffffbbe00c86
                          |          |          |          |                                                                 |
                          |          |          |          |                                                                  --0.65%--0xffffffffbbd24cfd
                          |          |          |          |                                                                            |
                          |          |          |          |                                                                             --0.55%--0xffffffffbb0fd26c
                          |          |          |          |
                          |          |          |           --4.57%--0xffffffffbb1586f2
                          |          |          |                     |
                          |          |          |                     |--2.03%--0xffffffffbb1360df
                          |          |          |                     |          |
                          |          |          |                     |           --1.60%--0xffffffffbbe00c86
                          |          |          |                     |                     |
                          |          |          |                     |                     |--0.89%--0xffffffffbbd24cfd
                          |          |          |                     |                     |          |
                          |          |          |                     |                     |           --0.76%--0xffffffffbb0fd26c
                          |          |          |                     |                     |                     |
                          |          |          |                     |                     |                      --0.64%--0xffffffffbb0fca51
                          |          |          |                     |                     |                                |
                          |          |          |                     |                     |                                 --0.58%--0xffffffffbbad2a58
                          |          |          |                     |                     |                                           |
                          |          |          |                     |                     |                                            --0.57%--0xffffffffbbad2415
                          |          |          |                     |                     |                                                      |
                          |          |          |                     |                     |                                                       --0.56%--ena_io_poll
                          |          |          |                     |                     |
                          |          |          |                     |                      --0.64%--0xffffffffbbd24cf8
                          |          |          |                     |
                          |          |          |                     |--1.16%--0xffffffffbb136106
isilence commented 2 months ago

IIRC, PREEMPT_NONE doesn't change userspace preemption. Regardless, it's a bit different in the kernel, and with PREEMPT_NONE it can only be preempted at some limited number of points. Might not be a problem, we have some handling; I hope waking ksoftirqd will set TIF_NEED_RESCHED for the SQPOLL thread so we can see it and yield the CPU.

I did back off from the monkey business of wanting to do_softirqs(). The main problem was that the sqpoll thread was not doing the I/O for the ring that I thought was attached to it. It was because of a misconfiguration on my part.

but the whole exchange about the scheduler treating kernel threads differently than userspace processes made me doubt what I thought I knew... There must be a lot of subtleties that your greater kernel dev experience makes you see that I don't, but I did return to my 'Understanding the Linux Kernel' book and, according to it, from the point of view of the scheduler they are treated the same way. The 2 main differences between a user process and a kernel thread are:

You're talking about kthreads, which never run any user code and should be invisible to the kernel, but I refer to code executed in the kernel. For example, it can be a normal user space task that is in a syscall. The scheduler likely doesn't differentiate user threads from kthreads for making a scheduling decision, but that doesn't mean you can always preempt a task running in the kernel, unlike when it's executing user code.

isilence commented 2 months ago

the one thing that I find odd is the big difference between SqTotalTime and SqWorkTime... does that look normal to you, Pavel?

It might be wrong, but it's supposed to tell how much time the SQPOLL thread was actually doing anything useful vs just busily polling with nothing to do. E.g. only 7% was spent posting CQEs / submitting requests and, looking at the implementation, also busy polling napi.

lano1106 commented 2 months ago

I have figured out how to get symbol names...

unless idk how to read the file, I do not find it very informative... report.txt

isilence commented 2 months ago

is there anyway to get some symbol names from perf output?

Assuming you already tried sudo, try to pass --vmlinux=<kernel_sources/vmlinux>

Also, the trace looks too short to be true.

lano1106 commented 2 months ago

I did let it run maybe 5 minutes.

do you think that my packager stripped the binary? Or maybe it got the symbols from /proc/kallsyms; I got them by running it with sudo...

I have regenerated it with: perf report --vmlinux=/home/lano1106/dev/packages/ec2-packages/linux-ec2/src/linux/vmlinux > report.txt

-rwxr-xr-x 1 lano1106 lano1106 357466376 Jul 29 13:46 vmlinux

it did generate pretty much the same thing...

lano1106 commented 2 months ago

report_from_src.txt

something that I omitted to mention is that I specified the sqpoll pid, which is different from the main process pid... I am not sure if this could have made a difference...

is it just me, or is getrusage() taking a suspicious amount of time... I did not expect this function to represent 16% of the execution time...

lano1106 commented 2 months ago

I am going to start adding traces to debug the issue.

do you agree that if NAPI polling is properly set up, with permanent sockets monitored, io_sq_thread() should never reach prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE)?

isilence commented 2 months ago

report_from_src.txt

something that I did omit to mention is that I did specify the sqpoll pid which is different than the main process pid... I am not sure if this could have made a difference...

Maybe it does just spin there doing nothing useful... I wonder what happens if you pass "--percent-limit=0" to perf report.

is it me or getrusage() is taking suspiciously a lot of time... I did not expect this function to represent 16% of the execution time...

It's busy polling, and we don't see any real activity there; getrusage() is called every loop, and the only thing the thread does is loop there, so it's not surprising. And apparently it's 16.47% out of 17.97% for that SQPOLL task.

Would be interesting to see what getrusage() amounts to if we have enough of work to do.

isilence commented 2 months ago

you think that my packager did strip the binary? or it might have gotten the symbols from /proc/kallsyms I got them by doing sudo...

The symbols are fine, the trace itself looks reasonable

something that I did omit to mention is that I did specify the sqpoll pid which is different than the main process pid... I am not sure if this could have made a difference...

Looks like "perf record" has a separate --tid option

do you agree that if NAPI polling is properly set with permanent sockets monitored, io_sq_thread() should never reach prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);

If there is enough traffic going on for your sockets, yes.

lano1106 commented 2 months ago

enough is not that much... you could get 1 packet per minute and it would be enough to keep the NAPI device in the list... I'm glad to hear that my load is absolutely nothing compared to what the processor would be able to process, but we are talking about hundreds of packets per minute, so I am very confident that there is enough activity to keep the devices in the context list... worst case scenario, as part of the WebSocket protocol, the remote endpoint sends a ping every second if the link is inactive... report.txt.gz

I am about to run this:

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..96a8f914a8ed 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -273,6 +273,8 @@ static int io_sq_thread(void *data)
        struct io_ring_ctx *ctx;
        struct rusage start;
        unsigned long timeout = 0;
+       uint64_t sleepCount = 0;
+       uint64_t spinCount = 0;
        char buf[TASK_COMM_LEN];
        DEFINE_WAIT(wait);

@@ -333,8 +335,10 @@ static int io_sq_thread(void *data)
                                mutex_lock(&sqd->lock);
                                sqd->sq_cpu = raw_smp_processor_id();
                        }
+                       ++spinCount;
                        continue;
                }
+               ++sleepCount;

                prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
                if (!io_sqd_events_pending(sqd) && !io_sq_tw_pending(retry_list)) {
@@ -375,6 +379,7 @@ static int io_sq_thread(void *data)
                finish_wait(&sqd->wait, &wait);
                timeout = jiffies + sqd->sq_thread_idle;
        }
+       printk(KERN_INFO "io_sq_thread() spin:%llu sleep:%llu\n", spinCount, sleepCount);

        if (retry_list)
                io_sq_tw(&retry_list, UINT_MAX);

it works as expected...

[ 3995.789639] io_sq_thread() spin:3349742697 sleep:5

It just came to me: this problem appeared when I correctly added the 2nd ring to the sqpoll thread... where else can the thread go to sleep?

mutex_lock in __io_sq_thread()?

lano1106 commented 2 months ago

I have made another round of instrumentation... The result is curious and I am sharing it fresh, without having done any analysis on it, in the hope that it rings a bell for someone so we can fix this problem ASAP...

diff --git a/io_uring/napi.c b/io_uring/napi.c
index 8c18ede595c4..93aafc149cea 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -196,7 +196,6 @@ void io_napi_init(struct io_ring_ctx *ctx)
 void io_napi_free(struct io_ring_ctx *ctx)
 {
    struct io_napi_entry *e;
-   LIST_HEAD(napi_list);
    unsigned int i;

    spin_lock(&ctx->napi_lock);
@@ -313,21 +312,26 @@ void __io_napi_busy_loop(struct io_ring_ctx *ctx, struct io_wait_queue *iowq)
  *
  * Splice of the napi list and execute the napi busy poll loop.
  */
-int io_napi_sqpoll_busy_poll(struct io_ring_ctx *ctx)
+int io_napi_sqpoll_busy_poll(struct io_ring_ctx *ctx, struct sqpoll_busy_stats *stats)
 {
-   LIST_HEAD(napi_list);
    bool is_stale = false;

+   if (stats)
+       ++stats->poll_call;
    if (!READ_ONCE(ctx->napi_busy_poll_to))
        return 0;
    if (list_empty_careful(&ctx->napi_list))
        return 0;

+   if (stats)
+       ++stats->poll_done;
    rcu_read_lock();
    is_stale = __io_napi_do_busy_loop(ctx, NULL);
    rcu_read_unlock();

    io_napi_remove_stale(ctx, is_stale);
+   if (stats)
+       stats->list_stall += list_empty_careful(&ctx->napi_list);
    return 1;
 }

diff --git a/io_uring/napi.h b/io_uring/napi.h
index 6fc0393d0dbe..1636d170da73 100644
--- a/io_uring/napi.h
+++ b/io_uring/napi.h
@@ -20,7 +20,16 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock);
 void __io_napi_adjust_timeout(struct io_ring_ctx *ctx,
        struct io_wait_queue *iowq, struct timespec64 *ts);
 void __io_napi_busy_loop(struct io_ring_ctx *ctx, struct io_wait_queue *iowq);
-int io_napi_sqpoll_busy_poll(struct io_ring_ctx *ctx);
+
+struct sqpoll_busy_stats
+{
+   uint64_t poll_call;
+   uint64_t poll_done;
+   uint64_t list_stall;
+   uint64_t svc_dur;
+};
+
+int io_napi_sqpoll_busy_poll(struct io_ring_ctx *ctx, struct sqpoll_busy_stats *stats);

 static inline bool io_napi(struct io_ring_ctx *ctx)
 {
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..44445ca0f514 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -166,7 +166,7 @@ static inline bool io_sqd_events_pending(struct io_sq_data *sqd)
    return READ_ONCE(sqd->state);
 }

-static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
+static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries, struct sqpoll_busy_stats *stats)
 {
    unsigned int to_submit;
    int ret = 0;
@@ -196,7 +196,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
        mutex_unlock(&ctx->uring_lock);

        if (io_napi(ctx))
-           ret += io_napi_sqpoll_busy_poll(ctx);
+           ret += io_napi_sqpoll_busy_poll(ctx, stats);

        if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
            wake_up(&ctx->sqo_sq_wait);
@@ -272,6 +272,7 @@ static int io_sq_thread(void *data)
    struct io_sq_data *sqd = data;
    struct io_ring_ctx *ctx;
    struct rusage start;
+   struct sqpoll_busy_stats poll_stats[2];
    unsigned long timeout = 0;
    char buf[TASK_COMM_LEN];
    DEFINE_WAIT(wait);
@@ -301,9 +302,12 @@ static int io_sq_thread(void *data)
    audit_uring_entry(IORING_OP_NOP);
    audit_uring_exit(true, 0);

+   memset(poll_stats, 0, sizeof(poll_stats));
+
    mutex_lock(&sqd->lock);
    while (1) {
        bool cap_entries, sqt_spin = false;
+       int i = 0;

        if (io_sqd_events_pending(sqd) || signal_pending(current)) {
            if (io_sqd_handle_event(sqd))
@@ -314,7 +318,18 @@ static int io_sq_thread(void *data)
        cap_entries = !list_is_singular(&sqd->ctx_list);
        getrusage(current, RUSAGE_SELF, &start);
        list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
-           int ret = __io_sq_thread(ctx, cap_entries);
+           struct sqpoll_busy_stats *pCurStats = NULL;
+           int ret;
+           u64 startTs;
+
+           if (i < 2) {
+               pCurStats = &poll_stats[i++];
+               startTs = ktime_get_ns();
+           }
+           ret = __io_sq_thread(ctx, cap_entries, pCurStats);
+           if (pCurStats) {
+               pCurStats->svc_dur += ktime_get_ns() - startTs;
+           }

            if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
                sqt_spin = true;
@@ -375,6 +390,10 @@ static int io_sq_thread(void *data)
        finish_wait(&sqd->wait, &wait);
        timeout = jiffies + sqd->sq_thread_idle;
    }
+   for (int i = 0; i < 2; ++i)
+       printk(KERN_INFO "ring %d NAPI stats call:%llu done:%llu stall:%llu dur:%llu\n",
+                       i, poll_stats[i].poll_call, poll_stats[i].poll_done, poll_stats[i].list_stall,
+           poll_stats[i].svc_dur);

    if (retry_list)
        io_sq_tw(&retry_list, UINT_MAX);
[ 9759.047592] ring 0 NAPI stats call:150352 done:150352 stall:0 dur:651818363433
[ 9759.047600] ring 1 NAPI stats call:   587 done:   587 stall:0 dur:696062839394

one thing... I cannot explain how it is possible that each ring has such a different number of polls... despite the very different number of polls, they took about the same duration, and ring 1 even took more time for less polling...

we are getting closer to the issue... I am not seeing it yet but it is about to show us its ugly butt...

I got some time to think it over. Anyway, there are not 3000 potential explanations for what we are seeing...

the only way that one ring could have fewer polls than the other is if, the majority of the time, it has no sqe to process, but when it does, they cause the sqpoll thread to go to sleep for a very long time.

It makes sense with what I know of my rings usage pattern.

ring 0 has almost exclusively inbound traffic (maybe 99% inbound, 1% outbound). Fortunately, I am using buffer rings, so returning this massive amount of buffers is not generating many sqes... (in that case, I wonder what all those sqes are... I need to remember how to trace them...) ring 1 has much less traffic, but it generates a mix (50% outbound, 50% inbound).

So there is something in the requested OP that makes the thread wait and go to sleep... I feel like the next step is to trace what sqes are circulating...

lano1106 commented 2 months ago

it is funny how you can misinterpret data because of a bad assumption. I thought that the ring 1 context was pushed onto the back of the sqd context list, but contexts are inserted at the list head. So they are in reverse order of insertion: when my trace says ring 0, it is actually ring 1, and when it says ring 1, it is ring 0.

I'll need to filter events to get the sqe types of the low-volume sqe ring, because it is mysterious that it takes so much time to process so few requests...

I have watched the trace pipe for a few minutes after enabling the io_uring_submit_req tracepoint. I have seen no sqe at all for one of the rings... It runs strictly on multishot completions... When that ring was alone, the sqpoll thread was spinning at 100%... processing task work does not put the thread to sleep...

it is the submission part, present now that a second ring has been added to manage, that is inducing the 50% sleep...

this realization made me wonder about something... Does sqpoll really do the NAPI busy loop if it manages a ring that is 100% multishot? io_uring_submit.txt

the fd is encoded in the user_data... POLL_ADD is libev rearming the POLLOUT event on its eventfd... the other repeating submission is a SEND request on TCP socket 26.
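As an aside, packing the fd into user_data like that is straightforward with liburing's io_uring_sqe_set_data64(); a sketch with a made-up tag layout (OP_TAG_SEND and the 32-bit split are illustrative, not the application's actual scheme):

    #include <liburing.h>
    #include <stddef.h>
    #include <stdint.h>

    enum { OP_TAG_SEND = 1 };   /* hypothetical operation tag */

    static void queue_send(struct io_uring *ring, int fd, const void *buf, size_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return;     /* submission queue full */
        io_uring_prep_send(sqe, fd, buf, len, 0);
        /* pack an operation tag and the fd into user_data for later decoding */
        io_uring_sqe_set_data64(sqe, ((uint64_t)OP_TAG_SEND << 32) | (uint32_t)fd);
    }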

this made me realise 2 things:

  1. libev code could be modified to use multishot poll
  2. I can totally bypass the eventfd signaling since that thread is also busy polling the libev loop backend

My guess is that the SEND requests' issue function is where the sleep is occurring. If not, it must be when some highly contended mutex is acquired...

isilence commented 2 months ago

enough is not that much... you could get 1 packet per minute and it should be enough to keep the NAPI device in the list... I'm glad to hear that my load is absolutely nothing for what the processor would be able to process but we are talking about hundreds of packets per minute so, I am very confident that there is enough activity to keep the devices in the context list... worse case scenario, I think that part of the WebSocket protocol, the remote endpoint is sending a ping every second if the link is inactive...

It's not that I'm concerned about napi being kept around, it's rather that the sqpoll thread is doing something weird. It feels like the thread you were profiling is not the one driving traffic or so.

isilence commented 2 months ago

I just come to think about it, this problem did appear when I added correctly the 2nd ring to the sqpoll thread... where else can the thread go to sleep? mutex_lock in __io_sq_thread()?

It'd go to sleep if the scheduler decides that its time is over; look for cond_resched() in sqpoll.c.

it is the submission part now present after having added a second ring to manage that is inducing a 50% sleep...

So maybe the CPU is just shared between 2 SQPOLL threads. I believe you're saying that's your setup for this round of testing, with each taking 50% of the CPU. Why not ATTACH_WQ|SQPOLL?

A side note, be aware that attaching is only possible when io_uring instances are run by tasks in the same task group, otherwise it'll create a new SQPOLL thread.

this realization made me realize something... Does sqpoll really do NAPI busy loop if it manage a ring that is 100% multishot? io_uring_submit.txt

Yes, it should, assuming we don't remove the napi. softirq / busy polling is what generates packets in the socket and notifies polling the recv (incl multishot) via polling.

the fd is encoded in the user_data... POLL_ADD is libev rearming the POLLOUT event on its eventfd... the other repeating submission is a SEND request of TCP socket 26.

this made me realise 2 things:

1. libev code could be modified to use multishot poll

Multishot poll is problematic, unlike all other multishots doing real work (receiving data, accepting a socket, etc.).

2. I can totally bypass the eventfd signaling since that thread is also busy polling the libev loop backend

My guess is that the SEND requests issue function where the sleep is occurring. If not, it must when some high contention mutex is acquired...

Do you share the socket between threads?

lano1106 commented 2 months ago

I just come to think about it, this problem did appear when I added correctly the 2nd ring to the sqpoll thread... where else can the thread go to sleep? mutex_lock in __io_sq_thread()?

It'd go sleeping if the scheduler decides that its time is over, look for cond_resched() in sqpoll.c.

it would only do that if there were other runnable tasks on the processor. This is not the case: SQPOLL is the only task assigned and running on this isolated CPU. If I detach ring 2, with the exact same set of tasks on the CPU, SQPOLL can take 100%, but as soon as the code path pattern changes, the scheduler would decide to schedule it out 50% of the time?

anyway, I have just rediscovered the tracepoints system... there must be one showing what the scheduler does and through which task states the sqpoll thread goes. If it is scheduled out, the traces will be different than if it goes to sleep after putting itself on some wait list...

So maybe the CPU is just shared between 2 SQPOLL threads, I believe you're saying that's your setup for this round of testing, each takes 50% of CPU. Why not ATTACH_WQ|SQPOLL?

that was the initial problem. I was misconfiguring things. I thought that I was attaching ring 2 to the SQPOLL thread, but I was not, because the flags passed at ring creation were: ATTACH_WQ|COOP_TASKRUN

it was fixed by correcting them to ATTACH_WQ|SQPOLL
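For reference, a sketch of what the corrected setup amounts to with liburing (the entry count is arbitrary; the important parts are the flag combination and wq_fd pointing at the first SQPOLL ring, created by a task in the same task group):

    #include <liburing.h>
    #include <string.h>

    /* first_ring was created with IORING_SETUP_SQPOLL; attach the second ring
     * to the same SQPOLL thread instead of spawning a new one */
    static int create_attached_ring(struct io_uring *first_ring, struct io_uring *second_ring)
    {
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
        p.wq_fd = first_ring->ring_fd;

        return io_uring_queue_init_params(256, second_ring, &p);
    }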

no. I would see it. This is the output of top; the tasks are filtered by pressing lowercase 'o' and entering the filter 'P=0'. If sqpoll was scheduled out, I would expect to see some other task taking some CPU time...

P     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND              
0     935 lano1106  20   0 1311.4m 105.3m  15.1m R  55.3   0.7 271:01.44 iou-sqp-904          
0       2 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 kthreadd             
0       3 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 pool_workqueue_rele+ 
0       4 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-rcu_gp     
0       5 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-sync_wq    
0       6 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-slub_flus+ 
0       7 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-netns      
0       8 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:0-rcu_gp   
0       9 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:0H-events+ 
0      10 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:1-rcu_gp   
0      12 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/R-mm_percpu+ 
0      14 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 rcu_tasks_rude_kthr+ 
0      15 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 rcu_tasks_trace_kth+ 
0      16 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:02.22 ksoftirqd/0          
0      17 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:00.06 rcu_sched            
0      18 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:04.93 rcuog/0              
0      19 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/0              
0      20 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcu_exp_par_gp_kthr+ 
0      21 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcu_exp_gp_kthread_+ 
0      22 root      rt   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 migration/0          
0      23 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 idle_inject/0        
0      24 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 cpuhp/0              
0      31 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/1              
0      37 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/2:0H         
0      38 root      20   0    0.0m   0.0m   0.0m I   0.0   0.0   0:04.84 rcuog/2              
0      39 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/2              
0      46 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 rcuos/3              
0      59 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/9-acpi           
0      76 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/25-pciehp        
0      79 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/27-pciehp        
0      81 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/29-pciehp        
0      83 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/31-pciehp        
0      85 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/33-pciehp        
0      87 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/35-pciehp        
0      89 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/37-pciehp        
0      91 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/39-pciehp        
0      93 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/41-pciehp        
0      95 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/43-pciehp        
0      97 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/45-pciehp        
0      99 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/47-pciehp        
0     101 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/49-pciehp        
0     103 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/51-pciehp        
0     105 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/53-pciehp        
0     107 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/55-pciehp        
0     109 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/57-pciehp        
0     111 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/59-pciehp        
0     113 root     -51   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 irq/61-pciehp        
0     871 root       0 -20    0.0m   0.0m   0.0m I   0.0   0.0   0:00.00 kworker/0:1H

notice that the only task in the RUNNING state is sqpoll... I am not sure what IDLE means... (especially the NOLOAD part...)

#define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

but there is no chance that the scheduler schedules them...

A side note, be aware that attaching is only possible when io_uring instances are run by tasks in the same task group, otherwise it'll create a new SQPOLL thread.

I am aware of that. I jumped through hoops with a barrier system: threads are created at initialization and explicitly create their own rings, then stop at the barrier until the main thread completes the rest of the program initialization. Initially, it was the main thread that created the rings and then handed them out to the program threads for their exclusive usage, but that was not good enough...

Yes, it should, assuming we don't remove the napi. softirq / busy polling is what generates packets in the socket and notifies polling the recv (incl multishot) via polling.

How?

the napi poll function is inside the block:

if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

if there are no sqe entries or iopoll list entries, there will be no busy polling, IMHO. (This inspires another one-liner patch.)

Multishot poll is problematic, unlike all other multishots doing real work (receiving data, accepting a socket, etc.).

ok. I am taking good note of that and won't go in that direction then. I'll just make sure that the eventfd is not signaled if it is not needed.

Do you share the socket between threads?

it is a big no. Sockets are private and exclusive to each thread. This is one of my design challenges. If one thread wants something sent out on the other thread's socket, it notifies it by placing a request on a queue. This adds latency to the reaction, and I must be sure the recipient thread is on standby for new requests on the ITC queue when they come in. The eventfd POLL_ADD sqe you see is that. There are many more notifications than that; libev optimizes them away by using an atomic flag, but when it is about to enter its backend to wait for new events, it sets a variable pipe_write_wanted to 1. The other variable is pipe_write_skipped. It is really very beautiful code and among the only legit usages of sequentially consistent memory ordering with MFENCE that I have seen. I simply think that I can push the optimization further by treating the timeout value as 0 even when the thread is entering its current pipe_write_wanted block.
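A rough sketch of the wakeup-elision pattern being described, with hypothetical names mirroring libev's pipe_write_wanted / pipe_write_skipped (an illustration of the idea, not libev's actual code):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <unistd.h>

    static atomic_int pipe_write_wanted;   /* consumer is about to block in its backend */
    static atomic_int pipe_write_skipped;  /* producer deferred the eventfd write */
    static int evfd;                       /* an eventfd created elsewhere */

    /* producer: only touch the eventfd if the consumer may actually go to sleep */
    static void notify_consumer(void)
    {
        if (atomic_load(&pipe_write_wanted)) {
            uint64_t one = 1;
            (void)write(evfd, &one, sizeof(one));
        } else {
            atomic_store(&pipe_write_skipped, 1);
        }
    }

    /* consumer: announce the intent to sleep, then re-check for skipped wakeups;
     * if one raced in, poll with a zero timeout instead of blocking */
    static int consumer_may_block(void)
    {
        atomic_store(&pipe_write_wanted, 1);
        atomic_thread_fence(memory_order_seq_cst);   /* the MFENCE mentioned above */
        if (atomic_exchange(&pipe_write_skipped, 0)) {
            atomic_store(&pipe_write_wanted, 0);
            return 0;
        }
        return 1;
    }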

lano1106 commented 2 months ago

Yes, it should, assuming we don't remove the napi. softirq / busy polling is what generates packets in the socket and notifies polling the recv (incl multishot) via polling.

How?

the napi poll function is inside the block:

if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

if there are no sqe entries or iopoll list entries, there will be no busy polling, IMHO. (This inspires another one-liner patch.)

FYI, 6 hours ago, I did set this:

$ sudo cat /sys/kernel/debug/tracing/trace_pipe | grep -v "ring 000000006359f8dd" | tee io_uring_submit_filtered.txt

(I was not able to make the filter work: echo "ctx != 0x000000006359f8dd" > /sys/kernel/tracing/events/io_uring/io_uring_submit_req/filter)

and in the last 6 hours, my ring0 did receive 0 SQE...

lano1106 commented 2 months ago

if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

I am currently obsessed with optimizing my cache usage, and this is the type of thing that I have trained my eye to see... unless the current order was intentional to optimize the iopoll case, I think that reversing the evaluation could reduce the thread's memory accesses and possibly remove a branch evaluation in the vast majority of cases, where !wq_list_empty(&ctx->iopoll_list) will always be false...

and to_submit is/could already be in a register when the if statement is reached
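For clarity, the reordering being proposed amounts to a one-line operand swap in __io_sq_thread() (an untested sketch, same spot as the line quoted above):

-       if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {
+       if (to_submit || !wq_list_empty(&ctx->iopoll_list)) {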

what do you think of my comment?

isilence commented 2 months ago

Yes, it should, assuming we don't remove the napi. softirq / busy polling is what generates packets in the socket and notifies polling the recv (incl multishot) via polling.

How?

I misunderstood you on this one. I'm saying that semantically they're pretty much compatible ...

the napi poll function is inside the block:

if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

... however, wasn't looking at the actual call sites. Yes, it is mishandled here. I'd even say it should be a separate loop, so we can first queue up all requests and only then poll.

isilence commented 2 months ago

and in the last 6 hours, my ring0 did receive 0 SQE...

That's what I mean: it looks like the SQPOLL you're looking at either does nothing, or at least so little that it doesn't even show up in the trace.

lano1106 commented 2 months ago

Yes, it should, assuming we don't remove the napi. softirq / busy polling is what generates packets in the socket and notifies polling the recv (incl multishot) via polling.

How?

I misunderstood you on this one. I'm saying that semantically they're pretty much compatible ...

the napi poll function is inside the block: if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

... however, wasn't looking at the actual call sites. Yes, it is mishandled here. I'd even say it should be a separate loop, so we can first queue up all requests and only then poll.

I am currently testing this:

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..fd62f63ee440 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -195,15 +195,14 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
                        ret = io_submit_sqes(ctx, to_submit);
                mutex_unlock(&ctx->uring_lock);

-               if (io_napi(ctx))
-                       ret += io_napi_sqpoll_busy_poll(ctx);
-
                if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
                        wake_up(&ctx->sqo_sq_wait);
                if (creds)
                        revert_creds(creds);
        }

+       if (io_napi(ctx))
+               ret += io_napi_sqpoll_busy_poll(ctx);
        return ret;
 }

I was about to double test io_napi(ctx), but I do not think that the NAPI busy poll does any credential validation...

tell me what you think, I will adjust accordingly

lano1106 commented 2 months ago

and in the last 6 hours, my ring0 did receive 0 SQE...

That's what I mean: it looks like the SQPOLL you're looking at either does nothing, or at least so little that it doesn't even show up in the trace.

and yet, this is my busiest ring, processing a million packets every 5 minutes or so without needing to send a single SQE, thanks to multishot. Would the work done processing the resulting task work show up in the traces?

isilence commented 2 months ago

and in the last 6 hours, my ring0 did receive 0 SQE...

That's what I mean: it looks like the SQPOLL you're looking at either does nothing, or at least so little that it doesn't even show up in the trace.

and yet, this is my busiest ring, processing a million packets every 5 minutes or so without needing to send a single SQE, thanks to multishot. Would the work done processing the resulting task work show up in the traces?

Yes, it'll be there. And there was also nothing printed by the bpftrace script. For comparison, slightly modifying the script to print traces and trying it with a random sqpoll workload:

...
@time_out_tw[0xffff8faab17e58c0, 
    io_handle_tw_list+5
    tctx_task_work_run+81
    io_sq_tw+110
    io_sq_thread+932
    ret_from_fork+52
    ret_from_fork_asm+26
]: 500098

Maybe you were recording a profile / running bpftrace only during a time when it was inactive, and missed when it was processing the work or so? It seems you're saying it's bursty and has long intervals of inactivity, hmm?

isilence commented 2 months ago

Yes, it should, assuming we don't remove the napi. softirq / busy polling is what generates packets in the socket and notifies polling the recv (incl multishot) via polling.

How?

I misunderstood you on this one. I'm saying that semantically they're pretty much compatible ...

the napi poll function is inside the block: if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

... however, wasn't looking at the actual call sites. Yes, it is mishandled here. I'd even say it should be a separate loop, so we can first queue up all requests and only then poll.

I am currently testing this:

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..fd62f63ee440 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -195,15 +195,14 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
                        ret = io_submit_sqes(ctx, to_submit);
                mutex_unlock(&ctx->uring_lock);

-               if (io_napi(ctx))
-                       ret += io_napi_sqpoll_busy_poll(ctx);
-
                if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
                        wake_up(&ctx->sqo_sq_wait);
                if (creds)
                        revert_creds(creds);
        }

+       if (io_napi(ctx))
+               ret += io_napi_sqpoll_busy_poll(ctx);
        return ret;
 }

I was about to double test io_napi(ctx) but I do not think that NAPI busy poll do any credential validation...

tell me what you think, I will adjust accordingly

I think it makes more sense to keep it in a separate loop, after submitting everything there is to submit

io_sq_thread() {
    list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
        __io_sq_thread(ctx);
    io_sq_tw();
    list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
        run_napi(ctx);

}
isilence commented 2 months ago

Do you share the socket between threads?

it is a big no. sockets are private and exclusive to each thread.

Great, then there shouldn't be really any long waiting for sending, the only ones I can think of is socket lock and allocations. That's unless someone screwed it up, which we can actually try to trace.

lano1106 commented 2 months ago

I think it makes more sense to keep it in a separate loop, after submitting everything there is to submit

io_sq_thread() {
  list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
      __io_sq_thread(ctx);
  io_sq_tw();
  list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
      run_napi(ctx);

}

ok. it makes sense... I guess the ordering is moot since it is an infinite loop, but I would have thought that performing the NAPI loop before checking the task work would be a better ordering, since the napi loop can generate task work items.

if executing io_sq_tw() before NAPI busy polling can be rationalized, then yes, the 2-loop setup is the way to go. Otherwise, the loops can be merged into a single one. It just needs to make sure that the busy polling is not conditional on the number of sqes to submit, like I proposed.

What is your opinion on the matter?

lano1106 commented 2 months ago

Do you share the socket between threads?

it is a big no. sockets are private and exclusive to each thread.

Great, then there shouldn't be really any long waiting for sending, the only ones I can think of is socket lock and allocations. That's unless someone screwed it up, which we can actually try to trace.

I have super interesting data...

I have done this:

echo common_pid = sqp_pid > /sys/kernel/tracing/events/sched/filter
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable

and I have discovered that the sqd task was switched out dozens of times every second with RCU-related tasks and ksoftirqd

based on what I am seeing in top, the kernel is spending half of its time context switching... There is clearly something off around there...

well, for one, if the NAPI busy poll is called more often, I suppose it can be expected that invoking ksoftirqd will happen less often, but the RCU thing... it looks like a serious glitch... sqp_sched.txt

lano1106 commented 2 months ago

I might have created my own problem...

https://www.suse.com/c/cpu-isolation-full-dynticks-part2/

if it is an RCU thing that is causing what I see... AFAIK, the only feature using RCU is the NAPI busy poll... tbh, I have never really understood what the benefit of RCU was in that context... was the NAPI devices list lock considered so contentious and serious that it warranted invoking the heavy-duty RCU cavalry?

to your knowledge, is there anything else using RCU? in particular around the SEND OP?

This is clearly overkill in my use case, considering that my NAPI device list contains a single element and, on top of that, the list is practically immutable...

isilence commented 2 months ago

I think makes more sense to keep it in a separate loop after submitting everything there is to submit

io_sq_thread() {
    list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
        __io_sq_thread(ctx);
    io_sq_tw();
    list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
        run_napi(ctx);

}

ok. it makes sense... I guess the ordering is moot since it is an infinite loop, but I would have thought that performing the NAPI loop before checking the task work would be a better ordering, since the napi loop can generate task work items.

What is your opinion on the matter?

That was also my first thought, but if a sleeping SQPOLL (see the prepare_to_wait() chunk) gets woken by a task_work, we want it to process the task_work without delay and give the result to userspace, otherwise polling will add to request latency.

isilence commented 2 months ago

Do you share the socket between threads?

it is a big no. sockets are private and exclusive to each thread.

Great, then there shouldn't be really any long waiting for sending, the only ones I can think of is socket lock and allocations. That's unless someone screwed it up, which we can actually try to trace.

I have super interesting data...

Something like this script can be useful; it prints unique stack traces for when the task was scheduled out while submitting requests (it can be a bit heavy). FWIW, you can likely run it globally without specifying a pid.

https://gist.github.com/isilence/822a3a1116a126040928171352171ec1

lano1106 commented 2 months ago

if (!wq_list_empty(&ctx->iopoll_list) || to_submit) {

I am currently obsessed with optimizing my cache usage, and this is the type of thing that I have trained my eye to see... unless the current order was intentional to optimize the iopoll case, I think that reversing the evaluation could reduce the thread's memory accesses and possibly remove a branch evaluation in the vast majority of cases, where !wq_list_empty(&ctx->iopoll_list) will always be false...

and to_submit is/could already be in a register when the if statement is reached

what do you think of my comment?

don't forget to tell me what you think of this idea... If you are favorable to it, I'll create a small patch for that

lano1106 commented 2 months ago

Do you share the socket between threads?

it is a big no. sockets are private and exclusive to each thread.

Great, then there shouldn't be really any long waiting for sending, the only ones I can think of is socket lock and allocations. That's unless someone screwed it up, which we can actually try to trace.

I have super interesting data...

Something like this script can be useful, prints unique stack traces for when the task was scheduled out while submitting requests (can be a bit heavy). FWIW, likely you can run it globally without specifying a pid.

https://gist.github.com/isilence/822a3a1116a126040928171352171ec1

thx for sharing this script but the more I think about it, the more I believe that the RCU done by the NAPI busy polling code is the culprit of my issue.

Since I resolved my IRQ issue, I had never tested the sqp performance without busy polling. We learned in the last 2 days that before attaching my second ring to the sqp thread, despite enabling NAPI polling, I was not really doing any, because of what I have discovered in __io_sq_thread().

It should be easy to disable NAPI on my side. If my guess is right, by doing so, I should see SQP thread CPU usage go back to close to 100%.
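(For reference, turning it off from the application side is just the unregister call; a sketch assuming liburing's API, where the passed struct is expected to receive the previous settings:)

    #include <liburing.h>

    static int disable_napi_busy_poll(struct io_uring *ring)
    {
        struct io_uring_napi old = { 0 };

        /* clears the ring's NAPI busy-poll configuration */
        return io_uring_unregister_napi(ring, &old);
    }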

the problem comes from RCU calls made by the NAPI busy poll code! I have disabled NAPI on my working setup and here is the sqp CPU usage while it correctly manages my 2 rings:

P     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                     
0     828 lano1106  20   0 1183.5m 102.8m  14.9m R  97.1   0.7   0:54.29 iou-sqp-695
isilence commented 2 months ago

don't forget to tell me what you think of this idea... If you are favorable to it, I'll create a small patch for that

I can't say I care enough tbh, it's a polling loop, I don't see what you'd win from doing that, but if you send a patch it'll most likely be taken.

lano1106 commented 2 months ago

I thought that it was part of the io_uring hot path and any micro optimization could show up in benchmark runs...

it is already tested on my VPS. this does not break anything at least. I will send out a patch for it....

isilence commented 2 months ago

I thought that it was part of the io_uring hot path and any micro optimization could show up in benchmark runs...

I'm all for micro optimisations when they can benefit you somehow. In the case of SQPOLL, if it's doing useful work 100% of the time (submitting requests, executing task_work, posting CQEs and so on), then yes, that's useful. But I'd guess it's usually under 100%, and it'll still continue to burn CPU until it's out of time, so it's not like it will change anything.

Anyway, I'm not against a patch like that

lano1106 commented 2 months ago

Pavel, I have just sent out the patches that we discussed today. Pay special attention to the NAPI busy poll refactoring one...

in order to test it, I have re-enabled my use of the NAPI busy poll in my rings. I was expecting the same 50% slowdown but, in an unexplainable way, it kept spinning at 100%... (but the insane RCU context switches continue)

I am a bit suspicious of that result. It is as if my code refactoring was in fact NOT calling the busy poll function... but I did double-check and review the code and all seems good... so idk... just pay special attention in case you see a mistake that I made that I am somehow incapable of seeing.

lano1106 commented 2 months ago

I thought that it was part of the io_uring hot path and any micro optimization could show up in benchmark runs...

I'm all for micro optimisations when they can benefit you somehow. In the case of SQPOLL, if it's doing useful work 100% of the time (submitting requests, executing task_work, posting CQEs and so on), then yes, that's useful. But I'd guess it's usually under 100%, and it'll still continue to burn CPU until it's out of time, so it's not like it will change anything.

Anyway, I'm not against a patch like that

anyway it is done... you never know when minimizing memory accesses will pay off in some unexpected way... is iopoll_list on its own cacheline, guaranteeing that false sharing with another thread is impossible?

lano1106 commented 2 months ago

some closing comments:

lano1106 commented 2 months ago

and in the last 6 hours, my ring0 did receive 0 SQE...

That's what I mean: it looks like the SQPOLL you're looking at either does nothing, or at least so little that it doesn't even show up in the trace.

it just occurred to me why you were not seeing much in the perf report: it must be because sqpoll was running on a nohz_full processor. I think that perf needs interrupts to work properly...

Attempting to use nohz_full clearly makes someone exit Kansas for real...

https://www.kernel.org/doc/html/latest/timers/no_hz.html

I have stopped running sqpoll thread on a nohz_full node. It is futile to try to run the massively RCU enabled net code on such node...

I keep 2 out of 4 processors as nohz_full...

this whole adventure did allow me to shave an extra 10uSec in my reaction time... I'm still not fully satisfied... I need to optimize even more...

isilence commented 2 months ago

and in the last 6 hours, my ring0 did receive 0 SQE...

That's what I mean: it looks like the SQPOLL you're looking at either does nothing, or at least so little that it doesn't even show up in the trace.

it just occurred to me why you were not seeing much in the perf report: it must be because sqpoll was running on a nohz_full processor. I think that perf needs interrupts to work properly...

I have to say I'm not familiar with nohz_full; I need to read the article you linked. It would be interesting if that's what made it so, thanks for the hint.

Attempting to use nohz_full clearly makes someone exit Kansas for real...

https://www.kernel.org/doc/html/latest/timers/no_hz.html

I have stopped running sqpoll thread on a nohz_full node. It is futile to try to run the massively RCU enabled net code on such node...

Yeah, there is always a lot of RCU regardless of whether io_uring or some particular piece uses it, and it's not like we can live without RCU.

I keep 2 out of 4 processors as nohz_full...

this whole adventure did allow me to shave an extra 10uSec in my reaction time... I'm still not fully satisfied... I need to optimize even more...

Sounds like a good result!