axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Issue generating core dumps since using io_uring #346

Closed lano1106 closed 3 years ago

lano1106 commented 3 years ago

I have taken care of the usual suspects for when this happens:

ulimit -c unlimited

Adding prctl(PR_SET_DUMPABLE, 1); in main(), because I set a few capability bits on the executable file (which clears the dumpable flag).

When my io_uring process is killed with SEGV signal, the generated core dump has a size of 0 bytes.

Could it be related to some security feature around the fact that memory is shared with the kernel when using io_uring?

If it is, is there any way to convince the kernel to core dump anyway? Not having core dumps seriously reduces my ability to debug my app...

isilence commented 3 years ago

Weird. Try a fair experiment -- first crash the exact same code, with all your prctl calls but without io_uring, and then with it. I really doubt it's io_uring related.

lano1106 commented 3 years ago

ok, thx. I'll try to recreate the problem with a simple test case. I'll post the details, and close the issue if it's not conclusive.

lano1106 commented 3 years ago

This won't be an easy battle, it seems. I have selected https://github.com/frevib/io_uring-echo-server for my core dump test, and liburing is failing on me when attempting to mmap() the sqes, for some weird reason that I am tracking down by adding printfs...

$ ./io_uring_echo_server 1976
io_uring echo server listening for connections on port: 1976
entries: 2048
p->sq_entries: 2048
sizeof(struct io_uring_sqe): 64
sq->sqes = mmap(0, 4) failed
io_uring_queue_mmap() failed
io_uring_init_failed... : Invalid argument

lano1106 commented 3 years ago

Almost there but why is it so painful....

$ ./io_uring_echo_server 1976
io_uring echo server listening for connections on port: 1976
Before memset() in io_uring_queue_mmap(): p->sq_entries: 2048
After memset() in io_uring_queue_mmap(): p->sq_entries: 0
io_uring_init_failed... : Invalid argument

lano1106 commented 3 years ago

almost but why....

$ ./io_uring_echo_server 1976
io_uring echo server listening for connections on port: 1976
from io_uring_echo-server main() sizeof(struct io_uring): 160
From io_uring_queue_mmap() sizeof(*ring): 216 sizeof(struct io_uring): 216
Before memset() in io_uring_queue_mmap(): p->sq_entries: 2048
After memset() in io_uring_queue_mmap(): p->sq_entries: 0
io_uring_init_failed... : Invalid argument

lano1106 commented 3 years ago

idk how liburing is compiled, but the io_uring_echo_server compile options cannot be more vanilla than this...

cc io_uring_echo_server.c -o ./io_uring_echo_server -I./liburing/src/include/ -L./liburing/src/ -Wall -O2 -D_GNU_SOURCE -luring

oh... I got it... The little bastard has its own liburing private copy...

lano1106 commented 3 years ago

ok... back on track... Yes, I do get a core dump with the simple io_uring_echo_server... Back to my application... I control whether I use io_uring or not with the following function:

namespace {
enum { MIN_MEMLOCK_VAL = 2048 };
}

/*
 * getBestEvBackend()
 */
unsigned getBestEvBackend(unsigned fallbackChoice)
{
    struct rlimit rlim;
    if (unlikely(getrlimit(RLIMIT_MEMLOCK, &rlim) == -1)) {
        ERROR0("getrlimit() call failed");
        return fallbackChoice;
    }
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= MIN_MEMLOCK_VAL)
        return EVBACKEND_IOURING;
    WARN2("RLIMIT_MEMLOCK too low for io_uring backend: soft: %llu, hard: %llu",
          (unsigned long long)rlim.rlim_cur, (unsigned long long)rlim.rlim_max);
    /*
     * Try to increase MEMLOCK soft limit.
     * This will work with a privileged process (root or CAP_SYS_RESOURCE)
     */
    rlim.rlim_cur = MIN_MEMLOCK_VAL;
    if (rlim.rlim_max != RLIM_INFINITY && rlim.rlim_max < MIN_MEMLOCK_VAL)
        rlim.rlim_max = MIN_MEMLOCK_VAL;
    if (setrlimit(RLIMIT_MEMLOCK, &rlim) == -1) {
        WARN0("setrlimit() call failed. Add CAP_SYS_RESOURCE to the exec file to make it succeed");
        return fallbackChoice;
    }
    return EVBACKEND_IOURING;
}

and with my current settings, I am not using io_uring:

$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 40
file size                   (blocks, -f) unlimited
pending signals                     (-i) 128115
max locked memory           (kbytes, -l) 1024
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 128115
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

The app logs confirm this. I do get a core dump, but it is truncated, and I get these suspicious warnings from gdb:

BFD: warning: /home/lano1106/core/core.3259423 is truncated: expected core file size >= 37523456, found: 31792

warning: Can't open file /var/db/nscd/services during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: Can't open file /var/db/nscd/hosts during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing

nscd is glibc stuff...

I'll need to double-check thoroughly... it seems that despite thinking I am not using io_uring, I still am... Those warnings must be referring to the liburing mmaps...

lano1106 commented 3 years ago

ok ok... I'm getting sloppy... I should take a day off to have a clearer mind... My function succeeds and I do use 2 io_uring instances.

Let me change things to force epoll instead...

lano1106 commented 3 years ago

So, if my app uses epoll, I do get a well-formatted core dump file:

warning: Can't open file /var/db/nscd/services during file-backed mapping note processing

warning: Can't open file /var/db/nscd/hosts during file-backed mapping note processing
[New LWP 3260392]
[New LWP 3260393]
[New LWP 3260396]
[New LWP 3260395]
[New LWP 3260397]
[New LWP 3260398]
Warning: couldn't activate thread debugging using libthread_db: Cannot find new threads: generic error

warning: File "/usr/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
    add-auto-load-safe-path /usr/lib/libthread_db-1.0.so
line to your configuration file "/home/lano1106/.gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/lano1106/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `./test_kraken'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe5f0c3139e in epoll_wait () from /usr/lib/libc.so.6
[Current thread is 1 (Thread 0x7fe5f027bbc0 (LWP 3260392))]
(gdb) where
#0  0x00007fe5f0c3139e in epoll_wait () from /usr/lib/libc.so.6
#1  0x00007fe5f0744cc1 in ?? () from /usr/lib/libev.so.4
#2  0x00007fe5f0747b0b in ev_run () from /usr/lib/libev.so.4
#3  0x00007fe5f1112b0a in lws_service_tsi () from /usr/lib/libwebsockets.so.15
#4  0x00007fe5f12c7cd6 in WebSocketContext::service (this=0x7ffc2e17e698, tsi=0) at ws_context.cpp:289
#5  0x00007fe5f121558a in MTClient::service (this=this@entry=0x7ffc2e17e680) at client.cpp:89
#6  0x000055b220ad4860 in main (argc=<optimized out>, argv=0x7ffc2e17e680) at test.cpp:273
(gdb) 

isilence commented 3 years ago

I just hope it's not related to http://mail.spinics.net/lists/io-uring/msg08205.html

lano1106 commented 3 years ago

no I do not think that it is:

top - 10:59:49 up 10:44,  2 users,  load average: 2.31, 2.11, 2.04
Threads: 199 total,   3 running, 196 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.2 sy, 24.9 ni, 74.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  64267.1 total,  60413.3 free,   3271.9 used,    581.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  60395.7 avail Mem 

P     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND          
2    1101 lano1106 -99   0  380852  23664  10588 S   2.7   0.0   0:02.74 test      
1     404 lano1106  20   0   15944   9092   7644 S   0.0   0.0   0:00.03 systemd          
4     405 lano1106  20   0   21652   2692      0 S   0.0   0.0   0:00.00 (sd-pam)         
4     411 lano1106  20   0   12848   4404   3148 S   0.0   0.0   0:00.04 sshd             
5     414 lano1106  20   0   12848   4492   3240 S   0.0   0.0   0:00.62 sshd             
0     415 lano1106  20   0    8564   4936   3376 S   0.0   0.0   0:00.01 bash             
0     441 lano1106  20   0   10820   3956   3388 R   0.0   0.0   0:52.06 top              
1     497 lano1106  20   0   13260   5528   3856 S   0.0   0.0   0:01.60 sshd             
0     498 lano1106  20   0    9072   5728   3708 S   0.0   0.0   0:00.15 bash             
4    1103 lano1106  20   0  380852  23664  10588 S   0.0   0.0   0:00.00 Logging          
4    1106 lano1106  20   0  380852  23664  10588 S   0.0   0.0   0:00.00 Housekeeping     
2    1107 lano1106 -99   0  380852  23664  10588 S   0.0   0.0   0:00.00 iou-mgr-1101     
5    1110 lano1106  20   0  380852  23664  10588 S   0.0   0.0   0:00.00 MariaDB   
0    1111 lano1106  20   0  380852  23664  10588 S   0.0   0.0   0:00.00 Account   
3    1112 lano1106  rt   0  380852  23664  10588 S   0.0   0.0   0:00.00 Execution SVC    
3    1113 lano1106  rt   0  380852  23664  10588 S   0.0   0.0   0:00.00 iou-mgr-1112     
6    1114 lano1106 -99   0  380852  23664  10588 S   0.0   0.0   0:00.00 iou-wrk-1101     
5    1109 lano1106  20   0    5396    740    656 S   0.0   0.0   0:00.00 tail 

lano1106 commented 3 years ago

Here is a quick test program that attempts to capture the essence of my app's io_uring usage. My libev app crashes when completion events are processed. I assume that when it crashes, one thread is inside io_uring_enter() while another is processing the events. This is how libev does its magic inside its main loop function, ev_run:

  1. Add new fd to the fd set
  2. Call backend_poll (io_uring_enter)
  3. Process all cqes and put them in a standard format in a pending event list
  4. After having returned from backend_poll, traverse the pending event list and call the registered user callbacks (this is where my app crashes, when it does)

Attachment: test_io_uring_coredump.txt (rename to test_io_uring_coredump.cpp)

but with the test program, I haven't been able to reproduce my issue...

Is there any way to enable some kernel traces to find out what is stopping it from generating the core dump file?

lano1106 commented 3 years ago

I did find fs/coredump.c.

It can leave some traces when the kernel has problems generating the coredump, but I see nothing in dmesg...

The only difference that I see between epoll and io_uring is the kernel threads iou-mgr-xxx and iou-wrk-xxx...

Is it possible that the way those kernel threads get terminated during the crash is stopping the kernel from generating the dump?

This is starting to be beyond my skill set/knowledge at this point...

$ sudo dmesg | grep core
[    0.285670] watchdog: Disabling watchdog on nohz_full cores by default
[    0.291118] pinctrl core: initialized pinctrl subsystem
[    0.388245] usbcore: registered new interface driver usbfs
[    0.388245] usbcore: registered new interface driver hub
[    0.388245] usbcore: registered new device driver usb
[    0.388245] pps_core: LinuxPPS API ver. 1 registered
[    0.388245] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    0.730500] usbcore: registered new interface driver usbserial_generic
[    0.731814] intel_pmc_core INT33A1:00:  initialized
[    3.346239] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
[    3.780473] usbcore: registered new interface driver usbhid
[    3.780474] usbhid: USB HID core driver
[    4.031182] intel_rapl_common: Found RAPL domain core
[25066.617500] audit: type=1130 audit(1620742357.635:157): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@0-1029-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[25067.005283] audit: type=1131 audit(1620742358.025:158): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@0-1029-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

lano1106 commented 3 years ago

One detail worth mentioning: in https://github.com/axboe/liburing/issues/345, I describe the recent changes that I made to my io_uring usage.

In a first step, I was using stock libev, which only uses io_uring for polling fds (iow, only IORING_OP_POLL_ADD and IORING_OP_POLL_REMOVE).

In the second step, I modified libev to do async I/O with io_uring, using additional operations on top of the poll operations mentioned above.

It is only since I integrated the async I/O that the io_uring kernel threads appeared in my process's thread list, and it is also from that moment that core dumps stopped being generated properly... TBH, without the new step-2 io_uring code, the app was very stable and could run non-stop for weeks... It has been crashing only since last weekend because of the new code, and I am having a hard time nailing the few remaining bugs because I don't have the core dumps...

Finally, if it is a kernel side issue, it might be worth mentioning the kernel version that I am using. It is 5.12.2

isilence commented 3 years ago

Can you leave only polling but set the IOSQE_ASYNC flag? In the end, I guess it's related to the io_uring worker threads...

lano1106 commented 3 years ago

isilence, I think that I have unmasked you. I was looking at emails that I exchanged with the io_uring list, and I just noticed that one of the list participants has "silence" in his email address ;-)

Yes, I can modify my code to revert back to the polling-only io_uring usage... It should take me a few hours to do and report back the results.

Can you be more explicit about what I should do with the IOSQE_ASYNC flag? Is it meant to be used with IORING_OP_POLL_ADD/IORING_OP_POLL_REMOVE operations? What is it for?

lano1106 commented 3 years ago

oh, I have just read about IOSQE_ASYNC in the io_uring_enter() man page... A very useful flag... I already got something positive out of all this misfortune... I can certainly make good use of it now that I am aware of it...

super cool! the API that keeps surprising its users!

Most of my reads have to be ASYNC...

isilence commented 3 years ago

isilence, I think that I have unmasked you. I was looking at emails that I exchanged with the io_uring list, and I just noticed that one of the list participants has "silence" in his email address ;-)

Wasn't hiding, name is same in any case

Yes, I can modify my code to revert back to the polling-only io_uring usage... It should take me a few hours to do and report back the results.

Oh, that would be a waste of time; from what you've described, it's the worker threads that screw it up.

Can you be more explicit about what I should do with the IOSQE_ASYNC flag? Is it meant to be used with IORING_OP_POLL_ADD/IORING_OP_POLL_REMOVE operations? What is it for?

isilence commented 3 years ago

oh, I have just read about IOSQE_ASYNC in the io_uring_enter() man page... A very useful flag... I already got something positive out of all this misfortune... I can certainly make good use of it now that I am aware of it...

super cool! the API that keeps surprising its users!

Most of my reads have to be ASYNC...

I don't remember what's in the man page, but don't get fooled by that cunning document. io_uring will always try to do its job in an async manner, and there are lots of mechanisms for that, but when it runs out of better options it will fall back to worker threads, which is a slow path. And IOSQE_ASYNC instructs it to punt requests straight to the worker threads.

Poll requests are better off -- they are async by other means -- bdev read/write as well; buffered reads got a better async path; for sockets, it depends. Tl;dr: better not to use it unless you have a problem, e.g. with latency when io_uring_enter() takes too much time submitting requests.

lano1106 commented 3 years ago

Yes, I can modify my code to revert back to the polling-only io_uring usage... It should take me a few hours to do and report back the results.

Oh, that would be a waste of time; from what you've described, it's the worker threads that screw it up.

are there any tracepoints that I could collect to perhaps shed more light on the issue?

isilence commented 3 years ago

I wasn't following the thread, but sounds similar

http://mail.spinics.net/lists/io-uring/index.html#08075 http://mail.spinics.net/lists/io-uring/msg08395.html

isilence commented 3 years ago

Hmm, if it wasn't about x86 or even compat mode...

lano1106 commented 3 years ago

I have seen the gdb-related commit thread pass by... My understanding was that it makes the threads more gdb-friendly, not that core dump generation was impossible without the patch... thx for the pointers... I will read them carefully to see if they contain the solution to my problem...

Bottom line, if it solves my problem... This is 5.13 code, and we are many weeks away from its release. Correct?

isilence commented 3 years ago

I have seen the gdb-related commit thread pass by... My understanding was that it makes the threads more gdb-friendly, not that core dump generation was impossible without the patch... thx for the pointers... I will read them carefully to see if they contain the solution to my problem...

Not just friendlier -- gdb was not working with it, doing some strange stuff in that case.

Bottom line, if it solves my problem... This is 5.13 code, and we are many weeks away from its release. Correct?

Yes, almost a full release cycle, vaguely ~2 months (if it ever gets applied)

lano1106 commented 3 years ago

Ok. Worst-case scenario, it doesn't seem like a huge commit that would be a nightmare to apply to my 5.12 kernel source.

Like every good problem, this one seems to be multi-step. The first step is to get the core dump; next I'll start worrying about gdb...

Concerning IOSQE_ASYNC, I believe my TCP sockets would be good candidates for it. My epoll code was obviously manipulating non-blocking fds. On my first io_uring async run, before clearing their O_NONBLOCK flag, my read sqes were most of the time completed with EAGAIN.

iow, I have replaced the EPOLLIN polling with async reads...

which is a super good thing. My rx rings were overflowing (I see rings everywhere since I started using io_uring) because, as I discovered, the websocket lib code was processing a single WS frame on each event loop iteration before returning to polling for more input. I changed its code to process all the bytes returned by IORING_OP_READ at each iteration instead, which is clearly an efficiency improvement...

lano1106 commented 3 years ago

Yes, almost a full release cycle, vaguely ~2 months (if it ever gets applied)

you are correct. Now that 5.13-rc1 is released, I have read that the targeted release date is around the end of June...

lano1106 commented 3 years ago

Ok, I see what you meant when you said to use IOSQE_ASYNC as a last resort only. Using that flag created dozens of iou worker threads...

I do not like what I see in top... I will revert to not using IOSQE_ASYNC...

victorstewart commented 3 years ago

For what it's worth, I ran the test program on my machine, triggered the crash, and the core dumped.

root@clr-18a47c16c91c437c8ce854a37e6f5e26~ # uname -a
Linux clr-18a47c16c91c437c8ce854a37e6f5e26 5.12.1-1038.native #1 SMP Tue May 4 14:25:03 PDT 2021 x86_64 GNU/Linux

root@clr-18a47c16c91c437c8ce854a37e6f5e26~ # ./test_io_uring_coredump
Aborted (core dumped)

victorstewart commented 3 years ago

Maybe your code in the io_uring path is poisoning the stack? I.e., you're overwriting it somewhere, so when the kernel goes to produce the dump, bad things happen?

lano1106 commented 3 years ago

Hi Victor, same for me.

I haven't been able to reproduce the problem with the small test program. Something is missing in it. I believe what might be missing is actually using the io_uring instances a bit before crashing, so that they have a chance to actually create an io_worker thread.

Since the day I posted the test program, I found, by other means, the source of the problem that was making my app crash, and I have been working on fixing it.

I suspect this will become an io_uring classic: the cqe of an IORING_OP_READ with IOSQE_BUFFER_SELECT is processed after its associated fd has been closed and the per-fd user data structure needed to handle the result has already been destroyed. Therefore, I was leaking buffers, and eventually io_uring was returning ENOBUFS.

I searched for whether the cqe contains the bgid somewhere... It doesn't... I had to squeeze the bgid into user_data (and I found a way to make the same logical-OR error that I made a few days ago!) https://lore.kernel.org/io-uring/8992f5f989808798ad2666b0a3ef8ae8d777b7de.camel@trillion01.com/T/#mebdce963ac1764d079041aa4075fd00a510757c0

and I also had to store the buffer groups' base addresses in the library doing the low-level io_uring handling, to be able to return the buffer to io_uring when the described situation arises...

I wonder if the io_uring designers foresaw this scenario, and whether they considered adding functionality to help the user handle it, such as returning the bgid in the cqe and somehow allowing the user to query the io_uring API to retrieve the buffer group's base address.

That being said, it wasn't that terrible to do the right thing on the user-space side alone...

All that to say that I might now have some time to tweak the test program to make it do what it is designed to do...

isilence commented 3 years ago

btw, it can be a problem with systemd-coredump or some other layer. Did you look through journalctl?

lano1106 commented 3 years ago

good point. My problem is making systemd-coredump core dump as well ;-)

$ coredumpctl list
TIME                          PID  UID  GID SIG     COREFILE EXE                                     SIZE
Mon 2021-05-10 15:35:53 PDT 17192 1000 1000 SIGSEGV present  /usr/lib/systemd/systemd-coredump     601.0K
Mon 2021-05-10 21:29:40 PDT 17416 1000 1000 SIGSEGV present  /usr/lib/systemd/systemd-coredump     381.0K
Wed 2021-05-12 13:07:27 PDT  2414 1000 1000 SIGSEGV present  /usr/lib/systemd/systemd-coredump     452.8K
Wed 2021-05-12 20:31:43 PDT  7193 1000 1000 SIGSEGV present  /usr/lib/systemd/systemd-coredump     452.6K

but no, it isn't as trivial as that. I changed /proc/sys/kernel/core_pattern to work around it, and this is how I got the core dumps that I opened to document this ticket...

lano1106 commented 3 years ago

Fixed by the commit "coredump: Limit what can interrupt coredumps": https://lore.kernel.org/io-uring/274c764afe34abdf06abc6a7eaf753a2911c967b.camel@trillion01.com/T/#t