axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License
2.83k stars 402 forks source link

`test/sqpoll-cancel-hang` silently segfault in child process because `mmap` returns `-EINVAL` #1134

Closed ammarfaizi2 closed 5 months ago

ammarfaizi2 commented 5 months ago

Kernel: 6.8.2-afs-2024-04-02

test/sqpoll-cancel-hang consistently segfaults but the test framework doesn't know about it since the segfault happens in a child process from a fork().

The cause seems to be mmap() returns -1 (MAP_FAILED) with errno = EINVAL, and the program tries to dereference address 0xffffffffffffffff + 0x10c (you can see the register %r8=0xffffffffffffffff being dereferenced below).

[1448153.270610] sqpoll-cancel-h[393886]: segfault at 10b ip 0000557062d3ed95 sp 00007fff1cec6330 error 4 in sqpoll-cancel-hang.t[557062d3e000+3000] likely on CPU 9 (core 9, socket 0)
[1448153.270623] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448172.164437] sqpoll-cancel-h[393930]: segfault at 10b ip 0000555555555d95 sp 00007fffffffe250 error 4 in sqpoll-cancel-hang.t[555555555000+3000] likely on CPU 10 (core 10, socket 0)
[1448172.164453] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448177.856890] sqpoll-cancel-h[393934]: segfault at 10b ip 0000555555555d95 sp 00007fffffffe250 error 4 in sqpoll-cancel-hang.t[555555555000+3000] likely on CPU 6 (core 6, socket 0)
[1448177.856905] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448183.866109] sqpoll-cancel-h[393938]: segfault at 10b ip 0000555555555d95 sp 00007fffffffe250 error 4 in sqpoll-cancel-hang.t[555555555000+3000] likely on CPU 15 (core 15, socket 0)
[1448183.866126] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448185.912122] sqpoll-cancel-h[393942]: segfault at 10b ip 0000555555555d95 sp 00007fffffffe250 error 4 in sqpoll-cancel-hang.t[555555555000+3000] likely on CPU 5 (core 5, socket 0)
[1448185.912135] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448192.821344] sqpoll-cancel-h[393946]: segfault at 10b ip 0000555555555d95 sp 00007fffffffe250 error 4 in sqpoll-cancel-hang.t[555555555000+3000] likely on CPU 11 (core 11, socket 0)
[1448192.821359] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448206.095029] sqpoll-cancel-h[393950]: segfault at 10b ip 0000555555555d95 sp 00007fffffffe250 error 4 in sqpoll-cancel-hang.t[555555555000+3000] likely on CPU 6 (core 6, socket 0)
[1448206.095046] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448212.429829] sqpoll-cancel-h[393955]: segfault at 10b ip 00005603863e9d95 sp 00007fff87ac6430 error 4 in sqpoll-cancel-hang.t[5603863e9000+3000] likely on CPU 10 (core 10, socket 0)
[1448212.429845] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448360.029028] sqpoll-cancel-h[393978]: segfault at 10b ip 0000561293442d95 sp 00007ffe5f608fd0 error 4 in sqpoll-cancel-hang.t[561293442000+3000] likely on CPU 15 (core 15, socket 0)
[1448360.029044] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
[1448365.700707] sqpoll-cancel-h[393983]: segfault at 10b ip 000055a305140d95 sp 00007ffd9da0ff90 error 4 in sqpoll-cancel-hang.t[55a305140000+3000] likely on CPU 10 (core 10, socket 0)
[1448365.700722] Code: 00 20 00 00 00 00 66 c7 04 25 a8 01 00 20 00 00 66 c7 04 25 aa 01 00 20 00 00 f3 ab 4c 8b 05 8a 52 00 00 48 8b 0d 8b 52 00 00 <41> 8b 80 0c 01 00 00 41 8b b8 08 01 00 00 c1 e0 04 44 8d 88 7f 01
gdb test/sqpoll-cancel-hang

# Inside gdb
set follow-fork-mode child
run
Thread 3.1 "sqpoll-cancel-h" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fad740 (LWP 394089)]
trigger_bug () at sqpoll-cancel-hang.c:133
133     syz_io_uring_submit(r[1], r[2], 0x20000180, 1);

[ Legend: Modified register | Code | Heap | Stack | String ]
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
$rax   : 0x0               
$rbx   : 0x3               
$rcx   : 0xffffffffffffffff
$rdx   : 0x1               
$rsp   : 0x00007fffffffe250  →  0x0000000000000000
$rbp   : 0x1               
$rsi   : 0x0000000020000180  →  0x000000000000010b
$rdi   : 0x00000000200001c0  →  0xffffffffffffffff
$rip   : 0x0000555555555d95  →  <main+629> mov 0x10c(%r8), %eax
$r8    : 0xffffffffffffffff
$r9    : 0x10000000        
$r10   : 0x8011            
$r11   : 0x246             
$r12   : 0x0               
$r13   : 0x0000555555555b20  →  <main+0> endbr64 
$r14   : 0x000055555555acc8  →  0x0000555555555ef0  →  <__do_global_dtors_aux+0> endbr64 
$r15   : 0x00007ffff7ffd040  →  0x00007ffff7ffe2e0  →  0x0000555555554000  →   jg 0x555555554047
$eflags: [zero CARRY parity ADJUST sign trap INTERRUPT direction overflow RESUME virtualx86 identification]
$cs: 0x33 $ss: 0x2b $ds: 0x00 $es: 0x00 $fs: 0x00 $gs: 0x00 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
0x00007fffffffe250│+0x0000: 0x0000000000000000   ← $rsp
0x00007fffffffe258│+0x0008: 0x8f8f49cc98574400
0x00007fffffffe260│+0x0010: 0x0000000000000000
0x00007fffffffe268│+0x0018: 0x0000000000000000
0x00007fffffffe270│+0x0020: 0x00007fffffffe388  →  0x00007fffffffe64a  →  "/tmp/liburing/test/sqpoll-cancel-hang.t"
0x00007fffffffe278│+0x0028: 0x00007ffff7c29d90  →  0xe80001b859e8c789
0x00007fffffffe280│+0x0030: 0x0000000000000000
0x00007fffffffe288│+0x0038: 0x0000555555555b20  →  <main+0> endbr64 
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
   0x555555555d85 <main+613>       rep    stos %eax, %es:(%rdi)
   0x555555555d87 <main+615>       mov    0x528a(%rip), %r8        # 0x55555555b018 <r+8>
   0x555555555d8e <main+622>       mov    0x528b(%rip), %rcx        # 0x55555555b020 <r+16>
 → 0x555555555d95 <main+629>       mov    0x10c(%r8), %eax
   0x555555555d9c <main+636>       mov    0x108(%r8), %edi
   0x555555555da3 <main+643>       shl    $0x4, %eax
   0x555555555da6 <main+646>       lea    0x17f(%rax), %r9d
   0x555555555dad <main+653>       and    $0xffffffc0, %r9d
   0x555555555db1 <main+657>       test   %edi, %edi
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── source:sqpoll-cancel-hang.c+84 ────
     79  {
     80      char* ring_ptr = (char*)a0;
     81      char* sqes_ptr = (char*)a1;
     82      char* sqe = (char*)a2;
     83      uint32_t sqes_index = (uint32_t)a3;
 →   84      uint32_t sq_ring_entries = *(uint32_t*)(ring_ptr + SQ_RING_ENTRIES_OFFSET);
     85      uint32_t cq_ring_entries = *(uint32_t*)(ring_ptr + CQ_RING_ENTRIES_OFFSET);
     86      uint32_t sq_array_off = (CQ_CQES_OFFSET + cq_ring_entries *
     87  SIZEOF_IO_URING_CQE + 63) & ~63;
     88      if (sq_ring_entries)
     89          sqes_index %= sq_ring_entries;
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "sqpoll-cancel-h", stopped 0x555555555d95 in trigger_bug (), reason: SIGSEGV
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x555555555d95 → trigger_bug()
[#1] 0x555555555d95 → main()
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
root@server-vie001:/tmp/liburing# strace -e mmap -ivf test/sqpoll-cancel-hang.t
[00007fc88a281cb7] mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc88a259000
[00007fc88a281cb7] mmap(NULL, 41839, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fc88a24e000
[00007fc88a281cb7] mmap(NULL, 2264656, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fc88a000000
[00007fc88a281cb7] mmap(0x7fc88a028000, 1658880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x7fc88a028000
[00007fc88a281cb7] mmap(0x7fc88a1bd000, 360448, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bd000) = 0x7fc88a1bd000
[00007fc88a281cb7] mmap(0x7fc88a216000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x215000) = 0x7fc88a216000
[00007fc88a281cb7] mmap(0x7fc88a21c000, 52816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fc88a21c000
[00007fc88a281cb7] mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc88a24b000
[00007fc88a11ea27] mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x20000000
strace: Process 394108 attached
[pid 394108] [00007fc88a11ea27] mmap(0x20400000, 1179712, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_POPULATE, 3, 0) = -1 EINVAL (Invalid argument)
[pid 394108] [00007fc88a11ea27] mmap(0x20ffd000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_POPULATE, 3, 0x10000000) = -1 EINVAL (Invalid argument)
[pid 394108] [0000558dfa93ed95] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x10b} ---
[pid 394108] [????????????????] +++ killed by SIGSEGV (core dumped) +++
[00007fffd4b289fc] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=394108, si_uid=0, si_status=SIGSEGV, si_utime=0, si_stime=0} ---
[????????????????] +++ exited with 0 +++
root@server-vie001:/tmp/liburing# 

See the last two mmap(s) return -EINVAL.

axboe commented 5 months ago

It's one of those awful syzbot based tests, and it's already disabled on some archs. We should probably just kill it, if it's still failing on some.

axboe commented 5 months ago

Killed