axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

[BUG] Spurious crash of io wq worker in io_ring_exit_work #1250

Closed CPestka closed 1 month ago

CPestka commented 1 month ago

I managed to trip this twice while running liburing's tests: I think once on 206650ff72b6ea4d76921f9c91ebfffd9902e6a0 and once on cb02d22dba0c6005c877c14a2cd2574a4b95462c, but I'm not entirely sure; both could have been on either commit. I reran the tests a number of times afterwards on both commits to narrow it down and to rerun with debug symbols, but didn't manage to trip it again. Let me know if you want the full dmesg; I have the full one from the second trip, but only a screenshot of the first.

Btw, is it fine to report this here or should stuff like this go to the mailing list?

[  571.383944] ------------[ cut here ]------------
[  571.383951] WARNING: CPU: 9 PID: 22626 at io_uring/io_uring.c:3093 io_ring_exit_work+0x115/0x2e0
[  571.383964] Modules linked in: snd_seq_dummy snd_hrtimer xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo rpcsec_gss_krb5 auth_rpcgss xt_addrtype nft_compat br_netfilter bridge stp llc nfsv4 nfs lockd grace netfs overlay nf_tables qrtr sunrpc binfmt_misc zfs(PO) spl(O) intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul snd_hda_codec_realtek polyval_clmulni snd_hda_codec_generic polyval_generic ghash_clmulni_intel snd_hda_codec_hdmi sha256_ssse3 snd_usb_audio sha1_ssse3 snd_usbmidi_lib snd_ump aesni_intel snd_seq_midi snd_hda_intel crypto_simd snd_seq_midi_event cryptd snd_rawmidi snd_intel_dspcfg snd_intel_sdw_acpi rapl snd_hda_codec snd_hda_core snd_hwdep nls_iso8859_1 snd_seq snd_pcm gigabyte_wmi wmi_bmof k10temp snd_seq_device i2c_piix4 uvcvideo videobuf2_vmalloc uvc cdc_acm videobuf2_memops videobuf2_v4l2 snd_timer videodev ccp snd videobuf2_common mc soundcore nvidia_uvm(POE) input_leds joydev
[  571.384045]  mac_hid sch_fq_codel msr parport_pc ppdev lp parport nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) hid_generic usbhid hid video nvme crc32_pclmul ahci xhci_pci igb nvme_core i2c_algo_bit dca xhci_pci_renesas libahci nvme_auth wmi
[  571.384081] CPU: 9 PID: 22626 Comm: kworker/u68:128 Tainted: P           OE      6.8.0-45-generic #45-Ubuntu
[  571.384085] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F5b 09/17/2019
[  571.384088] Workqueue: iou_exit io_ring_exit_work
[  571.384093] RIP: 0010:io_ring_exit_work+0x115/0x2e0
[  571.384097] Code: c1 b6 e8 3e a0 01 00 4c 89 ef e8 96 bb 00 00 48 89 df e8 0e cb ff ff 4c 89 7c 24 20 48 8b 05 c2 e7 be 01 48 39 44 24 18 79 0b <0f> 0b 48 c7 44 24 10 60 ea 00 00 48 8b 74 24 10 48 8b 7c 24 08 e8
[  571.384100] RSP: 0018:ffffa6c210627da0 EFLAGS: 00010297
[  571.384102] RAX: 00000001000424fe RBX: ffff8b6f0f376000 RCX: 0000000000000000
[  571.384105] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  571.384106] RBP: ffffa6c210627e40 R08: 0000000000000000 R09: 0000000000000000
[  571.384108] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b6f0f376530
[  571.384109] R13: 0000000000000000 R14: ffff8b6f0f376040 R15: 0000000000000000
[  571.384111] FS:  0000000000000000(0000) GS:ffff8b75c6480000(0000) knlGS:0000000000000000
[  571.384113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  571.384115] CR2: 000000c000765010 CR3: 000000016b924000 CR4: 0000000000350ef0
[  571.384117] Call Trace:
[  571.384120]  <TASK>
[  571.384125]  ? show_regs+0x6d/0x80
[  571.384131]  ? __warn+0x89/0x160
[  571.384136]  ? io_ring_exit_work+0x115/0x2e0
[  571.384140]  ? report_bug+0x17e/0x1b0
[  571.384145]  ? handle_bug+0x51/0xa0
[  571.384149]  ? exc_invalid_op+0x18/0x80
[  571.384152]  ? asm_exc_invalid_op+0x1b/0x20
[  571.384158]  ? io_ring_exit_work+0x115/0x2e0
[  571.384164]  process_one_work+0x16f/0x350
[  571.384170]  worker_thread+0x306/0x440
[  571.384174]  ? __pfx_worker_thread+0x10/0x10
[  571.384178]  kthread+0xf2/0x120
[  571.384181]  ? __pfx_kthread+0x10/0x10
[  571.384184]  ret_from_fork+0x47/0x70
[  571.384188]  ? __pfx_kthread+0x10/0x10
[  571.384190]  ret_from_fork_asm+0x1b/0x30
[  571.384196]  </TASK>
[  571.384197] ---[ end trace 0000000000000000 ]---

axboe commented 1 month ago

Unfortunately 6.8 is not a maintained stable kernel, so there's not much I can do with the report... If it hits in 6.1/6.6/6.10 or anything that's still a currently active kernel (see kernel.org), then we can do something about it.

Did you try and isolate which test actually caused this? It's the async ring exit that's throwing a warning after 60 seconds of trying to cancel IO. So unfortunately detached from the actual test case that caused this.
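
The exit behaviour described above (keep retrying cancellation, and warn once if the ring still has not drained after a deadline) can be modelled in userspace. The following is a minimal sketch, not the kernel's code: `ring_model`, `exit_result`, `try_cancel`, and `ring_exit_model` are hypothetical names, and all timing values are parameters rather than the kernel's actual constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a ring that still has requests pending at exit. */
struct ring_model {
    unsigned pending;        /* requests not yet cancelled or completed */
    unsigned cancel_per_try; /* requests one cancel pass can reap (0 = stuck) */
};

struct exit_result {
    unsigned long elapsed_ms; /* simulated time spent in the exit loop */
    bool warned;              /* whether the one-shot warning fired */
};

/* One cancel pass: reap up to cancel_per_try pending requests. */
static void try_cancel(struct ring_model *r)
{
    unsigned n = r->cancel_per_try < r->pending ? r->cancel_per_try
                                                : r->pending;
    r->pending -= n;
}

/*
 * Retry cancellation every poll_ms until nothing is pending. If more than
 * warn_after_ms elapses first, warn once and back off to slow_poll_ms.
 * give_up_ms only bounds this simulation; it is an assumption of the sketch.
 */
static struct exit_result ring_exit_model(struct ring_model *r,
                                          unsigned long poll_ms,
                                          unsigned long slow_poll_ms,
                                          unsigned long warn_after_ms,
                                          unsigned long give_up_ms)
{
    struct exit_result res = { 0, false };
    unsigned long interval = poll_ms;

    assert(poll_ms > 0 && slow_poll_ms > 0);

    while (res.elapsed_ms < give_up_ms) {
        try_cancel(r);
        if (r->pending == 0)
            break;
        if (!res.warned && res.elapsed_ms > warn_after_ms) {
            res.warned = true;       /* stands in for a one-shot kernel warning */
            interval = slow_poll_ms; /* little hope left, poll less often */
        }
        res.elapsed_ms += interval;
    }
    return res;
}
```

Calling `ring_exit_model(&r, 50, 60000, 60000, 600000)` on a model whose `cancel_per_try` is 0 sets `warned`, which illustrates why the warning appears long after, and detached from, whichever test left the request behind. The 60000 ms deadline echoes the 60 seconds mentioned above; the other values are arbitrary.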

CPestka commented 1 month ago

Unfortunately 6.8 is not a maintained stable kernel

Ah, my bad, I forgot about that. If I don't forget, I'll check tomorrow whether I can get it to trip on 6.10 or your tree.

Did you try and isolate which test actually caused this?

No, I did not really get to that. It tripped twice in a row with a reboot in between and then just did not trip again, so I did not get to disabling parts of the tests. The bundle send/recv tests and the getsetsock-cmd tests always time out on the kernel I used, though, which on second thought is a bit odd. I just checked, and at least the bundle feature is newer than 6.8, which I guess means the test should have been skipped or should have gracefully reported that the feature is not supported. I have not looked at the actual test yet, though.
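
One way a test can skip gracefully on a kernel that lacks a feature is to classify the result of its first probing request. The sketch below is only an illustration under assumptions: `test_status` and `classify_probe_result` are hypothetical names, not liburing's helpers, and the set of error codes treated as "unsupported" is a guess that would need checking against the feature being probed.

```c
#include <errno.h>

/* Hypothetical exit codes; 77 is the conventional automake "skipped" status. */
enum test_status { TEST_PASS = 0, TEST_FAIL = 1, TEST_SKIP = 77 };

/*
 * Map the result of a feature-probing request to a test status.
 * res follows the kernel convention: >= 0 on success, -errno on failure.
 */
static enum test_status classify_probe_result(int res)
{
    if (res >= 0)
        return TEST_PASS;
    /* Assumption: an older kernel rejects an unknown flag or opcode
     * with one of these errors, so the test is skipped, not failed. */
    if (res == -EINVAL || res == -EOPNOTSUPP || res == -ENOSYS)
        return TEST_SKIP;
    return TEST_FAIL;
}
```

A test would return `classify_probe_result(res)` from its first request instead of continuing into a receive that can only time out. Any other negative result, such as `-EIO`, is still reported as a failure.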

axboe commented 1 month ago

Yep, it triggers for me on 6.8 as well; I just tried the latest stable that was released. I have a good idea which commit fixed it, but it doesn't really matter that much, as 6.8 is dead anyway and no more releases will be made.

I do see the test failures you mention too. But at least we can fix those up :-)

axboe commented 1 month ago

Fixed up recvsend_bundle. socket-getsetsock-cmd.t is just correctly finding an issue in 6.8 that we fixed, so as far as failures go, that one should fail on 6.8, and so should linked-defer-close.

axboe commented 1 month ago

Closing this one.