Race condition on disconnections immediately after connections

alkisg commented 6 years ago

Hi, the following code exposes some race condition in disconnections:

# Connect and immediately disconnect nine devices, all in parallel.
for i in $(seq 1 9); do
(
    nbd-client server -N /opt/ltsp/i386 "/dev/nbd$i"
    nbd-client -d "/dev/nbd$i"
) &
done

After the code runs (sometimes 2-3 runs are needed), some of the nbd-client instances are still running in a hung state, preventing further nbd-client connections, reconnections and disconnections, blocking system shutdown, etc.
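
To see which devices were left behind after such a run, something like the following could be used. It is only a sketch: it relies on `nbd-client -c`, which exits 0 and prints the client PID when a device is connected, and the `NBD_CLIENT` variable and `list_connected` function are names made up for this example.

```shell
#!/bin/sh
# Sketch: report nbd devices that still look connected after the test run.
# NBD_CLIENT and list_connected are hypothetical names for this example.
NBD_CLIENT=${NBD_CLIENT:-nbd-client}

list_connected() {
    # Arguments: device paths to check.
    for dev in "$@"; do
        # "nbd-client -c" exits 0 and prints the owning PID if connected.
        if pid=$("$NBD_CLIENT" -c "$dev" 2>/dev/null); then
            echo "$dev still connected (pid $pid)"
        fi
    done
}
```

Calling `list_connected /dev/nbd1 /dev/nbd2` (and so on for the devices used in the loop) once the subshells have returned would print one line per device that is still attached.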

This affects us in LTSP where we have a "connect, check if there's a newer version of the image, disconnect" logic, and it sometimes causes issues due to the aforementioned race condition.
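
The cycle described above might be sketched as follows. This is an illustration under assumptions, not LTSP's actual code: `probe_image`, `check_image` and `NBD_CLIENT` are hypothetical names, and the check itself is left as a placeholder.

```shell
#!/bin/sh
# Sketch of a "connect, check, disconnect" cycle for a single device.
# check_image is a hypothetical placeholder, not LTSP's actual check.
NBD_CLIENT=${NBD_CLIENT:-nbd-client}

check_image() {
    # Placeholder: inspect the connected device "$1" here.
    # Returns 0 only so that the sketch runs end to end.
    return 0
}

probe_image() {
    # Arguments: server, export name, device path.
    server=$1 name=$2 dev=$3
    "$NBD_CLIENT" "$server" -N "$name" "$dev" || return 1
    check_image "$dev"
    status=$?
    # Always attempt the disconnect, even if the check failed.
    "$NBD_CLIENT" -d "$dev" || return 1
    return "$status"
}
```

With the values from the loop above, a call would look like `probe_image server /opt/ltsp/i386 /dev/nbd1`. The disconnect follows the connect almost immediately, which is the same pattern that triggers the problem.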

Some of the errors displayed in dmesg:

[ 563.099196] block nbd6: NBD_DISCONNECT
[ 563.099477] block nbd6: shutting down socket
[ 563.099503] blk_update_request: I/O error, dev nbd6, sector 40892400
[ 563.099587] block nbd6: Receive control failed (result -104)
[ 563.103340] block nbd7: NBD_DISCONNECT
[ 563.103457] block nbd7: shutting down socket
[ 563.103515] blk_update_request: I/O error, dev nbd7, sector 40892024
[ 563.103805] BUG: unable to handle kernel NULL pointer dereference at 000000b8
[ 563.103808] IP: [] nbd_ioctl+0x873/0xa03 [nbd]
[ 563.103821] pdpt = 000000002948c001 pde = 0000000000000000
[ 563.103825] Oops: 0000 [#1] SMP
[ 563.103826] Modules linked in: nbd cpufreq_conservative cpufreq_powersave cpufreq_userspace evdev crc32_pclmul snd_intel8x0 snd_ac97_codec ac97_bus intel_rapl_perf snd_pcm snd_timer snd joydev pcspkr serio_raw ac soundcore sg battery video button parport_pc ppdev lp parport ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache hid_generic usbhid hid sr_mod cdrom sd_mod ata_generic crc32c_intel aesni_intel xts aes_i586 ohci_pci lrw ehci_pci gf128mul ablk_helper cryptd ohci_hcd ehci_hcd psmouse ahci usbcore libahci usb_common ata_piix i2c_piix4 e1000 libata scsi_mod
[ 563.103855] CPU: 0 PID: 962 Comm: nbd-client Not tainted 4.9.0-4-686-pae #1 Debian 4.9.51-1
[ 563.103855] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 563.103857] task: ee2e6040 task.stack: f483e000
[ 563.103858] EIP: 0060:[] EFLAGS: 00010206 CPU: 0
[ 563.103860] EIP is at nbd_ioctl+0x873/0xa03 [nbd]
[ 563.103861] EAX: 0000002d EBX: 00001000 ECX: 00001000 EDX: 00000000
[ 563.103862] ESI: 00001000 EDI: 00000000 EBP: f483fe58 ESP: f483fdf8
[ 563.103863] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 563.103864] CR0: 80050033 CR2: 000000b8 CR3: 2e0e5860 CR4: 000406f0
[ 563.103867] Stack:
[ 563.103868] e9621000 00001000 00000100 00000001 f5495c00 ee396900 ee3969e8 f499449c
[ 563.103871] f4686400 00001000 f4994484 e94fc000 00000000 0000000f f4994428 df1a58d4
[ 563.103874] 98664467 00000000 00000069 00000000 d50cd4a7 f4686400 0006001f f85d4ad0
[ 563.103877] Call Trace:
[ 563.103890] [] ? page_add_new_anon_rmap+0x64/0xa0
[ 563.103892] [] ? nbd_queue_rq+0x120/0x120 [nbd]
[ 563.103896] [] ? blkdev_ioctl+0x25e/0xa90
[ 563.103898] [] ? do_wp_page+0x134/0x7a0
[ 563.103902] [] ? block_ioctl+0x3c/0x50
[ 563.103904] [] ? blkdev_fallocate+0x2c0/0x2c0
[ 563.103906] [] ? do_vfs_ioctl+0x91/0x720
[ 563.103907] [] ? handle_mm_fault+0x902/0xf40
[ 563.103910] [] ? __raw_callee_save___pv_queued_spin_unlock+0x6/0x10
[ 563.103912] [] ? SyS_ioctl+0x60/0x70
[ 563.103914] [] ? do_fast_syscall_32+0x8a/0x150
[ 563.103918] [] ? sysenter_past_esp+0x47/0x75
[ 563.103919] Code: 8b 45 cc 39 f3 8b 40 54 89 45 d0 0f 87 b6 00 00 00 8d b4 26 00 00 00 00 85 db 0f 84 e0 fe ff ff 8b 45 d4 8b 55 d0 89 f1 8d 04 40 <8b> 54 82 04 89 d0 29 f8 39 f3 0f 46 cb 39 c8 0f 47 c1 01 c7 29
[ 563.103940] EIP: []
[ 563.103942] nbd_ioctl+0x873/0xa03 [nbd]
[ 563.103943] SS:ESP 0068:f483fdf8
[ 563.103943] CR2: 00000000000000b8
[ 563.103946] ---[ end trace f2a60801a15f8bb7 ]---
[ 563.104109] block nbd7: Attempted send on closed socket
[ 563.104111] blk_update_request: I/O error, dev nbd7, sector 40892024
[ 563.104194] block nbd7: Attempted send on closed socket
[ 563.104196] blk_update_request: I/O error, dev nbd7, sector 40892024
[ 563.104198] Buffer I/O error on dev nbd7, logical block 20446012, async page read
[ 563.104201] block nbd7: Attempted send on closed socket

josefbacik commented 6 years ago

Sorry I need to check my notifications more often, I'll try and reproduce this today.

Natureshadow commented 6 years ago

> Sorry I need to check my notifications more often, I'll try and reproduce this today.

So ☺?

Natureshadow commented 6 years ago

Any news?

Natureshadow commented 6 years ago

Is this how you normally handle bug reports?

abligh commented 6 years ago

You can always apply for a full refund!

More constructively, it's worth pointing out that this GitHub project is for the userspace NBD components, and a kernel oops is by definition a kernel problem. The code causing the problem is not within this GitHub project. Josef (who is a volunteer like the rest of us) is, however, the kernel maintainer, but using the kernel mailing list and reporting as per https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html is a more efficient way to reach the right people.

Natureshadow commented 6 years ago

> More constructively, it's worth pointing out that this GitHub project is for the userspace NBD components, and a kernel oops is by definition a kernel problem. The code causing the problem is not within this GitHub project. Josef (who is a volunteer like the rest of us) is, however, the kernel maintainer, but using the kernel mailing list and reporting as per https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html is a more efficient way to reach the right people.

Thank you!

I am in no way saying you have to fix the problem immediately. The reason for my prodding is that the initial reaction conveyed the message that the problem was reported at the correct place, and that it would be looked into shortly. As a result, other efforts, like the one you just suggested, were not made, in order to avoid duplicating work.

Thanks for your clarification!

josefbacik commented 6 years ago

Sorry I fixed these problems and forgot to report back. The panic and such shouldn't happen anymore, and I redid my torture test to verify that it was ok. Let me know if you can still reproduce with a modern kernel.

yoe commented 6 years ago

Since Josef suggests this should have been fixed, I'm going to close this for now. If it does occur again, feel free to reopen.

NetworkBlockDevice / nbd

Race condition on disconnections immediately after connections #59