NetworkBlockDevice / nbd

Network Block Device
GNU General Public License v2.0

NBD: Client failover causes kernel crash #62

Closed mehulvora83 closed 6 years ago

mehulvora83 commented 7 years ago

I ran into a kernel crash while testing NBD client/server failover. Here is the stack dump I see on my Ubuntu 16.04 box.

[10554.029187] nbd: registered device at major 43
[10573.523556] EXT4-fs (nbd0): mounting ext2 file system using the ext4 subsystem
[10573.524366] EXT4-fs (nbd0): warning: mounting unchecked fs, running e2fsck is recommended
[10573.524500] EXT4-fs (nbd0): mounted filesystem without journal. Opts: (null)
[10591.278962] block nbd0: Receive control failed (result -512)
[10591.278971] block nbd0: pid 115995, nbd-client, got signal 9
[10591.278974] block nbd0: shutting down socket

[10638.646904] block nbd0: Attempted send on closed socket
[10638.646908] blk_update_request: I/O error, dev nbd0, sector 4632
[10638.646912] EXT4-fs warning (device nbd0): htree_dirblock_to_tree:958: inode #2: lblock 0: comm ls: error -5 reading directory block
[10662.102399] ------------[ cut here ]------------
[10662.102420] kernel BUG at /build/linux-0XAgc4/linux-4.4.0/fs/buffer.c:3005!
[10662.102427] invalid opcode: 0000 [#1] SMP
[10662.102434] Modules linked in: nbd ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs snd_hda_codec_hdmi binfmt_misc hp_wmi snd_hda_codec_realtek sparse_keymap snd_hda_codec_generic input_leds intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_intel coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm irqbypass snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq serio_raw snd_seq_device snd_timer sb_edac edac_core lpc_ich snd mei_me mei soundcore shpchp tpm_infineon 8250_fintek mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
[10662.102572] scsi_transport_iscsi parport_pc ppdev lp parport autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid nouveau crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 mxm_wmi lrw video gf128mul glue_helper ablk_helper i2c_algo_bit cryptd ttm drm_kms_helper syscopyarea sysfillrect e1000e sysimgblt psmouse fb_sys_fops ptp ahci drm pps_core libahci wmi fjes [last unloaded: nbd]
[10662.102673] CPU: 7 PID: 188844 Comm: umount Not tainted 4.4.0-78-generic #99-Ubuntu
[10662.102679] Hardware name: Hewlett-Packard HP Z440 Workstation/212B, BIOS M60 v02.31 12/14/2016
[10662.102686] task: ffff8807deb47000 ti: ffff8807d9e00000 task.ti: ffff8807d9e00000
[10662.102692] RIP: 0010:[] [] submit_bh_wbc+0x152/0x160
[10662.102706] RSP: 0018:ffff8807d9e03d40 EFLAGS: 00010246
[10662.102711] RAX: 0000000000000005 RBX: ffff88079bffbd00 RCX: 0000000000000000
[10662.102719] RDX: 0000000000000000 RSI: ffff88079bffbd00 RDI: 0000000000001411
[10662.103016] RBP: ffff8807d9e03d68 R08: 0000000000000000 R09: 0000000000000fff
[10662.103755] R10: 0000000000002d7c R11: 000000000000ef31 R12: 0000000000001411
[10662.104491] R13: 0000000000000008 R14: ffff8800c973a400 R15: ffff880802f83800
[10662.105227] FS: 00007f538a5ba840(0000) GS:ffff88080c7c0000(0000) knlGS:0000000000000000
[10662.105959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10662.106697] CR2: 0000000001a34878 CR3: 00000007deb1b000 CR4: 00000000001406e0
[10662.107431] Stack:
[10662.108161] ffff88079bffbd00 0000000000001411 0000000000000008 ffff8800c973a400
[10662.108902] ffff880802f83800 ffff8807d9e03d88 ffffffff812497bc ffffffff81f38d80
[10662.109639] ffff88079bffbd00 ffff8807d9e03dd0 ffffffff812bbf42 0000000000000034
[10662.110378] Call Trace:
[10662.111100] [<ffffffff812497bc>] sync_dirty_buffer+0x6c/0x100
[10662.111825] [<ffffffff812bbf42>] ext4_commit_super+0x1d2/0x290
[10662.112553] [<ffffffff812bccb1>] ext4_put_super+0xe1/0x390
[10662.113276] [<ffffffff812111ef>] generic_shutdown_super+0x6f/0x100
[10662.113988] [<ffffffff8121157c>] kill_block_super+0x2c/0xa0
[10662.114694] [<ffffffff812116d3>] deactivate_locked_super+0x43/0x70
[10662.115399] [<ffffffff81211bac>] deactivate_super+0x5c/0x60
[10662.116095] [<ffffffff8122fc0f>] cleanup_mnt+0x3f/0x90
[10662.116777] [<ffffffff8122fca2>] cleanup_mnt+0x12/0x20
[10662.117458] [<ffffffff8109f011>] task_work_run+0x81/0xa0
[10662.118138] [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
[10662.118805] [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[10662.119469] [<ffffffff81840b90>] int_ret_from_sys_call+0x25/0x8f
[10662.120121] Code: 44 89 ef e8 81 14 18 00 5b 31 c0 41 5c 41 5d 41 5e 41 5f 5d c3 40 f6 c7 01 0f 84 1c ff ff ff f0 80 63 01 f7 e9 12 ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 40 00 0f 1f 44 00 00 55 31
[10662.121487] RIP [<ffffffff81247a62>] submit_bh_wbc+0x152/0x160
[10662.122161] RSP <ffff8807d9e03d40>

I have both nbd-server and nbd-client running on the same system, and the issue can be reproduced with the following commands:

Server:
truncate -s 10G /mnt/nbddisk
mkfs.ext4 /mnt/nbddisk
nbd-server 127.0.0.1@9000 /mnt/nbddisk

Client:
modprobe nbd
nbd-client 127.0.0.1 9000 /dev/nbd0
mount /dev/nbd0 /mnt/
kill -9 <pid of nbd-client>

After killing nbd-client, mounting /dev/nbd0 again on a different folder fails with "/dev/nbd0 is already mounted or /mnt1/ busy". Unmounting /mnt then leads to the kernel crash above.
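The steps above can be collected into a single script. This is only a minimal sketch under some assumptions: it must run as root on a disposable test machine, `pkill` stands in for the manual `kill -9 <pid of nbd-client>`, and the `step` and `reproduce` helper names and the `RUN` variable are made up for illustration. By default it only prints the commands; nothing is executed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch of the reproduction steps from this report.
# WARNING: with RUN=1 this is expected to crash an affected kernel.
IMG=/mnt/nbddisk      # backing file, as in the report
DEV=/dev/nbd0         # NBD device, as in the report
ADDR=127.0.0.1
PORT=9000

# Hypothetical helper: print a command, and execute it only when RUN=1.
step() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

reproduce() {
    # Server side
    step truncate -s 10G "$IMG"
    step mkfs.ext4 "$IMG"
    step nbd-server "$ADDR@$PORT" "$IMG"
    # Client side
    step modprobe nbd
    step nbd-client "$ADDR" "$PORT" "$DEV"
    step mount "$DEV" /mnt/
    # Assumption: pkill replaces looking up the pid by hand.
    step pkill -9 -x nbd-client
    # Unmounting after the client is gone is what triggers the BUG.
    step umount /mnt
}

reproduce
```

Running it without `RUN=1` is a dry run that lists the eight commands in order, which is a cheap way to review them before executing anything.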

I found the thread below, which reports a similar crash. The thread concluded with suggestions, but I am not sure whether the fix was pushed to the mainline kernel. https://sourceforge.net/p/nbd/mailman/message/34486113/

Is there any way this can be fixed in the driver? I would be glad to help in verifying the fix if needed.

Thanks, Mehul.

ayanamist commented 6 years ago

https://lkml.org/lkml/2016/4/20/257 You can backport this patch, but in my testing it only prevents the kernel crash. The mount point still goes read-only and needs a umount followed by a mount; mount -o remount,rw does not work.

josefbacik commented 6 years ago

I fixed a bunch of these issues a few months ago, and the fixes are in 4.18. Can you give that a try and make sure the problem doesn't happen anymore?
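Before retesting, it can help to confirm that the machine is actually running 4.18 or later. The following is a rough sketch, not part of nbd: the function name `kernel_at_least_4_18` is hypothetical, and the check only compares the version string from `uname -r`, so it cannot tell whether a distribution kernel with a lower number has the fixes backported.

```shell
#!/bin/sh
# Sketch: succeed if a kernel release string is 4.18 or newer.
# Uses `uname -r` when no argument is given.
kernel_at_least_4_18() {
    release=${1:-$(uname -r)}
    major=${release%%.*}          # text before the first dot, e.g. "4"
    rest=${release#*.}            # text after the first dot, e.g. "4.0-78-generic"
    minor=${rest%%[!0-9]*}        # leading digits of the remainder, e.g. "4"
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 18 ]; }
}

if kernel_at_least_4_18; then
    echo "kernel $(uname -r) is 4.18 or newer"
else
    echo "kernel $(uname -r) is older than 4.18"
fi
```

The kernel in the original report, 4.4.0-78-generic, fails this check, so it would need an upgrade (or a backport) before the retest is meaningful.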

yoe commented 6 years ago

Feedback was requested but not provided, so closing this bug report as unreproducible.

If you can still reproduce this, feel free to reopen it with detailed instructions on how to reproduce.