NetworkBlockDevice / nbd

Network Block Device
GNU General Public License v2.0
459 stars 120 forks source link

"Buffer I/O error" regression between kernels 4.11 and 4.12 #57

Closed alkisg closed 6 years ago

alkisg commented 7 years ago

Using Ubuntu 16.04, nbd-server and client=1:3.13-1.

a) With mainline (vanilla) kernel 4.11.12 on the client, the commands below run fine without errors. b) Then I upgrade to mainline kernel 4.12.0, and I run again:

modprobe nbd nbd-client server-ip -N /opt/ltsp/i386 /dev/nbd5 dmesg

And I see the following errors:

[ 73.824873] nbd: registered device at major 43 [ 84.791001] nbd5: detected capacity change from 0 to 20936916992 [ 84.791071] block nbd5: Attempted send on invalid socket [ 84.791077] blk_update_request: I/O error, dev nbd5, sector 0 [ 84.791080] Buffer I/O error on dev nbd5, logical block 0, async page read <the 3 lines above repeated 10 times> [ 84.791132] block nbd5: Attempted send on invalid socket [ 84.791133] blk_update_request: I/O error, dev nbd5, sector 2 [ 84.791134] Buffer I/O error on dev nbd5, logical block 1, async page read [ 84.791140] ldm_validate_partition_table(): Disk read failed. [ 84.791175] Dev nbd5: unable to read RDB block 0 [ 84.791228] nbd5: unable to read partition table

I can reproduce this in many installations, real or VMs. Thanks!

alkisg commented 7 years ago

Ubuntu uses patched kernels with various backports, so the numbers were misleading. I updated the issue description to reflect the mainline (vanilla) numbers instead. Just for completeness, the issue happens: In Ubuntu's kernels: after 4.8.0-58 and before 4.10.0-14 In mainline kernels: after 4.11.12 and before 4.12.0

I also saw that to make things 100% reproducible, I needed to call udevadm settle, so my test case now is:

modprobe nbd udevadm settle nbd-client server-ip -N /opt/ltsp/i386 /dev/nbd5 udevadm settle dmesg | grep nbd

alkisg commented 7 years ago

I also tested on Debian Stretch:

yoe commented 7 years ago

Are you saying that the problem does not exist in the most recent kernels you could find? Or do I misunderstand you there?

alkisg commented 7 years ago

Hi Wouter, let me phrase it better,

the problem started with kernel 4.12 and is still happening in the most recent kernel I could find, which was 4.14-rc2.

yoe commented 7 years ago

Oh, okay then.

@josefbacik, any idea?

josefbacik commented 7 years ago

Oops, I'll take a look in the morning.

josefbacik commented 7 years ago

Oh actually I think this is my timeout patch that I fixed later, can you try Linus master?

alkisg commented 7 years ago

Hi Josef, I tried the latest vanilla kernel from Ubuntu's daily builds and the problem still happens there: http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/ cod/tip/daily/2017-10-05 42b76d0e6b1fe0fcb90e0ff6b4d053d50597b031

I.e. Linux master https://github.com/torvalds/linux/commit/42b76d0e6b1fe0fcb90e0ff6b4d053d50597b031

josefbacik commented 7 years ago

Alright got it nailed down, sorry about that, apparently all of my regression tests only do the netlink interface, save the one that checks that the ioctl and netlink interfaces behave with eachother. I've submitted the patch

[PATCH] nbd: don't set the device size until we're connected

to fix it and cc'ed stable so it'll make its way back to distro kernels. @yoe sorry, I thought I had updated the mailinglist but accidentally sent it to the old sf list. I've fixed my stuff so that won't happen again.

alkisg commented 7 years ago

Thanks a lot! I'll try to detect when it lands on daily builds, so that I can test it.

alkisg commented 6 years ago

I tested on Ubuntu 18.04 with the upcoming 4.15 kernel and it works fine there. Closing, thank you. :)

chenhaiq commented 6 years ago

I still can reproduce this problem in ubuntu 1804, kernel 4.15.0-36-generic: cmd to reproduce:

qemu-img create -f qcow2 sample.img 10G
modprobe nbd

udevadm settle
qemu-nbd -c /dev/nbd0 sample.img
udevadm settle
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.003028] block nbd0: NBD_DISCONNECT
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.003147] block nbd0: shutting down sockets
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.005229] nbd0: detected capacity change from 0 to 10737418240
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.005296] print_req_error: 2 callbacks suppressed
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.005297] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.006937] buffer_io_error: 2 callbacks suppressed
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.006939] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.009225] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.010670] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.012777] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.014259] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.015965] ldm_validate_partition_table(): Disk read failed.
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.015973] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.018006] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.020183] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.021638] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.023379] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.024763] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.026803] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.028193] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.030244] Dev nbd0: unable to read RDB block 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.031491] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.032871] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.034815] print_req_error: I/O error, dev nbd0, sector 0
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.036188] Buffer I/O error on dev nbd0, logical block 0, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.038083] print_req_error: I/O error, dev nbd0, sector 24
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.039485] Buffer I/O error on dev nbd0, logical block 3, async page read
Oct 15 11:24:21 i-uvwxiqu9 kernel: [   44.041432]  nbd0: unable to read partition table
bsdkurt commented 5 years ago

@chenhaiq I also see this problem. It is a bug in the nbd kernel module in nbd_bdev_reset() where it sets the size to 0 but the capacity is unchanged, then it calls blkdev_reread_part(). blkdev_reread_part() detects a capacity change and attempts to read the partitions on the nbd device shutting down, causing I/O errors and the unable to read partition table message.

This was fixed in 4.18 in this commit: https://github.com/torvalds/linux/commit/fe1f9e6659ca6124f500a0f829202c7c902fab0c#diff-bc9273bcb259fef182ae607a1d06a142

@josefbacik Can this fix be backported to 4.14 stable?