NetworkBlockDevice / nbd

Network Block Device
GNU General Public License v2.0

NBD fails to reconnect if started in initramfs #95

Open jpf91 opened 5 years ago

jpf91 commented 5 years ago

Maybe this is more of a RHEL/dracut bug, but I'm not really sure how to find the root cause of this problem.

Using CentOS 7.6 (kernel 3.10.0-957.1.3.el7.x86_64, nbd 3.14) and booting with an NBD root filesystem and the nbd-client options -p -t10, nbd-client fails to reconnect after a network hiccup. If nbd-client is restarted after boot (i.e. nbd-client -d /dev/nbd0 && nbd-client ... -p -t10 /dev/nbd0), the newly started nbd-client recovers just fine from network failures.
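For illustration, the behaviour expected from persist mode (-p) is roughly a retry loop around the connect step. The following is a minimal Python sketch of that general idea, not nbd-client's actual code; `reconnect_with_retry`, its parameters, and the `connect` callable are hypothetical names chosen for this example.

```python
import time


def reconnect_with_retry(connect, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is exhausted.

    connect is any callable that returns a connection object or raises
    OSError (ConnectionRefusedError is a subclass of OSError).
    Returns a (connection, attempt_number) tuple on success.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect(), attempt
        except OSError:
            # Give up and re-raise the last error once the budget is spent.
            if attempt == max_attempts:
                raise
            # Otherwise wait before the next attempt.
            sleep(delay)
```

With this sketch, a connection that is refused twice and then accepted returns on the third attempt; if every attempt fails, the last error propagates to the caller.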

Adding -nofork and redirecting the nbd-client stderr output, I was able to capture the following output of an initramfs-started nbd-client:

CentOS Linux 7 (Core)
Kernel 3.10.0-957.1.3.el7.x86_64 on an x86_64

localhost login: [   58.382947] fuse init (API version 7.22)
[  151.366971] block nbd0: Receive control failed (result -104)
[  151.370055] block nbd0: shutting down socket
[  151.372247] block nbd0: queue cleared
[  151.373853] nbd,3371: Kernel call returned: 104 Reconnecting
[  151.395901] Error: Socket failed: Connection refused
[  151.395901] Exiting.
[  151.949925] e1000: ens33 NIC Link is Down
[  152.397085]  Reconnecting
[  157.989809] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[  157.992670] IPv6: ADDRCONF(NETDEV_CHANGE): ens33: link becomes ready
[  158.007023] block nbd0: Attempted send on closed socket
[  158.008705] blk_update_request: I/O error, dev nbd0, sector 18518264
[  158.010371] XFS (nbd0): metadata I/O error: block 0x11a90f8 ("xfs_trans_read_buf_map") error 5 numblks 8
[  158.012997] block nbd0: Attempted send on closed socket
[  158.014379] blk_update_request: I/O error, dev nbd0, sector 55372432
[  158.016040] XFS (nbd0): metadata I/O error: block 0x34cea90 ("xfs_trans_read_buf_map") error 5 numblks 8
[  158.041987] block nbd0: Attempted send on closed socket
[  158.043606] blk_update_request: I/O error, dev nbd0, sector 18518264
[  158.045309] XFS (nbd0): metadata I/O error: block 0x11a90f8 ("xfs_trans_read_buf_map") error 5 numblks 8
[  158.048266] block nbd0: Attempted send on closed socket
[  158.049547] blk_update_request: I/O error, dev nbd0, sector 55372432
[  158.051324] XFS (nbd0): metadata I/O error: block 0x34cea90 ("xfs_trans_read_buf_map") error 5 numblks 8
[  158.118247] block nbd0: Attempted send on closed socket
[  158.119665] blk_update_request: I/O error, dev nbd0, sector 197568
[  158.121388] XFS (nbd0): metadata I/O error: block 0x303c0 ("xfs_trans_read_buf_map") error 5 numblks 32
[  158.123793] XFS (nbd0): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[  158.125818] block nbd0: Attempted send on closed socket
[  158.127059] blk_update_request: I/O error, dev nbd0, sector 55372432
[  158.133341] XFS (nbd0): metadata I/O error: block 0x34cea90 ("xfs_trans_read_buf_map") error 5 numblks 8
[  158.133786] block nbd0: Attempted send on closed socket
[  158.133788] blk_update_request: I/O error, dev nbd0, sector 55694096
[  158.133793] block nbd0: Attempted send on closed socket
[  158.133793] blk_update_request: I/O error, dev nbd0, sector 55694096
[  158.133821] block nbd0: Attempted send on closed socket
[  158.133821] blk_update_request: I/O error, dev nbd0, sector 55694096
[  159.419970] Error: Socket failed: Connection refused
[  159.419970] Exiting.
[  160.422105]  Reconnecting
[  160.424333] Error: Socket failed: Connection refused
[  160.424333] Exiting.
[  161.428520]  Reconnecting
[  161.431024] Error: Socket failed: Connection refused
[  161.431024] Exiting.
[  162.433949]  Reconnecting
[  162.436331] Error: Socket failed: Connection refused
[  162.436331] Exiting.
[  163.438283]  Reconnecting
[  163.441123] Error: Socket failed: Connection refused
[  163.441123] Exiting.
[  164.443784]  Reconnecting
[  164.446080] Error: Socket failed: Connection refused
[  164.446080] Exiting.
[  165.447956]  Reconnecting
[  165.450821] Error: Socket failed: Connection refused
[  165.450821] Exiting.
[  166.454441]  Reconnecting
[  166.457184] Error: Socket failed: Connection refused
[  166.457184] Exiting.
[  167.460289]  Reconnecting
[  167.462858] Error: Socket failed: Connection refused
[  167.462858] Exiting.
[  168.464403]  Reconnecting
[  168.466996] Error: Socket failed: Connection refused
[  168.466996] Exiting.
[  169.469852]  Reconnecting
[  169.472117] Error: Socket failed: Connection refused
[  169.472117] Exiting.
[  170.479696]  Reconnecting
[  170.482395] Error: Socket failed: Connection refused
[  170.482395] Exiting.
[  171.492696]  Reconnecting
[  171.494984] Error: Socket failed: Connection refused
[  171.494984] Exiting.
[  172.505718]  Reconnecting
[  172.508425] Error: Socket failed: Connection refused
[  172.508425] Exiting.
[  173.518706]  Reconnecting
[  173.521055] Error: Socket failed: Connection refused
[  173.521055] Exiting.
[  174.531709]  Reconnecting
[  174.534078] Error: Socket failed: Connection refused
[  174.534078] Exiting.
[  175.544887]  Reconnecting
[  175.547474] Error: Socket failed: Connection refused
[  175.547474] Exiting.
[  176.557703]  Reconnecting
[  176.559988] Error: Socket failed: Connection refused
[  176.559988] Exiting.
[  177.570708]  Reconnecting
[  177.574317] Error: Socket failed: Connection refused
[  177.574317] Exiting.
[  178.583621]  Reconnecting
[  178.587565] Error: Socket failed: Connection refused
[  178.587565] Exiting.
[  179.596929]  Reconnecting
[  179.600292] Error: Socket failed: Connection refused
[  179.600292] Exiting.
[  180.609744]  Reconnecting
[  180.613501] Error: Socket failed: Connection refused
[  180.613501] Exiting.
[  181.622742]  Reconnecting
[  181.624653] Error: Socket failed: Connection refused
[  181.624653] Exiting.
[  182.635739]  Reconnecting
[  182.639482] Error: Socket failed: Connection refused
[  182.639482] Exiting.
[  183.648633]  Reconnecting
[  183.650941] Error: Socket failed: Connection refused
[  183.650941] Exiting.
[  184.661574]  Reconnecting
[  185.674957] Error: Socket failed: Connection refused
[  185.674957] Exiting
[  186.095634] e1000: ens33 NIC Link is Down
[  186.687704]  Reconnecting
[  190.117042] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[  193.714773] Error: Cannot open NBD: No such file or directory
[  193.714773] Exiting.

So it fails in https://github.com/NetworkBlockDevice/nbd/blob/master/nbd-client.c#L1317. I also tried to reproduce this using newer kernels; however, there the nbd-client was always stopped after the initramfs finished executing, so the situation was actually much worse.
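For anyone comparing their own capture against the log above, a small script can make the sequence easier to see. This is a minimal Python sketch that counts the recurring messages and records when each first appears; the function name `summarize` and the `PATTERNS` labels are hypothetical, and the substrings are taken from the log lines quoted in this report.

```python
import re
from collections import Counter

# Matches a kernel-style timestamp such as "[  151.373853]" and the message after it.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)$")

# Labels are illustrative; the substrings come from the log in this report.
PATTERNS = {
    "reconnect": "Reconnecting",
    "refused": "Connection refused",
    "closed_socket": "Attempted send on closed socket",
    "open_failed": "Cannot open NBD",
}


def summarize(log_text):
    """Return (counts, first_seen) for the known message types in log_text."""
    counts = Counter()
    first_seen = {}
    for raw in log_text.splitlines():
        match = LINE_RE.search(raw)
        if not match:
            continue  # skip lines without a timestamp
        timestamp = float(match.group(1))
        message = match.group(2)
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[label] += 1
                first_seen.setdefault(label, timestamp)
    return counts, first_seen
```

Calling `summarize()` on the captured text returns how often each message occurred and the timestamp of its first occurrence, which shows at a glance when the "Cannot open NBD" failure happened relative to the reconnect attempts.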

I think we'll try to use iSCSI for our root devices now (especially as the nbd behaviour is not particularly nice when a connection drops: we get lots of I/O errors until the connection is restored, which means most running programs will crash), but I still wanted to file this report as it may help others.

yoe commented 5 years ago

Can you please try to reproduce this using a more recent version of NBD? 3.14 is quite old, and I think I fixed some issues related to persist mode since then.

chabad360 commented 3 years ago

I've been able to consistently reproduce this issue, but without the first error line: my errors start with block nbd0: shutting down socket. The issue only showed up in version 3.21.