Open jpf91 opened 5 years ago
Can you please try to reproduce this using a more recent version of NBD? 3.14 is quite old, and I think I fixed some issues related to persist mode since then.
I've been able to consistently reproduce this issue, but without the first line of the error. My errors just start with `block nbd0: shutting down socket`. The issue only showed up in version 3.21.
Maybe this is more a RHEL/dracut bug, but I'm not really sure how to find the root cause of this problem.
Using CentOS 7.6 (3.10.0-957.1.3.el7.x86_64, nbd 3.14) and booting with an nbd root filesystem and the nbd-client options `-p -t10`, the nbd-client fails to reconnect after a network hiccup. If the nbd-client is ever restarted after boot (i.e. `nbd-client -d /dev/nbd0 && nbd-client ... -p -t10 /dev/nbd0`), the newly started nbd-client recovers just fine from network failures.
Adding `-nofork` and redirecting the nbd-client stderr output, I was able to capture the following output of an initramfs-started nbd-client:
So it fails in https://github.com/NetworkBlockDevice/nbd/blob/master/nbd-client.c#L1317. I also tried to reproduce this using newer kernels; however, there the nbd-client was always stopped after the initramfs finished executing, so the situation was actually much worse.
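For anyone trying to reproduce this, the two invocations described above can be sketched as a small dry-run script. It only prints the command lines and does not touch any device. `NBD_ARGS` is a hypothetical placeholder for the connection arguments that the report elides with `...`, and the log path in `NBD_LOG` is an assumption, not something taken from the report.

```sh
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# NBD_ARGS is a hypothetical placeholder for the elided connection arguments.
NBD_ARGS='<server-args>'
NBD_DEV='/dev/nbd0'
NBD_LOG='/run/nbd-client.log'   # assumed location for the captured stderr

# Command that detaches the device and starts a fresh persistent client
# (the restart that, per the report, recovers fine from network failures).
restart_cmd() {
    printf 'nbd-client -d %s && nbd-client %s -p -t10 %s\n' \
        "$NBD_DEV" "$NBD_ARGS" "$NBD_DEV"
}

# Command that keeps the client in the foreground and saves its stderr,
# as was done to capture the failure of the initramfs-started client.
debug_cmd() {
    printf 'nbd-client %s -p -t10 -nofork %s 2>%s\n' \
        "$NBD_ARGS" "$NBD_DEV" "$NBD_LOG"
}

restart_cmd
debug_cmd
```

Running the printed restart command against a live root device is risky, so it is best tried on a test machine first.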
I think we'll try to use iSCSI for our root devices now (especially as the nbd behaviour is not particularly nice when a connection drops: we get lots of I/O errors until the connection is restored, which means most running programs will crash), but I still wanted to file this report as it may help others.