dracutdevs / dracut

dracut the event driven initramfs infrastructure
https://github.com/dracutdevs/dracut/wiki
GNU General Public License v2.0

NBD disconnects during switch_root #1032

Closed chabad360 closed 3 years ago

chabad360 commented 3 years ago

Describe the bug Once switch_root starts, a series of errors indicates read failures on the device /dev/nbd0. Further investigation shows that nbd stops talking to the server (kernel: block nbd0: shutting down sockets). If I break before switch_root, /sysroot is mounted correctly and I can view files in it, but once I exit the shell, everything falls apart.

Distribution used Arch Linux (with a custom kernel)

Dracut version 051

Init system systemd

To Reproduce

Expected behavior Clean boot.

Additional context

rdsosreport.txt
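
For anyone skimming the report, a filter along these lines pulls out just the entries that mention nbd or the root switch. It is only a sketch: `nbd_report_lines` is a made-up helper name, and the default path assumes the report is still where dracut writes it in the emergency shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of dracut): print only the report lines that
# mention nbd, the root switch, or os-release.
# Assumption: the report lives at /run/initramfs/rdsosreport.txt unless a
# different path is passed as the first argument.
nbd_report_lines() {
    report=${1:-/run/initramfs/rdsosreport.txt}
    # 'switch.root' matches both "switch_root" and "switch root".
    grep -E -i 'nbd|switch.root|os-release' "$report"
}

# Usage, e.g. from the rd.break shell:
#   nbd_report_lines
#   nbd_report_lines /path/to/saved/rdsosreport.txt
```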

haraldh commented 3 years ago

rd.retry=2 is way too small. Please remove it completely.

haraldh commented 3 years ago

[   10.511859] rockpix systemctl[343]: Failed to switch root: Specified switch root path '/sysroot' does not seem to be an OS tree. os-release file is missing.

So, maybe your installation is missing /etc/os-release ??

chabad360 commented 3 years ago

rd.retry=2 is way too small. Please remove it completely.

I only stuck that in there so that dracut doesn't take a long time to decide the boot failed. I'm not sure why it still tries to mount /dev/nbd0 after it times out waiting for it. But I'll give it a shot with a value of 10.

[   10.511859] rockpix systemctl[343]: Failed to switch root: Specified switch root path '/sysroot' does not seem to be an OS tree. os-release file is missing.

So, maybe your installation is missing /etc/os-release ??

Inserting rd.break and inspecting the /sysroot prior to switch_root shows that /etc/os-release is there.

chabad360 commented 3 years ago

rd.retry=2 is way too small. Please remove it completely.

Didn't fix it...

chabad360 commented 3 years ago

If you'd like, I can upload the image I'm trying to boot from.

chabad360 commented 3 years ago

Well, part of my problem is solved: reverting to nbd version 3.20 seems to have partly fixed the issue. By partly, I mean that the boot now fails because nbd0 is already in use.

johannbg commented 3 years ago

Recent versions of nbd-client use the netlink interface to configure the NBD device, which has been having side effects, and that is most likely affecting us here as well. I think we need to update the module to reflect that.

johannbg commented 3 years ago

@chabad360 what happens if you add "-nonetlink" or "-L" as the last option on the netroot cmdline? If that does not work, can you add either of those to the nbd-client lines in nbdroot.sh, in the module directory? (Basically, the client needs to be started with this option as a potential workaround.)
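
To show where that last option would sit, here is a rough sketch that splits a netroot value into its fields. It assumes the field order given in dracut.cmdline(7), `nbd:<server>:<port/exportname>[:<fstype>[:<mountopts>[:<nbdopts>]]]`. The helper name `split_nbd_netroot` and the server, export, and filesystem in the example are made up for illustration, and whether nbdroot.sh hands the option to nbd-client unchanged is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper (not part of dracut): split a netroot=nbd:... value
# into its colon-separated fields.
# Assumed field order, from dracut.cmdline(7):
#   nbd:<server>:<port/exportname>[:<fstype>[:<mountopts>[:<nbdopts>]]]
# Limitation: bracketed IPv6 server addresses are not handled.
# Note: the variables set by read are global, as usual in POSIX sh.
split_nbd_netroot() {
    # IFS=: applies only to this read; empty fields between colons are kept,
    # and the last variable receives whatever remains of the line.
    IFS=: read -r proto server exportname fstype mountopts nbdopts <<EOF
$1
EOF
    printf 'server=%s export=%s fstype=%s mountopts=%s nbdopts=%s\n' \
        "$server" "$exportname" "$fstype" "$mountopts" "$nbdopts"
}

# Example values are placeholders. The mountopts field is left empty so that
# -nonetlink lands in the final nbdopts field.
split_nbd_netroot 'nbd:192.0.2.1:myexport:ext4::-nonetlink'
```

Leaving a field empty (the `::` above) keeps the later fields in position, which matters if the option should reach nbdopts rather than be read as mount options.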

chabad360 commented 3 years ago

Hmm, it seems I spoke too soon (it actually booted successfully once or twice). Despite adding -nonetlink, the system now breaks with Warning: /dev/root does not exist. I'm pretty sure it's not the fault of nbd-client, because this wasn't an issue with nbd 3.20 (well, it happened randomly).

chabad360 commented 3 years ago

It seems my issue was solved by changing the --service-type from forking to oneshot, and by changing my cmdline to:

ip=dhcp root=/dev/nbd0 netroot=nbd:192.168.254.20:kiosk:btrfs rootflags=rw,noatime,compress=lzo,ssd

chabad360 commented 3 years ago

Again, spoke too soon. My boot process now randomly fails with:

[ failed ] Unable to start nbd nbd0

I'm going to try to investigate tonight, see if I can figure out what's causing it.

Thanks a lot for your help so far.

johannbg commented 3 years ago

@chabad360 did you figure out why the unit failed to start the nbd service?

chabad360 commented 3 years ago

No... and I won't be able to look into it till next week.

If you want, I can upload the image that I'm booting and you can have a look around (nothing confidential on it).