Works for me. I have tried ZFS on root and ZFS on an encrypted root. I had similar errors to yours, but most of the time they turned out to be configuration issues on my side. Please show your config for disks/grub/zfs.
So, I just rebooted and can give some more information.
/etc/fstab:
# This is a generated file. Do not edit!
# Filesystems.
POOLbakemono/FS / zfs defaults 0 0
POOLbakemono /POOLbakemono zfs defaults 0 0
POOLbkmn-sata-1v2m /POOLbkmn-sata-1v2m zfs defaults 0 0
systemd-1 /boot autofs defaults 0 2
# Swap devices.
/dev/disk/by-uuid/ac0f9e39-368c-4e37-bfe7-17c7e08e2e43 none swap
zfs list outputs the following:
NAME USED AVAIL REFER MOUNTPOINT
POOLbakemono 227G 130G 96K /POOLbakemono
POOLbakemono/FS 12.1G 130G 12.1G legacy
POOLbakemono/VOLwin10C 66.0G 132G 63.8G -
POOLbakemono/VOLwin8.1C 149G 144G 134G -
POOLbkmn-sata-1v2m 1.40T 539G 96K /POOLbakemono-sata-1v2m
POOLbkmn-sata-1v2m/VOLtest 528G 749G 318G -
POOLbkmn-sata-1v2m/VOLwin8.1D 264G 551G 252G -
POOLbkmn-sata-1v2m/VOLwin8.1G 109G 594G 54.0G -
POOLbkmn-sata-1v2m/VOLwinE 528G 767G 300G -
@spinus: Not using grub. This machine boots through UEFI with gummiboot. The kernel thus sits in an EFI system partition. I've posted the rest above.
Those mountpoints sure sound bogus though. I had created those filesystems using -o mountpoint=none
as recommended in https://nixos.org/wiki/ZFS_on_NixOS#How_to_install_NixOS_on_a_ZFS_root_filesystem...
EDIT Right. It's just the pools themselves that seem to have a mountpoint set. Maybe I'm missing something, but I don't think that pools should require a mountpoint. :)
I think the problem here is that ZFS is not even finding your pool when you boot with a newer kernel:
importing root ZFS pool "POOLbakemono"...
cannot import 'POOLbakemono': no such pool available
This usually happens when the kernel cannot see your storage devices (e.g. due to missing drivers, due to a race condition with udev, etc...). Are you using exactly the same configuration with both kernels (3.18.x and 3.19.x), only changing the kernel version?
Can you try booting with kernel 3.19.x, but this time adding:
boot.kernelParams = [ "boot.shell_on_fail" ];
... to your configuration.nix? When mounting the root filesystem fails, you can use the emergency shell and see what's going on, e.g. you can try running zpool import POOLbakemono manually, see if the kernel can see your storage devices, etc.
Just as an FYI, I'm also using a ZFS root pool with kernel 4.1.x, with ZFS on top of an encrypted luks device, and also in UEFI mode (with gummiboot, currently). As you can see, it's supposed to work fine :-)
I'll report back when the machine becomes unutilized (and I can reboot without clamoring users) later today. Yes, the only things that change in configuration.nix are the digits in pkgs.linuxPackages_x_y. I literally just cursor over them in vim and press ^X and ^A as is appropriate :)
One more thing: the POOLbakemono pool currently resides on just one vdev, which is a partition on an NVMe device. Maybe something specific to NVMe changed in recent kernels?
I was able to sneak in a reboot. Here's what I found:
NVMe device nodes do show up in /dev/ - all of them:
/dev/nvme0
/dev/nvme0n1
/dev/nvme0n1p${PARTITIONNUMBER}
zpool import sort of works (but not really):
/ # zpool import POOLbakemono
cannot mount 'POOLbakemono': Input/output error
/ # zpool import POOLbakemono
cannot import 'POOLbakemono': a pool with that name already exists
use the form 'zpool import <pool | id> <newpool>' to give it a new name
After that, the pool seems to be online. Nevertheless, Input/output error is not something I'd like to ignore without knowing the cause.
Anything else I should try?
EDIT I've rebooted the machine to the old kernel for the time being.
EDIT Could this be a result of the bogus mountpoints defined for the pool itself?
EDIT I just tried to import my pools using the -N flag, and this just works without throwing any error. In other words:
/ # zpool import -N POOLbakemono
/ # zpool import -N POOLbkmn-sata-1v2m
/ #
I imported my other pool as well, even though it's not essential for booting. As one can see, no error is reported.
EDIT I've gone through my journal after an old-kernel boot and found a few things related to ZFS that may provide hints.
Nov 14 14:38:23 bakemono mount[818]: filesystem 'POOLbakemono' cannot be mounted using 'mount'.
Nov 14 14:38:23 bakemono mount[818]: Use 'zfs set mountpoint=legacy' or 'zfs mount POOLbakemono'.
Nov 14 14:38:23 bakemono mount[818]: See zfs(8) for more information.
Nov 14 14:38:23 bakemono zfs-import-POOLbkmn-sata-1v2m-start[937]: cannot open 'POOLbkmn-sata-1v2m': no such pool
Nov 14 14:38:25 bakemono mount[1144]: filesystem 'POOLbkmn-sata-1v2m' cannot be mounted using 'mount'.
Nov 14 14:38:25 bakemono mount[1144]: Use 'zfs set mountpoint=legacy' or 'zfs mount POOLbkmn-sata-1v2m'.
Nov 14 14:38:25 bakemono mount[1144]: See zfs(8) for more information.
Nov 14 14:38:25 bakemono systemd[1]: POOLbkmn\x2dsata\x2d1v2m.mount mount process exited, code=exited status=1
Posting a new comment because all that mountpoint jazz turns out to be a diversion from the main issue. I rebooted into a new kernel and decided to look into dmesg (why I didn't think of this earlier, I don't know) and found something interesting:
[ 2.487887] clocksource: Switched to clocksource tsc
[ 7.075273] nvme0n1: p1 p2 p3
[ 18.660568] random: nonblocking pool is initialized
This indicates that NVMe is initialized almost five seconds after almost everything else in that boot has finished. Of course, when I start a rescue shell and import the pool manually, everything looks fine, since five seconds is not that much time - for a human.
I found a commit in the kernel git repository that may be pertinent: https://github.com/torvalds/linux/commit/2e1d8448196ba85cd78a18723413a3c92aabe0f3
This commit has been in the kernel since 4.0-rc1, which lines up with when my problems started.
Sounds like we need to wait for the NVMe driver to finish enumerating all devices before starting the zpool import. This is not a problem unique to NixOS; other distros have needed to resolve it as well.
The ordering can be done with systemd dependencies.
I'll research how to have the NVMe driver raise an objection while it is enumerating, and lower it when the devices are available.
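As a rough, hypothetical sketch of what such ordering could look like in NixOS terms: the unit name below follows the zfs-import-* pattern visible in the journal above and the device unit is the one systemd derives from /dev/nvme0n1, but both names are illustrative, and none of this applies to the stage-1 root import, where systemd is not yet running.
systemd.services."zfs-import-POOLbakemono" = {
  # Illustrative only: delay the pool import until systemd has seen the
  # NVMe namespace device. This can only help for pools imported after
  # stage 1, not for the root pool mounted from the initrd.
  after = [ "dev-nvme0n1.device" ];
  requires = [ "dev-nvme0n1.device" ];
};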
All of this happens during stage 1, and AFAICS systemd does not yet run then. The stage 1 shell script could simply wait in a loop until the pool containing the root FS became available. A ten-second timeout wouldn't hurt, either. The latter seems to be what other distributions do when mount cannot proceed.
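A minimal sketch of that idea, assuming it were injected through boot.initrd.postDeviceCommands (the pool name is this machine's root pool; the ten-second budget is the one suggested above):
boot.initrd.postDeviceCommands = ''
  # Sketch only: poll until the root pool shows up as importable (or ~10s
  # pass), then let the normal stage-1 code do the actual import.
  tries=0
  while ! zpool import 2>/dev/null | grep -q 'POOLbakemono'; do
    tries=$((tries + 1))
    [ "$tries" -ge 10 ] && break
    sleep 1
  done
'';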
In an ideal world, one could wait for the relevant block device's device driver module to signal when it has completely finished probing. It may even be possible for NVMe, but I suspect that it's not possible to do that in the general case, because from udev's point of view devices can pop up and vanish all the time, making the notion of "finishing all processing" meaningless.
Another thought. If mounting the root (Z)FS were managed through ZFS (zfs set mountpoint=/ ${SOMEZFS}), as opposed to what is done currently with legacy mountpoints, one could probably simply wait for /mnt-root's mount to pop into existence, as ZFS does its own event processing.
EDIT But I suspect this may not be as easy, since the mountpoint shifts between stages (/mnt-root vs. /).
EDIT I also found two interesting NixOS tidbits surrounding the core issue of slowly initialized root FS block devices:
- one can exec /init to simply restart stage 1 from the beginning. No idea whether that's still safe.
- there is already zpool import code in stage 1, but there is an opportunity for reuse and/or refactoring.

Hrm. Is there no udev settle in stage 1?
/me wanders off to look.
There is a udevadm settle in stage 1. I'll harass the NVMe driver maintainers about what's supposed to happen here.
If that fails, we might hack a workaround and inject a sleep 10 from the zfs.nix module.
The fix recommended by the NVMe driver maintainers is to pass udev.event-timeout=120 rd.udev.event-timeout=120 on the kernel command line, which means that zfs.nix should probably inject that.
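For reference, expressed as ordinary user configuration rather than a change to zfs.nix, that recommendation would presumably look something like this:
boot.kernelParams = [
  # Timeouts recommended by the NVMe driver maintainers (in seconds);
  # the "rd."-prefixed variant is conventionally the one read during the
  # initrd phase.
  "udev.event-timeout=120"
  "rd.udev.event-timeout=120"
];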
I'm not familiar with using zfs with nvme, so I might be off base here. If the "udev" params are not helping (which I thought were for systemd), try adding "rootdelay=120" instead.
Thanks for the suggestions. I tried booting a new kernel with the first suggested set of arguments, then with the second singular parameter. Unfortunately, neither of these has any observable effect - I didn't see any delay during boot, and, of course, NVMe still added the needed device nodes way after zpool import was finished.
Is the first set really intended for systemd?
EDIT Looks like at least one of the parameters is being read here: https://github.com/systemd/systemd/blob/master/src/udev/udevd.c#L1360
EDIT My kernel command line does not even define root=. Is rootdelay= even effective in such a situation?
It's been a while since I hacked on the udev boot stuff, but udevadm trigger is supposed to get the device drivers to emit events, and then udevadm settle waits for all of the events to clear.
I wonder if the --trigger=add is right?
Let's raise this blocker from the dead:
According to https://github.com/zfsonlinux/zfs/issues/2199#issuecomment-38308916, udevadm settle does not really wait for uevents to have finished processing, even though that's exactly what it says on the tin. I'm not 100% sure, but since linux>=4.0, NVMe seems to show the same insidious behaviour USB has been exhibiting for years, in that they both indicate everything's done, only to pop up device nodes seconds later. Solutions to that seem to boil down to two basic approaches:
- retry zpool import in a loop until the root filesystem appears or some reasonable timeout (e.g., 30 seconds) elapses. Waiting for one second between attempts doesn't look very objectionable.
- wait for the devices to show up before importing. As https://github.com/zfsonlinux/zfs/issues/330 implies, this would actually need to wait for all constituent devices of a zpool to show up.
I really don't know what kernel devs think userspace is supposed to do in this case.
A variation on this problem (this time, with USB) is discussed in https://bugs.archlinux.org/task/11571. They go for the two workarounds described above. Here's where I first read that udevadm settle doesn't really do what one would expect.
Heh, I guess we can't please everyone. :)
The driver used to complete device discovery serially, blocking initialization until complete. Then user space software decided to kill the init process because it takes "too long" with enough devices present. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1191726 for the user space devs to explain why, and the conclusion was the kernel driver needs to discover NVMe faster. The only way to go faster is to parallelize, so the driver completes device "probe" before the storage is surfaced so that it can move on to the next controller.
You say USB has similar behavior and problems, but I assume other storage devices under the SCSI stack work correctly. I'm not sure if/how they're synchronizing with user space, but I will look into it.
Looks like SCSI uses an async schedule whereas NVMe uses a work queue. I'm guessing that's the key difference, based on how wait_for_device_probe() synchronizes.
@evujumenuk what's the status of this?
Well, since no one seems to know (or maybe, no one is inclined to tell) how userspace is supposed to wait for the kernel to finish probing all devices, the status is exactly as it was in November. I can try a nixos-rebuild --upgrade and a reboot of the machine in question at the next opportunity, but with exactly one kernel release since the last time I looked, and the changelog of that release not containing anything like 'undid that horrible NVMe async probe thing that we shall never again speak of' or somesuch ;), I'm not too hopeful that anything has changed, really. There's still the possibility that udev does the right thing now, but I didn't find anything to that effect in the systemd NEWS, either.
The machine continues to run 3.18.x, for now.
After spending a few more brain cycles on this problem, I think that a flag to zpool import that made it block until all the devices underlying the pool being imported become available would have resolved this nicely. Unfortunately, such a flag apparently does not exist and, given the intended semantics of importing, is unlikely to ever be implemented; import and export are meant to persistently {associate a pool with,dissociate a pool from} the system. The way we re-import pools on every startup would constitute a kind of "off-label use" of that functionality. Other distributions just autoimport everything at boot and give the users a few knobs to tweak the list of eligible pools.
In an ideal world, the init script would simply block until the root filesystem magically appeared, because zevent would receive udev events and automatically import a pool once its constituent devices showed up, subject to administratively defined {black,white}lists. However, a feature like that is only slated to be implemented for zfsonlinux 0.8.0 or even later.
zfs in nixpkgs could block importing at startup until a (configurable) list of vdevs has appeared completely. The administrator would need to keep this list updated across pool changes, which makes this approach a bit hacky. However, startup would not be delayed beyond what's necessary, and as long as the list is accurate, importing would always work. By default, the list of waited-on devices would be empty -- and behavior for existing users unchanged, as a result.
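Until something along those lines exists as a proper option, an administrator could approximate it by hand, roughly as follows (a sketch only: the device path is a placeholder for wherever the root pool's vdevs actually live, and the roughly 30-second budget is arbitrary):
boot.initrd.postDeviceCommands = ''
  # Sketch: wait for every device node the root pool is known to sit on
  # before stage 1 attempts the import. The list has to be kept in sync
  # with the pool layout by hand.
  for dev in /dev/nvme0n1p3; do   # placeholder: list all vdevs of the root pool here
    tries=0
    while [ ! -e "$dev" ] && [ "$tries" -lt 300 ]; do
      sleep 0.1
      tries=$((tries + 1))
    done
  done
'';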
Thoughts?
I have the same issue right now.
I don't have the issue on three physical machines (laptop, PC), but I do have it inside a VM on Xen.
If I boot the machine with the kernel param boot.shell_on_fail, log in, wait a moment, and then press Ctrl+D, it loads the system properly. So this might be the async/race condition mentioned above.
Same issue here. I can't add very much, but here's the workaround I'm using:
boot.initrd.postDeviceCommands = ''
  echo -n 'Waiting for NVMe'
  while [ ! -e /dev/nvme0n1 ]; do
    sleep 0.1
    echo -n .
  done
  echo
  zpool import -a
'';
There's no hook in-between udevadm settle and zpool import, otherwise that last line wouldn't be needed. By all appearances, it takes 3-5 seconds for nvme0n1 to appear after nvme0 appears.
A slightly more principled solution would be to retry the import until it works, with 0.1s in between or so. It's not like there's any point in continuing if root isn't there.
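A sketch of that variant, keeping everything else from the block above (the pool name is the root pool from this issue; the roughly 30-second cap is arbitrary):
boot.initrd.postDeviceCommands = ''
  # Sketch: retry importing the root pool every 0.1s instead of waiting
  # for one hard-coded device node; give up after roughly 30 seconds.
  tries=0
  until zpool import -N POOLbakemono 2>/dev/null; do
    tries=$((tries + 1))
    [ "$tries" -ge 300 ] && break
    sleep 0.1
  done
'';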
...sorry about the spam!
Since #16901 has been merged and backported to the stable release (thanks @Baughn!), this problem should no longer happen in most cases, so I'm closing. If you continue to experience this issue after updating, feel free to reopen (but keep in mind that it may take a few days until the channels are updated to include the fix).
When using any kernel strictly newer than pkgs.linuxPackages_3_19 as boot.kernelPackages, the entire configuration fails to mount the root-on-ZFS filesystem during stage 1. Here is a transcript of a failing boot:
(Yes, I use Hungarian notation for pool and volume names. Stop judging!)
Pressing the R and Enter keys results in a few dozen newlines being printed and a subsequent reboot. Simply pressing, e.g., just Enter results in the following:
I guess switch_root is being called with too few arguments.
I can boot just fine with 3.19.x and 3.18.x. I have reverted to using 3.18.x for now since 3.19.x has fallen out of nixpkgs.
I'd love to include the generated /etc/fstab; however, one of the disadvantages of 3.18.x prevents me from feeding any input to that machine's console. Right now, I am posting this from a virtual machine on that host that's had one of the host's USB controllers passed through to it via VFIO. The one which is still assigned to the host proper has had its driver die:
Which simply means that I'll need to reboot the entire machine to do something like cat /etc/fstab. (sigh)
So, yeah. Newer kernels won't mount the root (Z)FS, whereas older ones just do it :)
NixOS unstable channel revision 7ae05ed is being used.