Works for me. I have tried ZFS on root and ZFS on an encrypted root. I had similar errors to yours, but most of the time they turned out to be configuration issues on my side. Please show your config for disks/grub/zfs.
So, I just rebooted and can give some more information.
/etc/fstab:
# This is a generated file. Do not edit!
# Filesystems.
POOLbakemono/FS / zfs defaults 0 0
POOLbakemono /POOLbakemono zfs defaults 0 0
POOLbkmn-sata-1v2m /POOLbkmn-sata-1v2m zfs defaults 0 0
systemd-1 /boot autofs defaults 0 2
# Swap devices.
/dev/disk/by-uuid/ac0f9e39-368c-4e37-bfe7-17c7e08e2e43 none swap
zfs list outputs the following:
NAME USED AVAIL REFER MOUNTPOINT
POOLbakemono 227G 130G 96K /POOLbakemono
POOLbakemono/FS 12.1G 130G 12.1G legacy
POOLbakemono/VOLwin10C 66.0G 132G 63.8G -
POOLbakemono/VOLwin8.1C 149G 144G 134G -
POOLbkmn-sata-1v2m 1.40T 539G 96K /POOLbakemono-sata-1v2m
POOLbkmn-sata-1v2m/VOLtest 528G 749G 318G -
POOLbkmn-sata-1v2m/VOLwin8.1D 264G 551G 252G -
POOLbkmn-sata-1v2m/VOLwin8.1G 109G 594G 54.0G -
POOLbkmn-sata-1v2m/VOLwinE 528G 767G 300G -
@spinus: Not using grub. This machine boots through UEFI with gummiboot. The kernel thus sits in an EFI system partition. I've posted the rest above.
Those mountpoints sure sound bogus though. I had created those filesystems using -o mountpoint=none
as recommended in https://nixos.org/wiki/ZFS_on_NixOS#How_to_install_NixOS_on_a_ZFS_root_filesystem...
EDIT Right. It's just the pools themselves that seem to have a mountpoint set. Maybe I'm missing something, but I don't think that pools should require a mountpoint. :)
I think the problem here is that ZFS is not even finding your pool when you boot with a newer kernel:
importing root ZFS pool "POOLbakemono"...
cannot import 'POOLbakemono': no such pool available
This usually happens when the kernel cannot see your storage devices (e.g. due to missing drivers, due to a race condition with udev, etc...). Are you using exactly the same configuration with both kernels (3.18.x and 3.19.x), only changing the kernel version?
Can you try booting with kernel 3.19.x, but this time adding:
boot.kernelParams = [ "boot.shell_on_fail" ];
... to your configuration.nix? When mounting the root filesystem fails, you can use the emergency shell and see what's going on, e.g. you can try running zpool import POOLbakemono manually, see if the kernel can see your storage devices, etc.
Just as an FYI, I'm also using a ZFS root pool with kernel 4.1.x, with ZFS on top of an encrypted luks device, and also in UEFI mode (with gummiboot, currently). As you can see, it's supposed to work fine :-)
I'll report back when the machine becomes unutilized (and I can reboot without clamoring users) later today. Yes, the only things that change in configuration.nix are the digits in pkgs.linuxPackages_x_y. I literally just cursor over them in vim and press ^X and ^A as is appropriate :)
One more thing: the POOLbakemono pool currently resides on just one vdev, which is a partition on an NVMe device. Maybe something specific to NVMe changed in recent kernels?
I was able to sneak in a reboot. Here's what I found:
NVMe device nodes do show up in /dev/ - all of them:
/dev/nvme0
/dev/nvme0n1
/dev/nvme0n1p${PARTITIONNUMBER}
zpool import sort of works (but not really):
/ # zpool import POOLbakemono
cannot mount 'POOLbakemono': Input/output error
/ # zpool import POOLbakemono
cannot import 'POOLbakemono': a pool with that name already exists
use the form 'zpool import <pool | id> <newpool>' to give it a new name
After that, the pool seems to be online. Nevertheless, Input/output error is not something I'd like to ignore without knowing the cause.
Anything else I should try?
EDIT I've rebooted the machine to the old kernel for the time being.
EDIT Could this be a result of the bogus mountpoints defined for the pool itself?
EDIT I just tried to import my pools using the -N flag, and this just works without throwing any error. In other words:
/ # zpool import -N POOLbakemono
/ # zpool import -N POOLbkmn-sata-1v2m
/ #
I imported my other pool as well, even though it's not essential for booting. As one can see, no error is reported.
EDIT I've gone through my journal after an old-kernel boot and found a few things related to ZFS that may provide hints.
Nov 14 14:38:23 bakemono mount[818]: filesystem 'POOLbakemono' cannot be mounted using 'mount'.
Nov 14 14:38:23 bakemono mount[818]: Use 'zfs set mountpoint=legacy' or 'zfs mount POOLbakemono'.
Nov 14 14:38:23 bakemono mount[818]: See zfs(8) for more information.
Nov 14 14:38:23 bakemono zfs-import-POOLbkmn-sata-1v2m-start[937]: cannot open 'POOLbkmn-sata-1v2m': no such pool
Nov 14 14:38:25 bakemono mount[1144]: filesystem 'POOLbkmn-sata-1v2m' cannot be mounted using 'mount'.
Nov 14 14:38:25 bakemono mount[1144]: Use 'zfs set mountpoint=legacy' or 'zfs mount POOLbkmn-sata-1v2m'.
Nov 14 14:38:25 bakemono mount[1144]: See zfs(8) for more information.
Nov 14 14:38:25 bakemono systemd[1]: POOLbkmn\x2dsata\x2d1v2m.mount mount process exited, code=exited status=1
Posting a new comment because all that mountpoint jazz turns out to be a diversion from the main issue. I rebooted into a new kernel and decided to look into dmesg (why I didn't think of this earlier, I don't know) and found something interesting:
[ 2.487887] clocksource: Switched to clocksource tsc
[ 7.075273] nvme0n1: p1 p2 p3
[ 18.660568] random: nonblocking pool is initialized
This indicates that NVMe is initialized almost five seconds after almost everything else in that boot has finished. Of course, when I start a rescue shell and import the pool manually, everything looks fine, since five seconds is not that much time - for a human.
I found a commit in the kernel git repository that may be pertinent: https://github.com/torvalds/linux/commit/2e1d8448196ba85cd78a18723413a3c92aabe0f3
This commit has been in the kernel since 4.0-rc1, which lines up with when my problems started.
Sounds like we need to wait for the NVMe driver to finish enumerating all devices before starting the zpool import. This is not a problem unique to NixOS; other distros have needed to resolve it as well.
The ordering can be done with systemd dependencies.
I'll research how to have the NVMe driver raise an objection while it is enumerating, and lower it when the devices are available.
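As a rough, hypothetical sketch of what such ordering could look like in NixOS terms: the unit name below follows the zfs-import-* pattern visible in the journal above and the device unit is the one systemd derives from /dev/nvme0n1, but both names are illustrative, and none of this applies to the stage-1 root import, where systemd is not yet running.
systemd.services."zfs-import-POOLbakemono" = {
  # Illustrative only: delay the pool import until systemd has seen the
  # NVMe namespace device. This can only help for pools imported after
  # stage 1, not for the root pool mounted from the initrd.
  after = [ "dev-nvme0n1.device" ];
  requires = [ "dev-nvme0n1.device" ];
};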
All of this happens during stage 1, and AFAICS systemd does not yet run then. The stage 1 shell script could simply wait in a loop until the pool containing the root FS became available. A ten-second timeout wouldn't hurt, either. The latter seems to be what other distributions do when mount cannot proceed.
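A minimal sketch of that idea, assuming it were injected through boot.initrd.postDeviceCommands (the pool name is this machine's root pool; the ten-second budget is the one suggested above):
boot.initrd.postDeviceCommands = ''
  # Sketch only: poll until the root pool shows up as importable (or ~10s
  # pass), then let the normal stage-1 code do the actual import.
  tries=0
  while ! zpool import 2>/dev/null | grep -q 'POOLbakemono'; do
    tries=$((tries + 1))
    [ "$tries" -ge 10 ] && break
    sleep 1
  done
'';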
In an ideal world, one could wait for the relevant block device's device driver module to signal when it has completely finished probing. It may even be possible for NVMe, but I suspect that it's not possible to do that in the general case, because from udev's point of view devices can pop up and vanish all the time, making the notion of "finishing all processing" meaningless.
Another thought. If mounting the root (Z)FS were managed through ZFS (zfs set mountpoint=/ ${SOMEZFS}), as opposed to what is done currently with legacy mountpoints, one could probably simply wait for /mnt-root's mount to pop into existence, as ZFS does its own event processing.
EDIT But I suspect this may not be as easy, since the mountpoint shifts between stages (/mnt-root vs. /).
EDIT I also found two interesting NixOS tidbits surrounding the core issue of slowly initialized root FS block devices:
- one can exec /init to simply restart stage 1 from the beginning. No idea whether that's still safe.
- there is already zpool import code in stage 1, but there is an opportunity for reuse and/or refactoring.

Hrm. Is there no udev settle in stage 1?
/me wanders off to look.
There is a udevadm settle in stage 1. I'll harass the NVMe driver maintainers about what's supposed to happen here.
If that fails, we might hack a workaround and inject a sleep 10 from the zfs.nix module.
The fix recommended by the NVMe driver maintainers is to pass udev.event-timeout=120 rd.udev.event-timeout=120 on the kernel command line, which means that zfs.nix should probably inject that.
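For reference, expressed as ordinary user configuration rather than a change to zfs.nix, that recommendation would presumably look something like this:
boot.kernelParams = [
  # Timeouts recommended by the NVMe driver maintainers (in seconds);
  # the "rd."-prefixed variant is conventionally the one read during the
  # initrd phase.
  "udev.event-timeout=120"
  "rd.udev.event-timeout=120"
];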
I'm not familiar with using zfs with nvme, so I might be off base here. If the "udev" params are not helping (which I thought were for systemd), try adding "rootdelay=120" instead.
Thanks for the suggestions. I tried booting a new kernel with the first suggested set of arguments, then with the second singular parameter. Unfortunately, neither of these has any observable effect - I didn't see any delay during boot, and, of course, NVMe still added the needed device nodes way after zpool import was finished.
Is the first set really intended for systemd?
EDIT Looks like at least one of the parameters is being read here: https://github.com/systemd/systemd/blob/master/src/udev/udevd.c#L1360
EDIT My kernel command line does not even define root=. Is rootdelay= even effective in such a situation?
It's been a while since I hacked on the udev boot stuff, but udevadm trigger is supposed to get the device drivers to emit events, and then udevadm settle waits for all of the events to clear.
I wonder if the --trigger=add is right?
Let's raise this blocker from the dead:
According to https://github.com/zfsonlinux/zfs/issues/2199#issuecomment-38308916, udevadm settle does not really wait for uevents to have finished processing, even though that's exactly what it says on the tin. I'm not 100% sure, but since linux>=4.0, NVMe seems to show the same insidious behaviour USB has been exhibiting for years, in that they both indicate everything's done, only to pop up device nodes seconds later. Solutions to that seem to boil down to two basic approaches:
- retry zpool import in a loop until the root filesystem appears or some reasonable timeout (e.g., 30 seconds) elapses. Waiting for one second between attempts doesn't look very objectionable.
- wait for the devices to show up before importing. As https://github.com/zfsonlinux/zfs/issues/330 implies, this would actually need to wait for all constituent devices of a zpool to show up.
I really don't know what kernel devs think userspace is supposed to do in this case.
A variation on this problem (this time, with USB) is discussed in https://bugs.archlinux.org/task/11571. They go for the two workarounds described above. Here's where I first read that udevadm settle doesn't really do what one would expect.
Heh, I guess we can't please everyone. :)
The driver used to complete device discovery serially, blocking initialization until complete. Then user space software decided to kill the init process because it takes "too long" with enough devices present. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1191726 for the user space devs to explain why, and the conclusion was the kernel driver needs to discover NVMe faster. The only way to go faster is to parallelize, so the driver completes device "probe" before the storage is surfaced so that it can move on to the next controller.
You say USB has similar behavior and problems, but I assume other storage devices under the SCSI stack work correctly. I'm not sure if/how they're synchronizing with user space, but I will look into it.
Looks like SCSI uses an async schedule whereas NVMe uses a work queue. I'm guessing that's the key difference, based on how wait_for_device_probe() synchronizes.
@evujumenuk what's the status of this?
Well, since no one seems to know (or maybe, no one is inclined to tell) how userspace is supposed to wait for the kernel to finish probing all devices, the status is exactly as it was in November. I can try a nixos-rebuild --upgrade and a reboot of the machine in question at the next opportunity, but with exactly one kernel release since the last time I looked, and the changelog of that release not containing anything like 'undid that horrible NVMe async probe thing that we shall never again speak of' or somesuch ;), I'm not too hopeful that anything has changed, really. There's still the possibility that udev does the right thing now, but I didn't find anything to that effect in the systemd NEWS, either.
The machine continues to run 3.18.x, for now.
After spending a few more brain cycles on this problem, I think that a flag to zpool import that made it block until all the devices underlying the pool being imported become available would have resolved this nicely. Unfortunately, such a flag apparently does not exist and, given the intended semantics of importing, is unlikely to ever be implemented; import and export are meant to persistently {associate a pool with,dissociate a pool from} the system. The way we re-import pools on every startup would constitute a kind of "off-label use" of that functionality. Other distributions just autoimport everything at boot and give the users a few knobs to tweak the list of eligible pools.
In an ideal world, the init script would simply block until the root filesystem magically appeared, because zevent would receive udev events and automatically import a pool once its constituent devices showed up, subject to administratively defined {black,white}lists. However, a feature like that is only slated to be implemented for zfsonlinux 0.8.0 or even later.
zfs in nixpkgs could block importing at startup until a (configurable) list of vdevs has appeared completely. The administrator would need to keep this list updated across pool changes, which makes this approach a bit hacky. However, startup would not be delayed beyond what's necessary, and as long as the list is accurate, importing would always work. By default, the list of waited-on devices would be empty -- and behavior for existing users unchanged, as a result.
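Until something along those lines exists as a proper option, an administrator could approximate it by hand, roughly as follows (a sketch only: the device path is a placeholder for wherever the root pool's vdevs actually live, and the roughly 30-second budget is arbitrary):
boot.initrd.postDeviceCommands = ''
  # Sketch: wait for every device node the root pool is known to sit on
  # before stage 1 attempts the import. The list has to be kept in sync
  # with the pool layout by hand.
  for dev in /dev/nvme0n1p3; do   # placeholder: list all vdevs of the root pool here
    tries=0
    while [ ! -e "$dev" ] && [ "$tries" -lt 300 ]; do
      sleep 0.1
      tries=$((tries + 1))
    done
  done
'';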
Thoughts?
I have the same issue right now.
I don't have the issue on three physical machines (laptop, PC), but I do have it inside a VM on Xen.
If I boot the machine with the kernel param boot.shell_on_fail, log in, wait a moment, and then press Ctrl+D, it loads the system properly. So this might be the async/race condition mentioned above.
Same issue here. I can't add very much, but here's the workaround I'm using:
boot.initrd.postDeviceCommands = ''
  echo -n 'Waiting for NVMe'
  while [ ! -e /dev/nvme0n1 ]; do
    sleep 0.1
    echo -n .
  done
  echo
  zpool import -a
'';
There's no hook in-between udevadm settle and zpool import, otherwise that last line wouldn't be needed. By all appearances, it takes 3-5 seconds for nvme0n1 to appear after nvme0 appears.
A slightly more principled solution would be to retry the import until it works, with 0.1s in between or so. It's not like there's any point in continuing if root isn't there.
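A sketch of that variant, keeping everything else from the block above (the pool name is the root pool from this issue; the roughly 30-second cap is arbitrary):
boot.initrd.postDeviceCommands = ''
  # Sketch: retry importing the root pool every 0.1s instead of waiting
  # for one hard-coded device node; give up after roughly 30 seconds.
  tries=0
  until zpool import -N POOLbakemono 2>/dev/null; do
    tries=$((tries + 1))
    [ "$tries" -ge 300 ] && break
    sleep 0.1
  done
'';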
...sorry about the spam!
Since #16901 has been merged and backported to the stable release (thanks @Baughn!), this problem should no longer happen in most cases, so I'm closing. If you continue to experience this issue after updating, feel free to reopen (but keep in mind that it may take a few days until the channels are updated to include the fix).
When using any kernel strictly newer than pkgs.linuxPackages_3_19 as boot.kernelPackages, the entire configuration fails to mount the root-on-ZFS filesystem during stage 1. Here is a transcript of a failing boot:
(Yes, I use Hungarian notation for pool and volume names. Stop judging!)
Pressing the R and Enter keys results in a few dozen newlines being printed and a subsequent reboot. Simply pressing, e.g., just Enter results in the following:
I guess switch_root is being called with too few arguments.
I can boot just fine with 3.19.x and 3.18.x. I have reverted to using 3.18.x for now since 3.19.x has fallen out of nixpkgs.
I'd love to include the generated /etc/fstab; however, one of the disadvantages of 3.18.x prevents me from feeding any input to that machine's console. Right now, I am posting this from a virtual machine on that host that's had one of the host's USB controllers passed through to it via VFIO. The one which is still assigned to the host proper has had its driver die:
Which simply means that I'll need to reboot the entire machine to do something like cat /etc/fstab. (sigh)
So, yeah. Newer kernels won't mount the root (Z)FS, whereas older ones just do it :)
NixOS unstable channel revision 7ae05ed is being used.