NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

Boot fails when importing a ZFS pool of LVM volumes with `Failed to find executable lvm: No such file or directory` #165032

Open jaen opened 2 years ago

jaen commented 2 years ago

Describe the bug

For some time now (this might coincide with the lvm2 upgrade to 2.03.15, but I haven't checked yet) one of my servers has had problems booting: it gets stuck at ZFS import. After some debugging I have pinpointed the cause to LVM: if I go into the rescue shell and run `vgchange -ay && exit`, the system boots successfully. In the verbose boot logs I can see the following messages:

Mar 20 23:05:53 ronin lvm[4242]: PV /dev/dm-12 online, VG ThunderbayCacheVolumeGroup is complete.

... quite a bit of udev stuff

Mar 20 23:05:53 ronin systemd-udevd[4164]: dm-12: Running command "/run/current-system/systemd/bin/systemd-run -r --no-block --property DefaultDependencies=no --unit lvm-activate-ThunderbayCacheVolumeGroup lvm vgchange -aay --nohints ThunderbayCacheVolumeGroup"
Mar 20 23:05:53 ronin systemd-udevd[4164]: dm-12: Starting '/run/current-system/systemd/bin/systemd-run -r --no-block --property DefaultDependencies=no --unit lvm-activate-ThunderbayCacheVolumeGroup lvm vgchange -aay --nohints ThunderbayCacheVolumeGroup'
Mar 20 23:05:53 ronin systemd-udevd[4164]: Successfully forked off '(spawn)' as PID 4245.
Mar 20 23:05:53 ronin systemd-udevd[4164]: dm-12: '/run/current-system/systemd/bin/systemd-run -r --no-block --property DefaultDependencies=no --unit lvm-activate-ThunderbayCacheVolumeGroup lvm vgchange -aay --nohints ThunderbayCacheVolumeGroup'(err) 'Failed to find executable lvm: No such file or directory'
Mar 20 23:05:53 ronin systemd-udevd[4164]: dm-12: Process '/run/current-system/systemd/bin/systemd-run -r --no-block --property DefaultDependencies=no --unit lvm-activate-ThunderbayCacheVolumeGroup lvm vgchange -aay --nohints ThunderbayCacheVolumeGroup' failed with exit code 1.
Mar 20 23:05:53 ronin systemd-udevd[4164]: dm-12: Command "/run/current-system/systemd/bin/systemd-run -r --no-block --property DefaultDependencies=no --unit lvm-activate-ThunderbayCacheVolumeGroup lvm vgchange -aay --nohints ThunderbayCacheVolumeGroup" returned 1 (error), ignoring.
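When the boot log is long, it can help to pull out only the activation units that hit this error. The following is a minimal sketch, assuming the log lines have the same shape as the ones above; the function name `failed_lvm_units` is made up for illustration, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```shell
# Reads journal text on stdin and prints the lvm-activate-* unit names
# from lines that report a missing executable.
# Assumption: each failing line carries both the "--unit <name>" argument
# and the "Failed to find executable" message, as in the log above.
failed_lvm_units() {
  grep -F 'Failed to find executable' \
    | grep -o -e '--unit lvm-activate-[^ ]*' \
    | sed 's/^--unit //' \
    | sort -u
}
```

It could be fed from the current boot's journal, for example with `journalctl -b | failed_lvm_units`.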

As you can see, udev fails to activate the volume group because the `lvm` executable seems to be missing. However, I can call `vgchange` without problems from the rescue shell, so it might not be as simple as the binary not being available.
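One possible reading, not confirmed anywhere in this report, is that the command is started with a bare `lvm` name, which has to be resolved through `PATH`, and that the environment udev runs it in has a different `PATH` from the rescue shell. The sketch below only illustrates that general mechanism. `fake-lvm`, `lookup`, and the temporary directory are hypothetical stand-ins, and no real LVM tooling is touched.

```shell
# Illustration only: a bare command name resolves through PATH, so the
# same name can be found in one environment and missing in another.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

mkdir "$demo_dir/bin"
printf '#!/bin/sh\necho "activated"\n' > "$demo_dir/bin/fake-lvm"
chmod +x "$demo_dir/bin/fake-lvm"

# lookup SEARCH_PATH NAME: report whether NAME resolves when PATH is SEARCH_PATH.
lookup() {
  ( PATH=$1; if command -v "$2" >/dev/null 2>&1; then echo found; else echo missing; fi )
}

lookup "$demo_dir/bin" fake-lvm   # directory is on PATH, like the rescue shell
lookup "/nonexistent" fake-lvm    # restricted PATH: the bare name cannot be resolved
"$demo_dir/bin/fake-lvm"          # an absolute path does not depend on PATH
```

If that is what happens here, the binary being present on disk and the lookup still failing would not contradict each other.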

If it helps, the machine's disk layout is as follows:

LVM comes into play here because the slog/L2ARC SSD is partitioned via LVM into three devices: two for mirrored slogs and one for L2ARC (let's disregard whether that's a smart thing to do; I did it mostly to learn how to set it up, and I had only one SSD to spare). This setup worked without a hitch for a while, but it started breaking some time ago.

Let me know which parts of my configuration you would like me to provide, or what other information would help with debugging this.

Steps To Reproduce

Good question… I imagine it would be something like this:

  1. have a ZFS pool inside LUKS devices decrypted via crypttab,
  2. with at least one of the devices partitioned via LVM,
  3. try to import the pool at boot,
  4. fail because the LVM volumes are not activated.

But I can't really work out what a minimal reproduction of this issue is without having to restart the server a lot while staying sane.

Expected behavior

The system should keep booting without a hitch, properly activating LVM volumes before the ZFS pool import.

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

╭─jaen@ronin ~
╰─$ nix-shell -p nix-info --run "nix-info -m"
this path will be fetched (0.00 MiB download, 0.00 MiB unpacked):
  /nix/store/rmhwi0jcya8f87gzk2jjdwv4hifmmbb4-nix-info
copying path '/nix/store/rmhwi0jcya8f87gzk2jjdwv4hifmmbb4-nix-info' from 'https://cache.nixos.org'...
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.28, NixOS, 22.05 (Quokka)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.8.0pre20220311_d532269`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

jaen commented 2 years ago

The generation from 04.01.2022 boots without problems. I'll see if I have made any relevant configuration change since then, but I don't think I've touched the setup of my pools for a while.

jaen commented 2 years ago

This no longer seems to be happening; maybe this fixed the issue: https://github.com/NixOS/nixpkgs/pull/168302/files#diff-3c570009a2c7ab89324c8b85e157992451fecd47f332e51c4dc8d4351b7c1540R44 Can anyone confirm whether this is a plausible fix?