NixOS / nixpkgs


Missing preLVM LUKS devices should not be a fatal error in initrd stage1 #322295

Open dfoxfranke opened 1 week ago

dfoxfranke commented 1 week ago

My system's storage is configured as dm-raid on LVM on LUKS. Each of my physical disks contains a LUKS partition whose plaintext is an LVM PV. All of these PVs are collected into a single VG, and then I use LVM RAID to create LVs (including one for the root filesystem) that span all PVs using RAID6. My configuration.nix looks like this:

boot.initrd.kernelModules = [ "dm-snapshot" "dm-raid" ];

boot.initrd.luks.devices = {
  "crypt0" = {
    device = "/dev/disk/by-uuid/57306d20-0cea-47f4-a5e4-7a51737829cf";
    preLVM = true;
    allowDiscards = true;
  };
  "crypt1" = {
    device = "/dev/disk/by-uuid/b50f7a53-c56c-4050-b093-8da80be683df";
    preLVM = true;
    allowDiscards = true;
  };

  # ... and five more stanzas similar to the above for crypt2 through crypt6.
};

fileSystems."/" = { 
  device = "/dev/disk/by-uuid/ecbaded6-6118-455d-b48a-24a493dc6631";
  fsType = "ext4";
};

fileSystems."/boot" = { 
  device = "/dev/disk/by-uuid/B2FC-DB33";
  fsType = "vfat";
};

Recently one of my drives was throwing errors in dmesg, so I shut the system down, removed that drive, replaced it with an unformatted spare, and powered the system back on, expecting to be able to boot from the degraded array and then initialize the replacement drive and add it to the array. Instead, I found that the initrd script refused to continue past stage 1 on account of the missing disk. In order to unwedge my system, I had to boot from rescue media, repair the RAID array, edit my configuration.nix to update it to the new LUKS volume's UUID, and rebuild the initrd. None of this should have been necessary, since if the initrd script had simply continued past the error, it could have mounted the root volume without any trouble.


MatthewCash commented 1 week ago

Have you tried adding nofail to your LUKS device's crypttabExtraOpts?

dfoxfranke commented 1 week ago

I didn't know until just now that crypttabExtraOpts existed, because it's declared with visible = false so it's excluded from documentation. But anyway, that won't work, because the init script is running commands such as

wait_target "device" /dev/disk/by-uuid/b50f7a53-c56c-4050-b093-8da80be683df || die "/dev/disk/by-uuid/b50f7a53-c56c-4050-b093-8da80be683df is unavailable"

to check that the device it's trying to luksOpen exists before it ever gets to invoking cryptsetup. Also, nofail is designed to never block and is only compatible with keyfiles, not with manual password entry; see https://github.com/systemd/systemd/issues/27321.