dasJ / sd-zfs

Compatibility between systemd and ZFS roots
MIT License

Zpool import first fails then succeeds after typing Ctrl + D #26

Open hadrienk opened 6 years ago

hadrienk commented 6 years ago

Hi, thanks for sharing your work. I am trying to create a minimal initrd. I configured the hooks as follows:

HOOKS=(base udev block systemd sd-plymouth autodetect modconf keyboard keymap sd-zfs)

I am using rEFInd with this menu entry:

menuentry "Arch Linux (ck-surface4)" {
    icon     /EFI/refind/icons/os_arch.png
    loader   vmlinuz-linux-ck-surface4
    options  "initrd=intel-ucode.img initrd=initramfs-linux-ck-surface4-minimal.img rw root=zfs:zroot/root/default zfs_wait=30"
    submenuentry "Boot using default initramfs" {
        initrd initramfs-linux-ck-surface4.img
    }
    submenuentry "Boot using fallback initramfs" {
        initrd initramfs-linux-ck-surface4-fallback.img
        add_options "break=postmount"
    }
    submenuentry "Boot to terminal" {
        add_options "systemd.unit=multi-user.target"
    }
}

When booting, the zpool import first fails. When I press Ctrl + D, it seems to try again and then boots normally. Any idea what I did wrong?

dasJ commented 6 years ago

Are there any relevant systemd messages around it? You should be able to see them from your running system with journalctl -b

kerberizer commented 6 years ago

I'm seeing the same issue on one system. zpool complains about "no such pool or dataset", but it does succeed importing the pool when the zfs-import-cache service is run from the shell after Ctrl+D. I suspect a timing problem, probably related to #25: perhaps the devices are not yet properly initialized when the import cache service is run for the first time. It's an important system, so unfortunately I can't make experiments at will, but if I have new information, I'll report it.

kerberizer commented 6 years ago

The logs seem to confirm my suspicions:

Sep 06 15:17:09 archlinux systemd[1]: Started udev Wait for Complete Device Initialization.
Sep 06 15:17:09 archlinux systemd[1]: Reached target System Initialization.
Sep 06 15:17:09 archlinux systemd[1]: Reached target Basic System.
Sep 06 15:17:09 archlinux systemd[1]: System is tainted: var-run-bad
Sep 06 15:17:09 archlinux systemd[1]: Starting Import ZFS pools by cache file...
Sep 06 15:17:09 archlinux kernel: spl: loading out-of-tree module taints kernel.
Sep 06 15:17:09 archlinux kernel: icp: module license 'CDDL' taints kernel.
Sep 06 15:17:09 archlinux kernel: Disabling lock debugging due to kernel taint
Sep 06 15:17:09 archlinux kernel: usb 2-12: new high-speed USB device number 2 using xhci_hcd
Sep 06 15:17:09 archlinux kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: usb 1-1: new high-speed USB device number 2 using ehci-pci
Sep 06 15:17:09 archlinux kernel: usb 4-1: new high-speed USB device number 2 using ehci-pci
Sep 06 15:17:09 archlinux kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata10: SATA link down (SStatus 0 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata9: SATA link down (SStatus 0 SControl 300)
Sep 06 15:17:09 archlinux kernel: ata2.00: NCQ Send/Recv Log not supported
(...snip...)
Sep 06 15:17:11 archlinux kernel: ZFS: Loaded module v0.7.0-1551_gcc99f275a, ZFS pool version 5000, ZFS filesystem version 5
Sep 06 15:17:11 archlinux kernel: random: crng init done
Sep 06 15:17:11 archlinux kernel: random: 7 urandom warning(s) missed due to ratelimiting
Sep 06 15:17:11 archlinux zpool[281]: cannot import '<redacted>': no such pool or dataset
Sep 06 15:17:11 archlinux zpool[281]:         Destroy and re-create the pool from
Sep 06 15:17:11 archlinux zpool[281]:         a backup source.
Sep 06 15:17:11 archlinux systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Sep 06 15:17:11 archlinux systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Sep 06 15:17:11 archlinux systemd[1]: Failed to start Import ZFS pools by cache file.

Apparently a lot of device initialization happens after udevadm settle on this particular system.

kerberizer commented 6 years ago

@dasJ I can confirm being able to avoid the issue by inserting an appropriate delay before pool import. My test solution was rather crude: if the first import would fail, it would sleep 2 seconds, then try again and sleep another 4 seconds on failure before trying one last time. I'm afraid I don't know right now what would be the most elegant and efficient approach. In any case, the ability to configure a delay before the pool import—possibly via a kernel parameter—may at least be a reasonable interim solution.
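
A minimal sketch of the retry scheme described above, written as a POSIX shell function. It is an illustration only, not the change that was actually tested: the function name `retry_with_delays` and the `RETRY_DELAYS` variable are made up for this example, and the command to retry is passed in as arguments rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: run a command once, then retry it after each delay
# listed in RETRY_DELAYS (seconds). The default "2 4" mirrors the delays
# described above: wait 2 s, retry, wait 4 s, retry one last time.
retry_with_delays() {
    # First attempt, with no delay.
    "$@" && return 0

    # Intentionally unquoted so the list is split into separate delays.
    for delay in ${RETRY_DELAYS:-2 4}; do
        sleep "$delay"
        "$@" && return 0
    done

    # Every attempt failed.
    return 1
}
```

It would be called as `retry_with_delays /usr/bin/zpool import ...`, with whatever arguments the import service already passes. Fixed delays are a blunt instrument: they slow down a boot that would have failed anyway and may still be too short on slower hardware.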

kerberizer commented 6 years ago

I've also encountered the issue on another system, but I can't tell yet what might be different about the problematic systems. The same workaround of inserting a delay did at least work there too.

Klowner commented 5 years ago

@kerberizer I realize it's been a year, but would you be willing to share the modifications you made to introduce the delay? I'm having a heck of a time booting a system with a zpool on a USB device and it appears to be entirely a timing issue.

kerberizer commented 5 years ago

@Klowner No problem sharing at all, but I need to recall what those changes were; it appears that at some point I removed them. Off the top of my head, I'd suggest editing zfs-import-cache.service (or zfs-import-scan.service if not using zpool.cache) and replacing the /usr/bin/zpool import in ExecStart with something like /usr/bin/sh -c "zpool import ... || { sleep N; zpool import ...; } || ...". The braces matter: without them, sh would run the second import even when the first one succeeds. The point is to retry the pool import after some time if it fails, hoping that the devices will have had time to settle in the meantime.
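
One caveat when writing such a one-liner: in sh, `&&` and `||` have equal precedence and are evaluated left to right, so in `a || sleep N && b` the command `b` also runs when `a` succeeds. Grouping the fallback with braces avoids that. The sketch below shows the difference using stand-in functions; `first_try` and `second_try` are placeholders for the two import attempts, not real commands, and `sleep 0` stands in for the delay.

```shell
#!/bin/sh
# Placeholders standing in for the two import attempts.
first_try()  { echo "first";  return 0; }   # pretend the first import succeeds
second_try() { echo "second"; return 0; }

# Ungrouped: parsed as (first_try || sleep 0) && second_try,
# so second_try runs even though first_try succeeded.
ungrouped=$(first_try || sleep 0 && second_try)

# Grouped: the delay and the retry only run if first_try fails.
grouped=$(first_try || { sleep 0 && second_try; })

# Unquoted echo collapses the captured newlines into spaces for printing.
printf 'ungrouped: %s\n' "$(echo $ungrouped)"
printf 'grouped:   %s\n' "$(echo $grouped)"
```

The ungrouped form prints both `first` and `second`, while the grouped form prints only `first`.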

A more robust solution may be unnecessary, as Arch Linux may at some point ditch initcpio and replace it with dracut, or at least that was my impression from some emails on the arch-dev-public mailing list.