Question on building the "-part1" partition

reefland commented 1 year ago

I'm reviewing your script and trying to understand in a multi-disk setup, how are you populating the -part1 partition on each disk?

I can see within systemsetupFunc_part3():

mkdosfs -F 32 -s 1 -n EFI /dev/disk/by-id/"$DISKID"-part1 
    sleep 2
    blkid_part1=""
    blkid_part1="$(blkid -s UUID -o value /dev/disk/by-id/"${DISKID}"-part1)"
    echo "$blkid_part1"

I don't see where this is in a loop that would update the -part1 on each disk being used.

Within the initialinstall() the systemsetupFunc_part3 #Format EFI partition. is only called once, I don't see a loop.

It seem like when the one drive added to fstab dies / is removed from the system, the system will be unbootable as the the other disks do not have a populated -part1 partition and they are not referenced within fstab.

Nice work, clean script.

Sithuk commented 1 year ago

Well spotted. You are correct I need to add code to create the EFI filesystem on the other disks. I also need to understand how to properly setup zfsbootmenu with multiple ESPs.

I had a look at rlaager's root on zfs guide for this step. He recommends repeating the mkdosfs command for each disk, but not the other commands, including the command that creates the fstab entry (see step 7 in his guide). https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2022.04%20Root%20on%20ZFS.html

However, that still leaves the question of what happens if you lose the "part1" disk. I’ll need to do some testing around that scenario.

Thank you for highlighting the bug. Please let me know if you see any more.

reefland commented 1 year ago

I've been using the OpenZFS on Root method for a few years now. For 20.04, after building the 1st one, at the end I used dd to clone it to each of the other -part1 partitions. For example, a current boot mirror I have:

$ cat /etc/fstab 
/dev/disk/by-uuid/3C8A-3226 /boot/efi vfat defaults 0 0
/dev/disk/by-uuid/27B1-B239 /boot/efi2 vfat defaults 0 0

And this works well, I've swapped out failed devices and done upgrades without issue. I recreate the partition sizes to match on the new device, dd the new -part1 or just do the manual commands. Remove the old entry in fstab and add in the replacement.

My older storage server with more drives, looks this way:

$ cat /etc/fstab 
UUID=6F65-2805 /boot/efi vfat umask=0022,fmask=0022,dmask=0022 0
UUID=3608-D8FF /boot/efi2 vfat umask=0022,fmask=0022,dmask=0022 0
UUID=2FF8-DA88 /boot/efi3 vfat umask=0022,fmask=0022,dmask=0022 0
UUID=AB06-2CE8 /boot/efi4 vfat umask=0022,fmask=0022,dmask=0022 0

I'm redesigning my ZFS on Root Ansible work to use ZFSbootmenu / refind alternatives and trying to find solutions for the various stages. I don't know if /boot/efi is fairly static with zfsbootmenu that a dd still works or if a better method is used. Thus I checked your script to see how you tackled it.

Being that the mdadm is already used for SWAP, I was thinking of making all the -part1 a mdadm mirror. Considering this is not already a standard practice even though mdadm is used for SWAP, I have to assume there is a reason, but not figured out what that reason is yet.

Sithuk commented 1 year ago

I've created a thread upstream and the ZBM authors have engaged in the discussion. https://github.com/zbm-dev/zfsbootmenu/discussions/363

reefland commented 1 year ago

Thanks for the link, so its seems mdadm can be used... but I still like the user contrib script / userland tools for simplicity.

I've had mdadm issues where it refused to rebuild the array but the impact was only no swap. Still a bootable and usable system you could troubleshoot with. I wouldn't want that fight on my boot volume and try to fix that with emergency rescue tools. But that is just my lack of mdadm skills talking.

Sithuk commented 1 year ago

I tend to agree. The simpler user land tooling of the contrib script should leave less to go wrong. I need to test to see if refind will pick up all the ESPs automatically. I also need to test what happens on loss of a drive when zbm boots the system.

reefland commented 1 year ago

Still foggy what to do with the fstab file when using the contrib script (if anything). The system would have to already found and loaded the pool, decrypted and mounted the filesystems before fstab would even be available. Seems like its not relevant after the initial chroot system build. Hmmm.

Edit... comment in the script:

## Only files with `zfsbootmenu` in the name are copied to / deleted from
## the backup ESPs. It will not manage any other files for you.

Then directories for refind, syslinux, etc would not be copied to backup ESPs.

Sithuk commented 1 year ago

The rsync command in the contrib script could be altered to sync the full contents. I'm still not clear on whether the simpler contrib script approach will allow everything to fully work as it should without having to amend file contents such as /etc/fstab. There is no substitute for testing I suppose.

Sithuk commented 1 year ago

The arch wiki has an excellent section with helpful links to supporting information on a RAID ESP. It also includes an mdraid command to create the array. https://wiki.archlinux.org/title/EFI_system_partition#ESP_on_software_RAID1

One of the background links in the arch wiki section above was to a forum post where the poster noted that rEFInd can create a screenshot which could corrupt an mdraid array ESP. The corruption is because rEFInd only saves the screenshot to one of the ESPs. The other ESPs are then out of sync?

"You run the risk of data corruption when the EFI, or anything else that doesn't treat the ESPs as a Linux RAID 1, writes to just one of the ESPs. This could happen if you use an EFI shell's text editor to edit a boot loader configuration file or if you save a screen shot from rEFIt or rEFInd, to name just two possible ways this could happen. There might even be a risk of problems if a file's time stamp were modified, although I don't know if this could become serious." https://bbs.archlinux.org/viewtopic.php?pid=1398710#p1398710

According to the arch wiki page for rEFInd, pressing F10 in rEFInd will save a screenshot to the top level directory of the ESP (a single drive ESP as rEFInd doesn't support RAID). So if someone were to inadvertently press F10 during while the rEFInd screen is shown, then the mdraid ESP RAID array would become corrupted?

Interestingly, Fedora uses the RAID 1 approach for the ESP when a mirror configuration is selected in its installer, according to comment #4 on the following blog post (linked from the arch wiki page above). I think Fedora uses GRUB2 as its bootloader so perhaps the corruption risk is lower? https://outflux.net/blog/archives/2018/04/19/uefi-booting-and-raid1/ "FWIW, when selecting a mirror configuration in the Fedora installer it also creates the ESP on a RAID-1 (with superblock format 1.0) and sets the GPT partition type UUID to ‘Linux RAID’. The Fedora efibootmgr works fine in such a configuration."

I like the auto sync nature of the mdraid approach but am concerned by the potential corruption issue, especially as the ubuntu zfsbootmenu install script currently uses rEFInd as boot loader. I suppose I could work around that by swapping out rEFInd for systemd-boot. There is a guide to setting up ESP RAID at the following forum thread. https://bbs.archlinux.org/viewtopic.php?id=177722

But there is still the risk that the ESP gets written to at boot by some particular hardware firmware implementation.

So I'm leaning towards following the contrib script approach to have zfsbootmenu update the multiple ESPs. rEFInd should pick up and display the multiple ESPs at boot. So if a drive fails it should default to booting off one of the other ESPs. I need to test what happens to Ubuntu when the fstab tries to mount the failed drive ESP.

reefland commented 1 year ago

I just rebuilt with the 3x mdadm mirror and it seems to be working fine. Tried the partition UUID for EFI and Linux Raid didn't seem to make a difference. I tried to corrupt it in refind by pressing F10, but nothing happens. I don't see a screen that even says F10 does anything. Perhaps some option I have not enabled?

Sithuk commented 1 year ago

I've tested removing a disk from my virtual machine and VirtualBox doesn't always boot into one of the other backup ESPs. I get dropped into an EFI shell. From there I can manually start one of the backup refind installations. I'm not sure why Virtualbox isn't starting one of the other refinds by default. Have you had any issues with one of the backup refind installations being detected after you remove a disk?

I've also noticed that the mirror rpool no longer automatically mounts. The pool becomes degraded when the disk is removed. I need to "zpool import -f rpool" from the zfsbootmenu prompt. Then I can start the system as usual. Although I get errors associated with the mdraid swap array now that a disk is missing.

Have you had these issues too? The important thing is the system is recoverable on loss of a disk, but it would be nicer if it was a failover solution where no user input was required to get the system to start as usual.

reefland commented 1 year ago

I haven't tested that yet. I will. I'm still trying to get this to work on my stubborn hardware in a repeatable way and get it to load refind vs ZBM directly.

Sithuk commented 1 year ago

I should mention that I haven't used an mdraid array for the ESPs. I've kept them standalone.

reefland commented 1 year ago

So I just yanked one drive from VirtualBox... Refind screen correctly showed 2 ESPs instead of 3 in my case (it doesn't know they are a RAID so sees the 3 mirror as individuals). You have to wait 3 to 4 minutes for devices to time out, but you get some messages on the screen:

But a login screen did come up, ZFS saw the disk issue:

The mdadm was aware as well:

Powered off and added the disk back to VirtualBox, Refind showed 3 ESPs.

mdadm still took a while to wait for before I got a login prompt.

ZFS Resilvered:

mdadm needed some manual work to add the drive back, but once done the array was clean:

reefland commented 1 year ago

Hmm... even after next reboot, still took a long time for the login prompt to come up... a minute or so. Not sure what that is about. But system is functional once its up.

Sithuk commented 1 year ago

I tested with a 3 way mirror, not raidz1. Could you include a 3 way mirror in your test for comparison?

Could you also try removing a different disk for your test? I'm wondering if removing one of the "backup" ESP disks doesn't cause an issue, but removing the primary ESP disk gives an issue in Virtualbox at boot.

reefland commented 1 year ago

Sure, I won't be able to test that for another week or so.

Sithuk commented 1 year ago

I've updated the script to support backup ESPs. Could you test the new feature?

reefland commented 1 year ago

I've updated the script to support backup ESPs. Could you test the new feature?

Just tested, it installed and booted fine.

I decided on mdadm arrary for ESP for my script. Do you plan to include a process to keep the ESPs in sync? After a few reboots I'm already seeing differences:

I removed disk 0 from VirtualBox and it was unable to boot. Just left me in the EFI Shell. I was able to use the shell to navigate to the other disk and start refind without a problem but it was unable to import the pool.

Using ZBM recovery shell trying to import manually, message said the pool was previously used by another system. I was about to use zpool import -f -a to import it and exit to return to ZBM and enter my encryption key; but it just seemed to hang at this point with VirtualBox logo for several minutes.

After about 3 to 4 minutes got some messages, seems it was a timeout waiting on /dev/md0.

I usually do not have swap partitions enabled in my setups.

It then dropped me into linux emergency mode / maintenance shell. Not sure what the issue was, really could not get a normal boot. I suspect I needed to delete a device from the md0 swap array.

reefland commented 1 year ago

In contrast, my setup using the mdadm array for ESP. Virtualbox had no problem just booting from another disk when I removed disk 0. Refind fired right up.

I still had a significant delay after ZBM / Entering passphrase. I suspect from mdadm waiting on the missing array device.

But it still booted up cleanup and I got the normal login prompt.

Sithuk commented 1 year ago

Just tested, it installed and booted fine.

That's good news, thank you for testing.

Do you plan to include a process to keep the ESPs in sync?

I've added a short rsync script that runs when generate-zbm is run. The script should be in the following location. /etc/zfsbootmenu/generate-zbm.post.d/

You are correct that the ESPs can drift out of sync. Boot loaders can save info such as log information in the ESP. That won't be sync'd to the backup ESPs. That won't stop those backup ESPs from working though.

I removed disk 0 from VirtualBox and it was unable to boot. Just left me in the EFI Shell. I was able to use the shell to navigate to the other disk and start refind without a problem but it was unable to import the pool.

In my similar "disk failure" testing, sometimes Virtualbox started a backup rEFInd ok, and sometimes it dumped me to the EFI shell. I was always able to use the shell to start the backup refind efi file too. I wasn't able to figure out why Virtualbox sometimes booted the backup and why sometimes it didn't. Virtualbox's EFI implementation is described as experimental.

"Note that the Oracle VM VirtualBox EFI support is experimental and will be enhanced as EFI matures and becomes more widespread." https://docs.oracle.com/en/virtualization/virtualbox/6.0/user/efi.html

Please let me know the results if you do any real hardware testing of the backup ESP configuration. Or test using other virtual machine software.

Using ZBM recovery shell trying to import manually, message said the pool was previously used by another system. I was about to use zpool import -f -a to import it and exit to return to ZBM and enter my encryption key; but it just seemed to hang at this point with VirtualBox logo for several minutes.

I was always able to use the ZBM recovery shell to "zpool import -f rpool" successfully. I tested this over the remote access at boot too. So the system should allow for headless recovery even with a disk failure.

After about 3 to 4 minutes got some messages, seems it was a timeout waiting on /dev/md0.

I got the same. Ubuntu eventually starts after the timeout. You are correct that the user can then remove the failed drive from the swap array.

In contrast, my setup using the mdadm array for ESP. Virtualbox had no problem just booting from another disk when I removed disk 0. Refind fired right up.

I understand that fedora uses mdraid for ESP on multi disk setups. I was curious how fedora resolved the risks of writes to the ESP given that the UEFI spec doesn't support it. The following thread doesn't inspire confidence in the approach. https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/QTRRMZU6PORBTGSATUFUIEUKY7QYA3PV/ Extracts below are from posts by "Chris Murphy" on that thread: "Writing outside of md won't itself cause corruption. It just causes the ESPs to mismatch. The corruption will happen upon md having no way of resolving the mismatches, and feeding conflicting file system metadata to the vfat driver. These reads, without any write, can be corrupt, let alone subsequent writes back to the array, which for sure will corrupt both member devices in such a case.

Ostensibly the firmware should only be doing reads, but the UEFI spec does permit writes by the firmware."

"Anytime one of the member devices is written to outside of the array, md raid1 more or less randomly chooses which block is correct. So you can actually end up with a totally unbootable system with both (fake) ESPs totally corrupted by the sync process, where left alone unsynced, both are valid working ESPs even if one might be stale."

Lennart Poettering's recent blog post on this topic suggests that the systemd is thinking of some changes to the booting process. As a next step for the script, I might look at porting over to systemd-boot instead of rEFInd. https://0pointer.net/blog/linux-boot-partitions.html (see "Addendum: You got RAID" for the relevant section on multi ESPs) The following project might also of interest to keep backup ESPs in sync as it uses the systemd service file approach. https://github.com/gregory-lee-bartholomew/bootsync

In summary, the install script should work in its present form. The backup ESPs may get out of sync, but should still work, i.e. they shouldn't be "corrupted". They should also get re-sync'd when generate-zbm is re-run. The whole setup should still work on a headless system. What is really needed is for further testing to be done, including on real hardware. Next steps will be to see how the systemd team develop the multi ESP support and to then align with it, probably with systemd-boot.

Sithuk / ubuntu-server-zfsbootmenu

Question on building the "-part1" partition #12