Support for reconfiguring the root storage

ajeddeloh commented 5 years ago

FCOS will have a static, server-side generated initramfs. This means the initramfs needs to know how to find the root filesystem to mount it, including if that means starting RAID devices. However, we do NOT want to start other raid devices (unless Ignition is starting them as part of the creation process) since users might be writing out configuration to things like /etc/mdadm.conf that they want to use.

On CL we require that RAIDed devices be partitions with a special GPT partition type GUID, and we have a udev rule to start only those.

For FCOS I can see two options:

Do something similar to what CL did. This has the limitation of being unable to use whole disks for raid and requires partition tables where they would not otherwise be needed.
Have some sort of file that gets written out to /boot which is a "rootmap". This would describe the "path to root" including what services like mdadm need to be started. Ignition could generate this if we include a way of flagging partitions/RAIDs as "containing root" or it could be user written. If it is user written we should unmount root after Ignition files/umount and then let whatever tooling reads it mount the root itself to ensure the first boot is not "special" in any way.

dustymabe commented 5 years ago

We discussed this briefly in the meeting today - There are two things I'd like (personal opinion) to do as part of this:

develop a list of cases we'd like to support
- this will allow us to prove out theories about solutions here
prefer not building something new
- of course if we need to we need to, but would like to have strong justification for doing so

Andrew also mentioned that he'd like the solution to be generic enough that if we decide to support new use cases in the future our current solution doesn't prohibit us from doing so. If we build something new, this is reasonable.

cgwalters commented 5 years ago

Lots of discussion around partitioning in https://github.com/coreos/fedora-coreos-tracker/issues/18

Let's repurpose this issue as a tracker for support for reconfiguring the root filesystem in any form; RAID is a subset of that.

Bigger picture, this gets a lot more complex with the combination of FCOS:(ostree, ignition) as opposed to ContainerLinux:(dual partition /usr, ignition), because with CL replacing the rootfs is nearly trivial, but for ostree it is entangled with the OS (which adds flexibility, but does make this problem more complex).

There are two sub-paths to this problem: One where we're not destroying the existing rootfs (e.g. we're just adding a new partition that we want to make into the rootfs), and one where we're intending to replace it.

I think if we're trying to replace the rootfs...we get into an interesting problem domain because we need to pull the ostree repo into RAM temporarily. In most cases...hmm, that should be fine really. But, if we want to optimize things, I could imagine that coreos-installer accepts Ignition, and if a replacement of the root filesystem is requested, we do that at install time as opposed to boot time. We'd then pass the remaining Ignition as /boot/ignition.ign.

And we need support for running ostree in the initramfs - shouldn't be too hard.

The flow would be something like:

ignition-disks.service
ostree-redeploy-rootfs.service
ignition-files.service

ajeddeloh commented 5 years ago

Hadn't considered pulling it into RAM; I like that idea. I was thinking about just pulling over network, but RAM has some obvious speed/bandwidth cost advantages. There's some tricky bits:

Unless we always want to pull the ostree into ram, we need to detect whether we're going to blow away root.
We can't do it if ostree size > available ram, where available ram will be slightly smaller than apparent because we need to hold it in ram while ignition disks runs. We could optionally fall back to using network in this case.
Before ignition disks runs, we don't have a rendered config, which means we can't know if we're going to blow away root. This could be solved by splitting config rendering into its own stage.
ignition-disks starts before all devices are ready. We can't look at device fields in the config since those could be using those symlinks.
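
The "ostree size > available ram" check above could be sketched roughly as follows. This is a minimal sketch, not something from Ignition or ostree: the function names and the 512 MiB reserve for ignition-disks are assumptions made up for illustration, and it only reads the MemAvailable field of /proc/meminfo-style text.

```python
def mem_available_bytes(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Lines look like "MemAvailable:    4000000 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found")


def can_cache_in_ram(ostree_bytes, available_bytes,
                     reserve_bytes=512 * 1024 * 1024):
    """True if the ostree repo fits in RAM with some headroom left over.

    reserve_bytes is a hypothetical allowance for what ignition-disks
    itself needs while the copy is held in memory.
    """
    return ostree_bytes + reserve_bytes <= available_bytes
```

If the check fails, the fallback mentioned above (pulling over the network) would be the alternative path.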

I could imagine that coreos-installer accepts Ignition, and if a replacement of the root filesystem is requested, we do that at install time as opposed to boot time

This doesn't help on clouds. I'd rather have a unified way. In some ways it would be easier to just do all of disks then, but that's got its own challenges and deserves its own ticket.

Proposal:

Split ignition disks into render and disks
Detection could work a couple ways:
- Require that the partition containing root has partlabel=root. Cache to ram if we find any partition entries with "label": "root". We can't (reasonably) wait on udev for the device since if it's on raid or similar it won't become available until after ignition-disks. This generates false positives for cases like resizing root to be bigger.
- Require that the filesystem containing root has label=root. Cache to ram if we find any filesystem entries with "label": "root". I can't think of a reason to include a rootfs definition matching the default anyway.
ignition-[u]mount needs to learn to ignore filesystems with "path": "/" since the initramfs owns mounting those
Still do not support things like split /usr or /etc. This means ostree-redeploy-rootfs needs to happen before ignition mount.
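
To make the two detection options concrete, here is a minimal sketch of what a detect-root-change step might check. It assumes the rendered config is available as a dict with a spec-3-style layout (storage.filesystems[].label and storage.disks[].partitions[].label); the function name and the `by` parameter are hypothetical, not part of Ignition.

```python
def creates_new_root(config, by="filesystem"):
    """Guess whether a rendered config defines a new root.

    by="filesystem": any filesystem entry with "label": "root"
    by="partition":  any partition entry with "label": "root"
    """
    storage = config.get("storage", {})
    if by == "filesystem":
        return any(fs.get("label") == "root"
                   for fs in storage.get("filesystems", []))
    if by == "partition":
        return any(part.get("label") == "root"
                   for disk in storage.get("disks", [])
                   for part in disk.get("partitions", []))
    raise ValueError("by must be 'filesystem' or 'partition'")
```

With the partition variant, a config that only resizes the existing root partition also returns True, which is the false positive described above.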

Sequence would look like:

1) ignition-render
2) detect-root-change (needs to understand ignition spec but probably shouldn't be part of the ignition project)
3) ignition disks
4) rootfs mounted
5) ostree-redeploy-rootfs
6) ignition mount
7) ignition files
8) ignition umount
9) switch root

This still leaves open the question of "how do we find/mount root on subsequent boots".

An unrelated question is: do we support blowing away /boot? My gut says no. On x86_64 you can blow away /efi when running on bios (may need to mask the mount unit for it) or BIOS-BOOT when running on EFI. I don't think we can get rid of boot though. It's also partition 1 so it shouldn't get in the way of anything and it's not huge. We probably could find a way to allow moving it but I really don't think it's worth the complexity that would entail.

cgwalters commented 5 years ago

We can't do it if ostree size > available ram, where available ram will be slightly smaller than apparent because we need to hold it in ram while ignition disks runs

Yeah I had the same hesitation but then I realized two things. First, the current FCOS is 1.5G. I think if you're doing something like RAID we can presume you have a "serious" system and have at least say 8G of RAM.

Second, we could enable zram temporarily.

ajeddeloh commented 5 years ago

I realized I initially missed the bit about "what if we're just moving the rootfs and the old one will still be accessible". Do you think it's worth the complexity to implement that? If so I think we ought to do this in two phases, handling the generic case first then treating that as an optimization.

ajeddeloh commented 5 years ago

Second, we could enable zram temporarily.

Neat! I think we ought to handle that second and only enable it when we don't have enough ram to do it without.

cgwalters commented 5 years ago

We can't (reasonably) wait on udev for the device since if it's on raid or similar it won't become available until after ignition-disks. This generates false positives for cases like resizing root to be bigger.

I'm confused...doesn't this stuff imply Ignition running twice? We know the initial rootfs we're booting from isn't on RAID because our default disk image doesn't do that.

ajeddeloh commented 5 years ago

I'm confused...doesn't this stuff imply Ignition running twice? We know the initial rootfs we're booting from isn't on RAID because our default disk image doesn't do that.

I'm not sure why it would imply running Ignition twice? Unless you're referring to getting the config, in which case that's why I'm suggesting we split that into its own stage. I'm saying it's nontrivial to know with 100% certainty that the root is being destroyed since the block device has aliases that we would need to wait for. I suppose we could, but that makes the live pxe case harder since there isn't a device and some aliases aren't known before boot.

Consider if FCOS is installed with /dev/sda4 being root. That can be referenced by /dev/sda4, /dev/disk/by-label/root and /dev/disk/by-partlabel/root (and some others that are system dependent). I'm saying we can't detect if a user is blowing away the partition because:

1) We don't know what disk we're installed on (maybe we should add a udev rule for that? create something like /dev/disk/installed. Seems useful in general and could improve config portability), so if they're wiping the filesystem with device: sda4 we don't know if that is actually the rootfs or not since we might be installed on sdb instead.
2) Those symlinks may not exist yet so if the user is using them we don't know where they point. Things like /dev/disk/by-path/* are going to be dependent on the hardware so we really don't know if they're pointing to the rootfs. We can't wait for them either because while /dev/disk/by-label/root pointing to the rootfs is obvious, something like /dev/disk/by-path/<something> is not. We don't have a way to differentiate those from other symlinks that could appear as a result of ignition-disks. We don't know which ones are safe to wait on and which are not.

Because of that we should be detecting that we're creating a new rootfs instead of deleting the old one. We can either do that with partitions looking for a partlabel (which can have false positives when just resizing) or with the filesystem label (which can have false positives if they create an entry that matches the on-disk FS (see https://github.com/coreos/ignition/blob/master/doc/operator-notes.md#filesystem-reuse-semantics), though I'm not sure why you'd do that).

ajeddeloh commented 5 years ago

We discussed this at the FCOS IRC meeting today, mostly about how to handle the second boot. There's essentially five options:

1) Do what Container Linux does and require partitions containing complex devices (e.g. LUKS volumes, parts of RAID, etc.) use GPT type GUIDs to specify what they are. The initramfs can then carry udev rules to start these devices.
- Devices MUST be GPT partitioned to be used (can't do RAID on the device itself, only partitions)
- Can't carry extra configuration for things like mdadm, encryption

2) Do what mutable OS's do and regenerate the initramfs to include configuration to start complex devices
- Dracut's default modules can be sloppy (do things like start extra devices we don't need to, see rd.auto)
- Generating new initramfs's means creating new bootloader entries
- Doesn't require carrying a lot of new code in the initramfs
- Open question: would we generate this in the initramfs or in the real root?
- Makes debugging harder since we no longer know exactly what's in the initrd

3) Write a config file to /boot that maps out how to find the root device
- Requires a new tool to read that config and mount/start/decrypt things (systemd generator would be nice but we'd need to mount /boot which makes it harder and would slow down boot if unused).
- Config file would be easy to extend, can handle arbitrary complexity
- Isn't portable to distros that don't have a separate boot (though complex root is harder in general in that case)

4) Write kernel command line arguments that specify how to find the root device.
- Requires a new tool to read that config and mount/start/decrypt things. Could be a systemd generator
- kargs aren't great if the user is doing something complex (e.g. LUKS on RAID on RAID).
- kcmdline max length is 4k
- kargs are transparent, matches with "traditional" ways of specifying root devices.
- kargs are simple to read/implement

5) Do 3, but use an extra initrd instead of writing a file to /boot
- All the same points as 3, except:
- can write a systemd generator
- creating an initrd for that is harder especially if the user writes it by hand (see below)
- would need a default "noop" initrd that can be replaced so grub doesn't throw a fit when it can't find it. Not sure if ostree changes would be needed for that.
- initrd can just be a file ignition writes to /boot

In cases 2-5 there's also the question of "how do we specify what is needed to start the root disk?" Case 1 it's implied that you specify it with type guids. We could:

Try to infer from the ignition config.
- would require users use well known symlinks to devices (e.g. /dev/md/name instead of /dev/mdXXX)
- would be hard to debug if we inferred wrongly
- not very explicit what's going on
Carry extra fields in FCCL for "contains_root" (not married to that name) which FCCT (case 3, 4, 5) or dracut (case 2) uses to write out the config/kargs or create the new initramfs.
- more explicit, but not too much extra work for the end user
- Doesn't leave much room for things like specifying options for encryption, mdadm, etc
Require the user manually create the config (3), kargs (4), or config initrd (5) and use Ignition to write them.
- Nice since it's explicit. We're not doing anything the user doesn't expect
- Bad since a lot of what they write would be largely redundant.

ajeddeloh commented 5 years ago

Bit of data: looks like the kcmdline size is 2048 on all the arches we care about except s390x where it is 896.
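
Given those numbers, a length check for option 4 is cheap to write. A minimal sketch: the limits are the ones quoted above (2048, or 896 on s390x), the names are hypothetical, and it conservatively assumes one byte of the limit is taken by a terminating NUL.

```python
# Limits as quoted in this thread; anything not listed uses the default.
KCMDLINE_MAX = {"s390x": 896}
KCMDLINE_DEFAULT_MAX = 2048


def kargs_fit(kargs, arch):
    """True if the space-joined kargs fit within the arch's cmdline limit."""
    limit = KCMDLINE_MAX.get(arch, KCMDLINE_DEFAULT_MAX)
    # Strictly less than: leave room for a terminating NUL (assumption).
    return len(" ".join(kargs)) < limit
```

A deeply layered setup (e.g. LUKS on RAID on RAID) with per-device options is the kind of input that could hit the s390x limit first.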

cmurf commented 5 years ago

Carrying over from #18 since it's closed.

Re: dd and GPT: 512n and 512e can share one GPT, 4Kn needs a different GPT. I don't see how they can be merged without invalidating Partition Entry Array CRC32 which is computed on the content between the header and First Useable LBA. And First Useable LBA on 512 byte sector devices must be 34 or higher.
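
The 34 figure falls out of simple arithmetic. A minimal sketch, assuming the UEFI minimum of 16,384 bytes reserved for the partition entry array (128 entries of 128 bytes) and the usual layout with the protective MBR at LBA 0 and the GPT header at LBA 1; the function name is made up for illustration.

```python
# Minimum space reserved for the GPT partition entry array:
# 128 entries x 128 bytes each.
GPT_MIN_ENTRY_ARRAY_BYTES = 128 * 128


def min_first_usable_lba(sector_size):
    """Lowest possible First Usable LBA for a given logical sector size."""
    # Ceiling division: sectors needed to hold the entry array.
    entry_sectors = -(-GPT_MIN_ENTRY_ARRAY_BYTES // sector_size)
    # LBA 0 is the protective MBR, LBA 1 is the GPT header.
    return 2 + entry_sectors


print(min_first_usable_lba(512))   # 512n / 512e
print(min_first_usable_lba(4096))  # 4Kn
```

Because the header and entry array sit at fixed LBAs rather than fixed byte offsets, an image laid out for 512-byte sectors puts them at different byte positions than a 4Kn device expects, which is why one GPT cannot serve both.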

Re: mkfs.xfs uses a 4096 byte sector size on 512e and 4Kn, but uses 512 bytes on 512n.

Those two above combined suggest three raw images, unless something gives.

Also, when dd'd to a different sized drive, the GPT becomes invalid, so that should probably get fixed as a first step.

Each RAID layout should get its own mkfs.xfs for setting stripe width/unit

On 4Kn drives, minimum EFI system partition size is 256MiB if it's FAT32, which the UEFI spec does suggest it should be, sorta. But mkdosfs only starts using FAT32 above something near 550MiB, unless -F 32 is used. fedora-coreos-30.20190905.0-metal.raw has a FAT16 ESP. Obviously it works, and I haven't ever heard of FAT16 not working on anything.

Re: ZRAM dynamically allocates RAM, setting up swap on ZRAM at 1:1 to RAM has inconsequential overhead on a system that doesn't need it, but you could do a test similar to Anaconda's implementation and only start it for low memory devices. I'd like to see this upstream version working and rebase the myriad implementations on it. https://github.com/systemd/zram-generator

cgwalters commented 5 years ago

I'm working on the fundamentals of this in https://github.com/coreos/ignition-dracut/pull/107

First MVP is to support this fcct:

# Example of redeploying the rootfs as ext4
variant: fcos
version: 1.0.0
storage:
  filesystems:
    - device: /dev/disk/by-partlabel/root
      format: ext4
      wipe_filesystem: true
      label: root

Tracker for issues:

[ ] Get jlebon's SELinux/initramfs kernel patch into Fedora kernel
[ ] https://github.com/coreos/coreos-assembler/issues/781
[ ] Preserve SELinux labels (probably involves writing the code in rpm-ostree)
[ ] Documentation of what's supported and what isn't

cgwalters commented 5 years ago

To reiterate on some discussion in that PR, this is to help enable https://github.com/openshift/enhancements/pull/15 without requiring the root filesystem to come in a LUKS container by default (and relatedly, ensure that when a user chooses LUKS it can be encrypted from the start with a TPM2-sealed key, etc.)

darkmuggle commented 5 years ago

Sorry, I didn't mean to close. I have NO idea why "close and comment" is next to the comment button.

cgwalters commented 4 years ago

https://lore.kernel.org/linux-fsdevel/157867221257.17873.1775090991929862549.stgit@devnote2/ proposes appending to the initrd as a way to provide larger kernel configuration data; relevant here.

cgwalters commented 4 years ago

I was talking with a Stratis developer today. One thing worth thinking about is whether it would make sense to potentially focus on having Ignition configure Stratis. That way we get out of the business of doing e.g. raid/lvm/dm-crypt directly ourselves. (He said Stratis is working on dm-crypt)

cgwalters commented 4 years ago

I could imagine for example if Stratis itself provided a stable declarative language (not sure it does today) then we could just pass that through too.

jamescassell commented 4 years ago

I heard stratis was a failed experiment from @Conan-Kudo

Conan-Kudo commented 4 years ago

Ehh, it's still around and being developed (but not by the guy who designed it, who left), but I don't see enthusiasm from anybody over it.

dustymabe commented 4 years ago

I could imagine for example if Stratis itself provided a stable declarative language (not sure it does today) then we could just pass that through too.

If there were a good healthy project for this I would certainly love to depend on another project for this if we could, kind of like how we're passing off network management to NetworkManager.

Conan-Kudo commented 4 years ago

If there were a good healthy project for this I would certainly love to depend on another project for this if we could, kind of like how we're passing off network management to NetworkManager.

Wouldn't this be UDisks?

dustymabe commented 4 years ago

Wouldn't this be UDisks?

Maybe? I've never used it before.

cgwalters commented 4 years ago

Wouldn't this be UDisks?

That's a useful point, thanks; UDisks is also a daemon with a DBus API to manage storage that also doesn't currently have an Ignition-like declarative frontend. And you're right that it is well tested and maintained, and even currently supports dm-crypt as well as LVM.

However...UDisks is much more "raw" and less opinionated about the storage stack; Stratis for example tries to tie together various layers in a much more opinionated way and then having tooling that allows admins to not have to think about the "filesystem" versus "lvm" layers.

cgwalters commented 4 years ago

that also doesn't currently have an Ignition-like declarative frontend.

(OK there's kickstart but I think Anaconda is still using blivet and not udisks/storaged?)

cmurf commented 4 years ago

To support Stratis requires moving away from dd based installations. Stratis expects to have complete control over the entire stack. That suggests rsync?

I think Stratis makes sense from a Red Hat business perspective, there are many md/mdadm, device-mapper, LVM, and XFS developers. But I don't think technology wise it's that interesting for Fedora.

As it relates to this issue, I think more interesting is for CoreOS to revisit its Btrfs roots. A different direction was once necessary for entirely valid reasons.

use Btrfs as the install image filesystem, which doesn't tie you to using it by default on installations, just rsync over to a new file system of choice, including Stratis;
built-in media checking for free; since kernel 5.5 cryptographic hashes (blake2b, and sha256); future might include authenticated file systems;
directly bootable by GRUB; including native compression which eliminates the unxz step; on-going space savings;
seed/sprout feature combined with ZRAM device, gives the option to reduce the need for a big+sophisticated+complex initramfs, by live migrating the sysroot to ramdisk, meaning you have a full OS to use for Ignition (presently complicated by this systemd regression), similar to pvmove but much less complicated since all the code is in the kernel;
not much sophisticated support is needed for Btrfs anyway, because few features are mkfs time only; most everything else can be rearranged online.

Related, I wonder whether it'd be useful to remove Btrfs complexity in Anaconda, and use Ignition instead to do things like setup subvolumes and raid.

cgwalters commented 4 years ago

To support Stratis requires moving away from dd based installations.

No. This whole issue is about having Ignition support for reprovisioning the rootfs on boot. For a bare metal install, the flow would be to dd the "default" (ext4 /boot, xfs /) image to disk, and on boot the Ignition config would say to use stratis, and we copy the rootfs to RAM, set up stratis for /, then copy the rootfs back.

See the discussion above.

cgwalters commented 4 years ago

Yes, this may seem silly on bare metal to copy an image only to on the next boot rewrite it, but the big advantage is total parity between cloud and bare metal; one can just as easily use RAID/stratis/luks/whatever in AWS/GCP/etc. as one can on metal.

cmurf commented 4 years ago

we copy the rootfs to RAM ... then copy the rootfs back

What does copy mean? cp -a? rsync -a? xfs_copy?

cgwalters commented 4 years ago

What does copy mean? cp -a? rsync -a? xfs_copy?

I wanted to implement this in rpm-ostree, because ostree admin deploy is 90% of it. But, rsync would also be OK probably.

cgwalters commented 4 years ago

Stratis discussion:

Introduction to stratis.
[bgilbert] question about how stratis is used/configured
[johnb] stratis doesn't want you to care about the underlying devices, let stratis manage the devices and grow/manage the filesystem, etc
[bgilbert] may require change to ignition spec to adapt to stratis
[walters] strawman: stratis pool section in ignition and a stratis filesystem section in the ignition configs. may be better in FCCT.
[ashcrow] would this be the default provisioning path or opt-in?
[walters] opt-in
[dustymabe] one benefit here is not having to manage our own code to do partitioning, filesystem, etc. similar to how we use NetworkManager for networking. helps with overall maintenance of drives.
[bgilbert] does stratis support state reconciliation? i.e. if stratis is given a desired state, will stratis take actions to reach state?
[mulhern] the DBus API is idempotent and goes into API level. Should be able to handle desired state declaration.
[walters] use case: i have a bunch of existing storage I want to preserve (i.e. logical volume with VMs) during re-install. can stratis recognize the existing LV or stratis pool?
[mulhern] if you request a pool from stratis with particular characteristics (name, devices), stratis would reply it was there
[bgilbert] if a filesystem or LVM logical volume or disk partition is requested with a size...(?)
[mulhern] if the stratis interface did not work idempotently, it would be a bug
[bgilbert] user requests a partition with name+size, if it does not exist, ignition will create it. if it does exist and has the wrong size or name, ignition has two modes to either clobber or bail out(?)
[mulhern] stratis vols are thinly provisioned with defaults, but we have been considering moving away from them (i.e. 1 TB size of volume)
[walters] my opinion is to do stratis over LVM, easier for users
[bgilbert] ignition runs in initrd, expects to have access to disks, can stratis run in the initrd?
[mulhern] can run in initrd, but IPC interface is via DBus, so there are challenges to be solved there
[dustymabe] similar problems to running NM in initrd; want to run via systemd, but haven't solved the complete problem yet. considered bringing DBus into initrd
[bgilbert] what is the implementation lang?
[mulhern] Rust!
[bgilbert] if we have DBus in the initrd, any other problems we can think of?
[mulhern] not aware of any...but there could be surprises
[bgilbert] if we added stratis to ignition spec, it would no longer be orthogonal to existing storage spec
[slowrie] stratis section could handle lvm subset, but don't provide options?
[bgilbert] can't part out pieces of the problem, stratis wants to do it all
[mulhern] yup; handles volume creation and filesystem creation
[dustymabe] if i wanted a partition on a disk with a certain size, can stratis do that? or does it need pools, etc?
[mulhern] it might be possible
[walters] use case: i want stratis as my root device only, no additional devices?
[dustymabe] can we wholly replace our disk/partition/filesystem implementation with stratis?
[bgilbert] we cannot replace our story completely
[mulhern] the metaphor of "storage on Rails" is accurate

cgwalters commented 4 years ago

I've been thinking more about this topic because RHCOS+osmet is blocked on LUKS stuff and really full rootfs reprovisioning working from the initramfs is currently blocking on a kernel patch @jlebon is working on so we can read labels from the initramfs. Which puts getting this working much farther out than I'd like.

I think I suggested a variant of this before but I'm increasingly thinking we should pursue it (or at least, consider it):

When we need to do something "heavy" (rootfs reprovisioning or upgrades), add a special coreos-firstboot.target that runs in the real root.

Now, when we want to do rootfs reprovisioning we still copy the rootfs to RAM, but then we switchroot to it, do what we need to do from there - and at this point we have SELinux policy loaded so we aren't blocking on the kernel patch.

Then, we can switch root again to the new root.

(Or, in the case where we need to apply kernel arguments or upgrade or package layer, we reboot)

jlebon commented 4 years ago

I've been thinking more about this topic because RHCOS+osmet is blocked on LUKS stuff and really full rootfs reprovisioning working from the initramfs is currently blocking on a kernel patch @jlebon is working on so we can read labels from the initramfs. Which puts getting this working much farther out than I'd like.

I haven't fully thought through your proposal yet, though specifically re. the above, I wanted to note that I've now submitted the patch for this upstream: https://lore.kernel.org/selinux/20200523195130.409607-1-jlebon@redhat.com/. I've also previously discussed with stakeholders that backporting this patch was going to be a requirement for the rootfs reprovisioning work landing in RHCOS and there's consensus on needing to expedite the process. So it may not be as far out as it seems.

Conan-Kudo commented 4 years ago

I actually would like to be able to reconfigure everything to btrfs (mainly to make initial provisioning mistakes not be essentially deadly and also to get out of dealing with overlayfs for container storage). What stops making that possible?

Conan-Kudo commented 4 years ago

Wouldn't this be UDisks?

That's a useful point, thanks; UDisks is also a daemon with a DBus API to manage storage that also doesn't currently have an Ignition-like declarative frontend. And you're right that it is well tested and maintained, and even currently supports dm-crypt as well as LVM.

However...UDisks is much more "raw" and less opinionated about the storage stack; Stratis for example tries to tie together various layers in a much more opinionated way and then having tooling that allows admins to not have to think about the "filesystem" versus "lvm" layers.

We definitely don't want opinions about the storage stack at a layer where the system administrators want to be able to set it up the way they need it to. Having the solid, generic abstraction in the form of UDisks is probably the right way to go here.

dustymabe commented 4 years ago

I think I agree with @Conan-Kudo on this one.

jlebon commented 4 years ago

I wanted to note that I've now submitted the patch for this upstream: lore.kernel.org/selinux/20200523195130.409607-1-jlebon@redhat.com

If anyone is playing along or wants to test this on top of the ongoing LUKS work, I did a scratch build with that patch:

https://koji.fedoraproject.org/koji/taskinfo?taskID=45999230

jlebon commented 4 years ago

Been doing work in this area in the context of LUKS support, but that really requires fleshing out our root configuration/discovery story first. This is going to be a bit long, though I just want to make sure we're all on the same page (or if I'm missing something, please correct me!). The primary goal here should be to solve the generic case of arbitrarily complex and layered devices in the root hierarchy. To start, I think it'd be useful to look at concrete examples of possible rootfs setups we would like to support and how those are traditionally configured:

MD RAID: configured via mdadm.conf (in the common case, one can rely on auto-activation and default settings, but customizing anything requires a mdadm.conf).
multipath: configured via multipath.conf (in the common case, one can use the new rd.multipath=default, but customizing anything requires a multipath.conf).
LUKS: configured via /etc/crypttab or kargs. A subset of this is bound LUKS containers using Clevis; configuration for this is stored in the LUKS header itself as a token.
LVM (not supported yet by Ignition, but ideally eventually it would): automatic activation by default. There are LVM knobs in /etc, though AFAICT it's more for higher-level management things rather than pointing at what block device it should activate (there is a knob though to filter devices which you don't want auto-activated). Worth noting that Atomic Host used LVM by default and I don't recall the need to regenerate the initrd to include /etc/lvm/lvm.conf ever coming up.

So that's the configuration part. The other piece of the puzzle is having a "rootmap" (to reuse the same terminology as https://github.com/coreos/fedora-coreos-tracker/issues/94#issue-390413431), which describes the subset of complex devices we actually need to activate in the initrd in order to get to the root device.

Now, it's worth looking at what traditional systems do for this. For example, as a test I installed a Fedora Server 32 with root on xfs on LUKS on RAID1. How this works is that Anaconda knows which "leaf" block device it's installing to (obviously) and asks blivet what other devices the root device depends on:

https://github.com/rhinstaller/anaconda/blob/efe96d7a05431afd12fd4d92dcdfd5d6cc134cea/pyanaconda/bootloader/base.py#L759-L765

For every device which the root device depends on, it gets the kernel argument(s) that are needed to make sure it gets activated on boot:

https://github.com/rhinstaller/anaconda/blob/efe96d7a05431afd12fd4d92dcdfd5d6cc134cea/pyanaconda/bootloader/base.py#L775

See e.g. what that looks like for LUKS:

https://github.com/storaged-project/blivet/blob/c2874b729778a999b699dc6a8f77bd0c86d52a82/blivet/devices/luks.py#L158

So then the final kernel cmdline in my case was:

BOOT_IMAGE=(hd0,msdos2)/vmlinuz-5.6.6-300.fc32.x86_64 root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f ro
rd.luks.uuid=luks-186006cd-e5ee-4e2d-988c-d6eb8dd917a2 rd.md.uuid=30b7f702:24b9a929:03434c8d:0ca8559d

(The rd.luks.uuid= and rd.md.uuid kargs are determined and injected by Anaconda itself via GRUB_CMDLINE_LINUX; the root= karg is determined by grub2-mkconfig and injected together with GRUB_CMDLINE_LINUX into the kernelopts grubenv, which is part of the BLS entry; thanks @martinezjavier!)

This is telling dracut/systemd:

activate the MD RAID device with UUID 30b7f702:24b9a929:03434c8d:0ca8559d
activate the LUKS device that shows up with UUID 186006cd-e5ee-4e2d-988c-d6eb8dd917a2
mount the root device that shows up with UUID 54687cc9-18c1-47a7-9fa8-e3c395d50e5f

It's important to note that this is just the rootmap, it's not the configuration itself. However, in the common case, either default configs are fine or it can be provided on the kernel command-line as well (for example, systemd-cryptsetup-generator supports luks.options).
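
To show how little information the rootmap actually carries, here is a minimal sketch that pulls it back out of a kernel command line like the one above. The function name and the returned dict shape are assumptions for illustration; only the three kargs seen in this example are handled, and quoting in the cmdline is ignored.

```python
ROOTMAP_KEYS = ("root", "rd.luks.uuid", "rd.md.uuid")


def rootmap_from_cmdline(cmdline):
    """Collect the rootmap-related kargs from a kernel cmdline string."""
    rootmap = {}
    for arg in cmdline.split():
        # Split on the first "=" only, so "root=UUID=..." keeps its value.
        key, sep, value = arg.partition("=")
        if sep and key in ROOTMAP_KEYS:
            rootmap.setdefault(key, []).append(value)
    return rootmap
```

Anything beyond these identifiers (for example mdadm or crypttab settings) is configuration, not rootmap, and would have to come from somewhere else.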

So here's my proposal:

We provide the rootmap via the same kernel arguments as on traditional systems (e.g. rd.luks.uuid, rd.md.uuid). This allows us to leverage the same dracut modules/systemd generators.
We optimize for the common case where zero initramfs configuration is needed.
We provide a mechanism for cheaply providing initramfs configuration when it is needed. This means avoiding rebuilding the entire initramfs, which is expensive (both computationally, and conceptually wrt the "image" paradigm). We discussed mechanisms for this before, though what I'm currently sketching out is an rpm-ostree initramfs-cfg command. For example, rpm-ostree initramfs-cfg --track mdadm.conf would create an overlay initrd which is kept in sync with the host mdadm.conf. That would then show up in rpm-ostree status. That said, I see this more as a stage-2 requirement.

Another bit worth mentioning is that in this proposal, we keep the root label as the API for users to tell us what the root device is (this is how https://github.com/coreos/fedora-coreos-config/pull/184 works), but as argued in https://github.com/coreos/fedora-coreos-tracker/issues/465, we always add root=UUID=$uuid. This allows us to work around the issue mentioned in https://github.com/coreos/fedora-coreos-tracker/issues/94#issuecomment-516173085: we don't have to care whether the user is reconfiguring the root filesystem in place (i.e. simply changing /dev/sda4 from xfs to ext4), or moving it to some other block device entirely.

I think one of the tricky bits here will be to create the rootmap in the first place. This was discussed in https://github.com/coreos/fedora-coreos-tracker/issues/94#issuecomment-517045698 as well, but it essentially comes down to whether we want it to be explicit (via users specifying some additional information, such as some FCC sugar, or specific GUIDs to use) or implicit (by us figuring it out). I'm leaning more towards the latter though would like to hear others' thoughts.

Essentially, once /sysroot is mounted, I think it wouldn't be too hard to go up the mount hierarchy and note down its type and UUID, given that each type is going to be part of the limited set of device types that Ignition supports setting up. We should do this in a real language though, not bash. Thinking of just shoving it in a hidden rpm-ostree coreos-rootmap command for now. Crucially, that command would then also update the BLS config to append the arguments.
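The "update the BLS config to append the arguments" step could look something like the sketch below. It is only an illustration under assumptions: `append_kargs` is a hypothetical helper (not an rpm-ostree interface), it operates on the text of a single BLS entry, and the sample entry contents are invented. It is written to be idempotent so that running it twice does not duplicate arguments.

```python
from typing import Iterable


def append_kargs(bls_entry: str, kargs: Iterable[str]) -> str:
    """Append kernel arguments to the `options` line of a BLS entry."""
    wanted = list(kargs)
    lines = bls_entry.splitlines()
    for i, line in enumerate(lines):
        parts = line.split(None, 1)
        if parts and parts[0] == "options":
            existing = parts[1].split() if len(parts) > 1 else []
            # Skip arguments that are already present.
            new = [k for k in wanted if k not in existing]
            lines[i] = " ".join(["options"] + existing + new)
            break
    else:
        # No options line yet: add one.
        lines.append(" ".join(["options"] + wanted))
    return "\n".join(lines) + "\n"


entry = (
    "title Example OS\n"
    "linux /vmlinuz-example\n"
    "initrd /initramfs-example.img\n"
    "options root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f rw\n"
)
print(append_kargs(entry, ["rd.md.uuid=30b7f702:24b9a929:03434c8d:0ca8559d"]))
```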

cgwalters commented 4 years ago

Thanks for the detailed writeup!

We provide the rootmap via the same kernel arguments as on traditional systems (e.g. rd.luks.uuid, rd.md.uuid). This allows us to leverage the same dracut modules/systemd generators.

Yep.

We provide a mechanism for cheaply providing initramfs configuration when it is needed.

Well we also went down the path in the various multipath PRs of potentially having /boot/etc/multipath.conf for example (IIRC). That could easily generalize to /boot/etc/mdadm.conf etc.

I think we should have a bit of a debate about this because, while inventing a new directory (/boot/etc) is not a trivial thing and imposes maintenance costs on all software we'd need to adapt for it...I hope people agree it's more elegant than appending to the initramfs. It follows the /usr (read-only vendor provided) versus /etc separation we use for the main OS.

darkmuggle commented 4 years ago

I think we should have a bit of a debate about this because, while inventing a new directory (/boot/etc) is not a trivial thing and imposes maintenance costs on all software we'd need to adapt for it...I hope people agree it's more elegant than appending to the initramfs. It follows the /usr (read-only vendor provided) versus /etc separation we use for the main OS.

One thing to consider: once the disk is claimed by mounting a file system, activation of RAID or multipath can no longer happen; that's why the config ends up in the initramfs. And using a boot path means iSCSI is out. For some configurations you need an appended initramfs or to have the config in the initramfs itself. And finally, boot paths do not work for netbooted configurations like PXE.

cgwalters commented 4 years ago

And using a boot path means iSCSI is out. For some configurations you need an appended initramfs or to have the config in the initramfs itself. And finally, boot paths do not work for netbooted configurations like PXE.

Right, iSCSI and multipath are about an entire physical block device (AFAIK). But here we're only scoping in the root partition, right? To take a concrete example, say I have a server with two disks of the same size and I want to use RAID 1. In this scenario, we'd end up with unused space on the second disk corresponding to the other partitions that aren't root (i.e. /boot and the ESP and the BIOS boot). We're not supporting RAID (or really, any complex storage) for /boot - right?

That avoids all circular dependencies around the boot partition.

Note that (as discussed in the multipath case too) it's not just us reading the label=boot partition - it's also GRUB. Any attempt to use complex storage for other components (most notably /boot and the ESP) basically doesn't work as far as I know. See e.g. this blog entry.

cgwalters commented 4 years ago

More links on this all saying the same thing:

So most people really just want "redundant boot partitions" - it probably would make sense to support that explicitly in userspace, basically automating what people are suggesting cobbling together via scripting. One advantage of this would be that people who choose to use e.g. btrfs/stratis for / with arbitrarily complex setups (such as LUKS/raid0) could still have "raid 1" for "boot data".
Once it's more mature this could be a feature of bootupd perhaps. (But clearly a disadvantage is it requires separate specification of "boot data" versus /)

cgwalters commented 4 years ago

And finally, boot paths do not work for netbooted configurations like PXE.

This though is definitely a strong point against /boot/etc.

EDIT: Wait, no - the flow here would be:

PXE boot into live OS w/coreos-installer and passing an Ignition config describing the desired state (e.g. "raid 1 for /")
coreos-installer runs targeting selected disk
reboot, Ignition runs and sets up the raid, writing to the label=boot partition of the single target disk, including /boot/etc there

That's it - the flow is identical to cloud and not special to PXE.

cgwalters commented 4 years ago

(BTW somewhat related to this - https://github.com/systemd/systemd/commit/a7e885587949b6793ccf389505f3c436315fa653 - but sadly not in RHEL8 so we can't depend on it yet...)

jlebon commented 4 years ago

I think we should have a bit of a debate about this because, while inventing a new directory (/boot/etc) is not a trivial thing and imposes maintenance costs on all software we'd need to adapt for it...I hope people agree it's more elegant than appending to the initramfs.

I would disagree. :) First, I think it helps to look at both approaches (/boot/etc and layered initrds) as mostly equivalent: they both simply boil down to adding stuff in /boot as a mechanism to provide files to the initrd stage. So they definitely share some of the same constraints.

However, I find a layered initrd much cleaner overall:

It's completely transparent to the initramfs code. This means that it can just as easily carry configuration for mdadm as for systemd, which is the very first configurable process that's started (and yes, we can keep asking systemd to support more knobs via kargs, but how sustainable is that?). With a /boot/etc config, you have to wait until the boot device becomes available, and by then, you would've already run a lot of code (at the very least you'd have to wait for udev, and e.g. that's after dracut's cmdline hook). Anything started before that which might've already peeked into /etc for its config would require hacks/workarounds (this reminds me of https://github.com/coreos/ignition/issues/635, see e.g. https://github.com/systemd/systemd/pull/11903 and https://github.com/coreos/ignition/pull/639). I mentioned this in the other threads, but to re-iterate here: we'd essentially be introducing the same design flaw at the initrd stage which led to Ignition being created in the first place for the real root.
It does not require adding more code to the initramfs. Let's face it, our initramfs is complex and likely to keep increasing in complexity. The more logic we shove there, the harder it'll be to maintain. If there's a way to trade initramfs code for real root code, it's worth investigating. Another way to look at this is that we're trading update staging speed for bootup speed. And another way to look at it is simply that less code in the initramfs means fewer things that can go wrong at boot time.
It meshes well with trusted boot approaches; it's just another initrd to verify. See also this comment on the subject. With a /boot/etc, it would fall on us in a trusted boot context to do the verification.

Overall I find layered initrds fit nicely with the rest of the OS design. We repeat the same pattern elsewhere, such as with rpm-ostree (which allows one to overlay packages on top of the OSTree; fully transparently to the rest of the OS), and container runtimes (which are all about writable layers on top of non-writable ones; fully transparently to the rest of the container).

cgwalters commented 4 years ago

With a /boot/etc config, you have to wait until the boot device becomes available, and by then, you would've already run a lot of code (at the very least you'd have to wait for udev, and e.g. that's after dracut's cmdline hook).

Remember the bootloader has already found the boot partition, mounted it and read the kernel and initramfs off of it. It's actually a bit unfortunate that the only way we've invented to pass data from that stage into the initramfs is via kargs or appending to the initramfs.

Anything started before that which might've already peeked into /etc for its config would require hacks/workarounds

Yes, though those hacks mostly come into play I think because of the dracut cmdline hook causing all sorts of stuff to happen; otherwise we could cleanly order startup of units against a coreos-propagate-bootetc.service being part of e.g. basic.target.

We repeat the same pattern elsewhere, such as with rpm-ostree (which allows one to overlay packages on top of the OSTree; fully transparently to the rest of the OS),

Layered initramfs images though are super opaque and don't even have standard userspace tooling to unpack them the same way the kernel does (AFAIK). Last time I looked at that I just ended up booting a linux kernel w/ rd.break to look around. Clearly, having rpm-ostree "own" and track what's getting put in there helps a lot though.

jlebon commented 4 years ago

Remember the bootloader has already found the boot partition, mounted it and read the kernel and initramfs off of it. It's actually a bit unfortunate that the only way we've invented to pass data from that stage into the initramfs is via kargs or appending to the initramfs.

Yeah, agreed. If there was a way to do this from the GRUB config itself, that could be an interesting third alternative.

Layered initramfs images though are super opaque and don't even have standard userspace tooling to unpack them the same way the kernel does (AFAIK).

lsinitrd works fine on those:

# mkdir -p rootfs/etc && cd rootfs
# echo 'foobar' > etc/foobar.conf
# find . -mindepth 1 -print0 | cpio -o -H newc -R root:root --quiet --reproducible --force-local --null -D . | gzip -1 > /boot/overlay.img
# lsinitrd /boot/overlay.img
Image: /boot/overlay.img: 2.0K
========================================================================
Version:

Arguments:
dracut modules:
========================================================================
drwxr-xr-x   2 root     root            0 Jun 29 21:36 etc
-rw-r--r--   1 root     root            7 Jun 29 21:36 etc/foobar.conf
========================================================================
# mktemp -d
/tmp/tmp.sZcABusIEu
# cd /tmp/tmp.sZcABusIEu
# lsinitrd --unpack /boot/overlay.img
# cat etc/foobar.conf
foobar

jlebon commented 4 years ago

Ohhh I think there's a slight mismatch here. I think you're talking about appending to the initramfs (right?), while I was thinking more of adding another initrd key to the BLS entries.

I was initially thinking having them split since it probably won't change nearly as often as the initrd itself and for trusted boot we could eventually ship the initrds signed with our keys. But hmm, appending to the initramfs would certainly be easier implementation-wise. I guess how it's actually implemented could be changed too down the road.
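A minimal sketch of the "another initrd key" variant, assuming the bootloader honours repeated `initrd` keys in a BLS entry (which is exactly the open question for some platforms). `add_initrd` is a hypothetical helper and the paths in the sample entry are invented.

```python
def add_initrd(bls_entry: str, initrd_path: str) -> str:
    """Add an extra `initrd` key after the existing ones, if not present."""
    lines = bls_entry.splitlines()
    last = None
    for i, line in enumerate(lines):
        parts = line.split(None, 1)
        if parts and parts[0] == "initrd":
            if len(parts) > 1 and parts[1].strip() == initrd_path:
                return bls_entry          # already listed, nothing to do
            last = i
    new_line = f"initrd {initrd_path}"
    if last is None:
        lines.append(new_line)
    else:
        # Keep the overlay after the main initrd so it is applied on top.
        lines.insert(last + 1, new_line)
    return "\n".join(lines) + "\n"


entry = (
    "title Example OS\n"
    "linux /vmlinuz-example\n"
    "initrd /initramfs-example.img\n"
    "options root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f\n"
)
print(add_initrd(entry, "/overlay.img"))
```

Keeping the overlay as a separate file means the main initrd stays byte-for-byte what was shipped, which is the property the trusted boot argument relies on.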

You're right that lsinitrd doesn't handle multiple concatenated initrds well. (Might not be too hard to fix that though.) But yeah, a big part of this is that the changes are handled by rpm-ostree and reflected in rpm-ostree status, so ideally you wouldn't normally have to dig around to look at the payload itself.
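For the concatenated case, splitting the image back into its members is not much code when every member is gzip-compressed. The sketch below makes that assumption explicit; real images may also contain uncompressed cpio segments or other compressors, which this hypothetical `split_gzip_members` helper does not handle.

```python
import gzip
import zlib
from typing import List


def split_gzip_members(blob: bytes) -> List[bytes]:
    """Decompress each gzip member of a concatenated image separately."""
    members: List[bytes] = []
    data = blob.lstrip(b"\0")
    while data:
        # wbits=31 selects the gzip container; a fresh object per member.
        decomp = zlib.decompressobj(31)
        members.append(decomp.decompress(data))
        # Whatever follows the end of this member belongs to the next one;
        # NUL padding between members is skipped.
        data = decomp.unused_data.lstrip(b"\0")
    return members


image = gzip.compress(b"main initrd") + b"\0" * 4 + gzip.compress(b"overlay")
for i, payload in enumerate(split_gzip_members(image)):
    print(i, len(payload))
```

Each returned payload would be a cpio archive that can then be listed or unpacked on its own.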

cgwalters commented 4 years ago

while I was thinking more of adding another initrd key to the BLS entries.

Ah. I like that idea but I'm uncertain if it's supported by all bootloaders we care about (which boils down to grub and zipl, and of those we'd need to double check the latter). OTOH if it's not supported then maybe the best thing is to make it supported.

martinezjavier commented 4 years ago

Ah. I like that idea but I'm uncertain if it's supported by all bootloaders we care about (which boils down to grub and zipl, and of those we'd need to double check the latter). OTOH if it's not supported then maybe the best thing is to make it supported.

It's not supported by zipl (more info in ibm-s390-tools/s390-tools/issues/49). It's also not supported by Petitboot (ppc64le with OPAL) as far as I know since it uses the kexec command line tool and that only allows passing a single initrd to the kernel for kexec.

jlebon commented 4 years ago

Ah. I like that idea but I'm uncertain if it's supported by all bootloaders we care about (which boils down to grub and zipl, and of those we'd need to double check the latter). OTOH if it's not supported then maybe the best thing is to make it supported.

It's not supported by zipl (more info in ibm-s390-tools/s390-tools/issues/49). It's also not supported by Petitboot (ppc64le with OPAL) as far as I know since it uses the kexec command line tool and that only allows passing a single initrd to the kernel for kexec.

Thanks. I had come across the zipl issue, but hadn't realized ppc64le would also be an issue. Hmm, Petitboot runs in its own complete Linux environment, right? Could it concatenate the initrds itself and pass that to kexec?

I mentioned this in https://github.com/coreos/fedora-coreos-config/pull/548#discussion_r466780993 too, but to re-iterate: I think pursuing this path is still worth it overall. Let's try to get it supported on platforms that don't yet, and in the meantime, we can fall back to regenerating the whole initramfs.
