Closed: ajeddeloh closed this issue 4 years ago
We discussed this briefly in the meeting today - There are two things I'd like (personal opinion) to do as part of this:
Andrew also mentioned that he'd like the solution to be generic enough that if we decide to support new use cases in the future our current solution doesn't prohibit us from doing so. If we build something new, this is reasonable.
Lots of discussion around partitioning in https://github.com/coreos/fedora-coreos-tracker/issues/18
Let's repurpose this issue as a tracker for support for reconfiguring the root filesystem in any form; RAID is a subset of that.
Bigger picture, this gets a lot more complex with the combination of FCOS:(ostree, ignition) as opposed to ContainerLinux:(dual partition /usr, ignition), because with CL replacing the rootfs is nearly trivial, but for ostree it is entangled with the OS (which adds flexibility, but does make this problem more complex).
There are two sub-paths to this problem: One where we're not destroying the existing rootfs (e.g. we're just adding a new partition that we want to make into the rootfs), and one where we're intending to replace it.
I think if we're trying to replace the rootfs...we get into an interesting problem domain because we need to pull the ostree repo into RAM temporarily. In most cases...hmm, that should be fine really. But, if we want to optimize things, I could imagine that coreos-installer
accepts Ignition, and if a replacement of the root filesystem is requested, we do that at install time as opposed to boot time. We'd then pass the remaining Ignition as /boot/ignition.ign.
And we need support for running ostree in the initramfs - shouldn't be too hard.
The flow would be something like:
Hadn't considered pulling it into RAM; I like that idea. I was thinking about just pulling over the network, but RAM has some obvious speed/bandwidth advantages. There are some tricky bits: we'd need to be careful with the device fields in the config since those could be using those symlinks.
I could imagine that coreos-installer accepts Ignition, and if a replacement of the root filesystem is requested, we do that at install time as opposed to boot time
This doesn't help on clouds. I'd rather have a unified way. In some ways it would be easier to just do all of disks then, but that's got its own challenges and deserves its own ticket.
Proposal:
- Detect that the rootfs is being recreated by looking for a filesystem entry with "label": "root". We can't (reasonably) wait on udev for the device since if it's on RAID or similar it won't become available until after ignition-disks. This generates false positives for cases like resizing root to be bigger.
- Ignore filesystem entries that just re-declare "label": "root". I can't think of a reason to include a rootfs definition matching the default anyway.
- Disallow mounts with "path": "/" since the initramfs owns mounting those.
- ostree-redeploy-rootfs needs to happen before ignition-mount.

The sequence would look like:
1) ignition-render
2) detect-root-change (needs to understand the Ignition spec but probably shouldn't be part of the Ignition project)
3) ignition-disks
4) rootfs mounted
5) ostree-redeploy-rootfs
6) ignition-mount
7) ignition-files
8) ignition-umount
9) switch root
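The ordering constraints in that sequence could be sketched as systemd unit dependencies. A rough, hypothetical fragment (the ostree-redeploy-rootfs unit name is assumed from the stage names above, not an actual shipped unit):

```
# ostree-redeploy-rootfs.service (hypothetical sketch)
[Unit]
Description=Redeploy the OSTree-based rootfs
DefaultDependencies=false
# run after Ignition has created the new block devices and the rootfs is mounted
After=ignition-disks.service sysroot.mount
# ...but before Ignition starts mounting/writing files into the new root
Before=ignition-mount.service
OnFailure=emergency.target
```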
This still leaves open the question of "how do we find/mount root on subsequent boots".
An unrelated question is: do we support blowing away /boot? My gut says no. On x86_64 you can blow away /efi
when running on BIOS (we may need to mask the mount unit for it) or BIOS-BOOT
when running on EFI. I don't think we can get rid of /boot though. It's also partition 1, so it shouldn't get in the way of anything, and it's not huge. We probably could find a way to allow moving it, but I really don't think it's worth the complexity that would entail.
We can't do it if the ostree size > available RAM, where available RAM will be slightly smaller than it appears because we need to hold the ostree in RAM while ignition-disks runs.
Yeah I had the same hesitation but then I realized two things. First, the current FCOS is 1.5G. I think if you're doing something like RAID we can presume you have a "serious" system and have at least say 8G of RAM.
Second, we could enable zram temporarily.
I realized I initially missed the bit about "what if we're just moving the rootfs and the old one will still be accessible". Do you think it's worth the complexity to implement that? If so I think we ought to do this in two phases, handling the generic case first then treating that as an optimization.
Second, we could enable zram temporarily.
Neat! I think we ought to handle that second and only enable it when we don't have enough ram to do it without.
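A sketch of what "enable zram temporarily" could look like (illustrative only, not runnable here: requires root and the zram kernel module; the size is an arbitrary assumption):

```
modprobe zram
dev=$(zramctl --find --size 4GiB)   # attach a compressed RAM disk
mkswap "$dev"
swapon --priority 100 "$dev"
# ... memory-hungry rootfs copy happens here ...
swapoff "$dev"
zramctl --reset "$dev"
```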
We can't (reasonably) wait on udev for the device since if it's on raid or similar it won't become available until after ignition-disks. This generates false positives for cases like resizing root to be bigger.
I'm confused...doesn't this stuff imply Ignition running twice? We know the initial rootfs we're booting from isn't on RAID because our default disk image doesn't do that.
I'm confused...doesn't this stuff imply Ignition running twice? We know the initial rootfs we're booting from isn't on RAID because our default disk image doesn't do that.
I'm not sure why it would imply running Ignition twice? Unless you're referring to getting the config, in which case that's why I'm suggesting we split that into its own stage. I'm saying it's nontrivial to know with 100% certainty that the root is being destroyed since the block device has aliases that we would need to wait for. I suppose we could, but that makes the live pxe case harder since there isn't a device and some aliases aren't known before boot.
Consider if FCOS is installed with /dev/sda4 being root. That can be referenced by /dev/sda4
, /dev/disk/by-label/root
and /dev/disk/by-partlabel/root
(and some others that are system dependent). I'm saying we can't detect if a user is blowing away the partition because:
1) We don't know what disk we're installed on (maybe we should add a udev rule for that? create something like /dev/disk/installed
. Seems useful in general and could improve config portability), so if they're wiping the filesystem with device: sda4
we don't know if that is actually the rootfs or not since we might be installed on sdb instead.
2) Those symlinks may not exist yet so if the user is using them we don't know where they point. Things like /dev/disk/by-path/*
are going to be dependent on the hardware so we really don't know if they're pointing to the rootfs. We can't wait for them either because while /dev/disk/by-label/root
pointing to the rootfs is obvious, something like /dev/disk/by-path/<something>
is not. We don't have a way to differentiate those from other symlinks that could appear as a result of ignition-disks. We don't know which ones are safe to wait on and which are not.
Because of that we should be detecting that we're creating a new rootfs instead of deleting the old one. We can either do that with partitions looking for a partlabel (which can have false positives when just resizing) or with the filesystem label (which can have false positives if they create an entry that matches the on-disk FS (see https://github.com/coreos/ignition/blob/master/doc/operator-notes.md#filesystem-reuse-semantics), though I'm not sure why you'd do that).
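To make that heuristic concrete, here's a tiny sketch of the filesystem-label check. The JSON snippet and the grep-based matching are purely illustrative; a real implementation would parse the rendered config properly:

```shell
#!/bin/sh
# Illustrative only: flag a config as "recreating root" if it defines a
# filesystem labeled "root". As noted above, this can false-positive.
wants_new_rootfs() {
  # naive string match; real code would use a JSON parser
  grep -q '"label": *"root"'
}

cfg='{"storage":{"filesystems":[{"device":"/dev/md/data","format":"xfs","label":"root","wipeFilesystem":true}]}}'
if printf '%s' "$cfg" | wants_new_rootfs; then
  echo "root reprovisioning requested"
fi
```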
We discussed this at the FCOS IRC meeting today, mostly about how to handle the second boot. There are essentially five options:
1) Do what Container Linux does and require that partitions containing complex devices (e.g. LUKS volumes, parts of RAID, etc.) use GPT type GUIDs to specify what they are. The initramfs can then carry udev rules to start these devices (rd.auto).
In cases 2-5 there's also the question of "how do we specify what is needed to start the root disk?" In case 1 it's implied that you specify it with type GUIDs. We could, for example, reference devices by stable name (/dev/md/name
instead of /dev/mdXXX
).
Bit of data: looks like the kernel cmdline size limit is 2048 on all the arches we care about except s390x, where it is 896.
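For a sense of scale, a quick sketch checking a rootmap-style karg set (UUID values are just examples borrowed from the Fedora Server experiment later in this thread) against the smallest limit:

```shell
#!/bin/sh
# Sketch: measure how much of the 896-byte s390x cmdline budget a typical
# rootmap consumes (UUIDs illustrative)
kargs="root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f rd.luks.uuid=luks-186006cd-e5ee-4e2d-988c-d6eb8dd917a2 rd.md.uuid=30b7f702:24b9a929:03434c8d:0ca8559d"
len=${#kargs}
echo "rootmap kargs use $len of 896 bytes"
```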
Carrying over from #18 since it's closed.
Re: dd
and GPT: 512n and 512e can share one GPT; 4Kn needs a different GPT. I don't see how they can be merged without invalidating the Partition Entry Array CRC32, which is computed on the content between the header and the First Usable LBA. And the First Usable LBA on 512-byte-sector devices must be 34 or higher.
Re: mkfs.xfs uses a 4096 byte sector size on 512e and 4Kn, but uses 512 bytes on 512n.
Those two points combined suggest three raw images, unless something gives.
Also, when dd'd to a different sized drive, the GPT becomes invalid, so that should probably get fixed as a first step.
Each RAID layout should get its own mkfs.xfs for setting stripe width/unit
On 4Kn drives, minimum EFI system partition size is 256MiB if it's FAT32, which the UEFI spec does suggest it should be, sorta. But mkdosfs only starts using FAT32 above something near 550MiB, unless -F 32
is used. fedora-coreos-30.20190905.0-metal.raw has a FAT16 ESP. Obviously it works, and I haven't ever heard of FAT16 not working on anything.
Re: ZRAM: it dynamically allocates RAM, so setting up swap on ZRAM at 1:1 to RAM has inconsequential overhead on a system that doesn't need it, but you could do a test similar to Anaconda's implementation and only start it for low-memory devices. I'd like to see this upstream version working and rebase the myriad implementations on it: https://github.com/systemd/zram-generator
I'm working on the fundamentals of this in https://github.com/coreos/ignition-dracut/pull/107
First MVP is to support this fcct:
```yaml
# Example of redeploying the rootfs as ext4
variant: fcos
version: 1.0.0
storage:
  filesystems:
    - device: /dev/disk/by-partlabel/root
      format: ext4
      wipe_filesystem: true
      label: root
```
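(For reference, with the standard fcct workflow this FCC would be transpiled to an Ignition config with something like the following; the container image tag is illustrative:)

```
podman run -i --rm quay.io/coreos/fcct:release --pretty --strict < example.fcc > example.ign
```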
Tracker for issues:
To reiterate on some discussion in that PR, this is to help enable https://github.com/openshift/enhancements/pull/15 without requiring the root filesystem to come in a LUKS container by default (and relatedly, ensure that when a user chooses LUKS it can be encrypted from the start with a TPM2-sealed key, etc.)
Sorry, I didn't mean to close. I have NO idea why "close and comment" is next to the comment button.
https://lore.kernel.org/linux-fsdevel/157867221257.17873.1775090991929862549.stgit@devnote2/ proposes appending to the initrd as a way to provide larger kernel configuration data; relevant here.
I was talking with a Stratis developer today. One thing worth thinking about is whether it would make sense to potentially focus on having Ignition configure Stratis. That way we get out of the business of doing e.g. raid/lvm/dm-crypt directly ourselves. (He said Stratis is working on dm-crypt)
I could imagine for example if Stratis itself provided a stable declarative language (not sure it does today) then we could just pass that through too.
I heard from @Conan-Kudo that Stratis was a failed experiment
Ehh, it's still around and being developed (but not by the guy who designed it, who left), but I don't see enthusiasm from anybody over it.
I could imagine for example if Stratis itself provided a stable declarative language (not sure it does today) then we could just pass that through too.
If there were a good healthy project for this I would certainly love to depend on another project for this if we could, kind of like how we're passing off network management to NetworkManager.
If there were a good healthy project for this I would certainly love to depend on another project for this if we could, kind of like how we're passing off network management to NetworkManager.
Wouldn't this be UDisks?
Wouldn't this be UDisks?
That's a useful point, thanks; UDisks is also a daemon with a DBus API to manage storage that also doesn't currently have an Ignition-like declarative frontend. And you're right that it is well tested and maintained, and even currently supports dm-crypt as well as LVM.
However...UDisks is much more "raw" and less opinionated about the storage stack; Stratis for example tries to tie together various layers in a much more opinionated way and then having tooling that allows admins to not have to think about the "filesystem" versus "lvm" layers.
that also doesn't currently have an Ignition-like declarative frontend.
(OK there's kickstart but I think Anaconda is still using blivet and not udisks/storaged?)
To support Stratis requires moving away from dd-based installations. Stratis expects to have complete control over the entire stack. That suggests rsync?
I think Stratis makes sense from a Red Hat business perspective; there are many md/mdadm, device-mapper, LVM, and XFS developers. But I don't think it's that interesting technology-wise for Fedora.
As it relates to this issue, I think more interesting is for CoreOS to revisit its Btrfs roots. A different direction was once necessary for entirely valid reasons.
Related, I wonder whether it'd be useful to remove Btrfs complexity in Anaconda, and use Ignition instead to do things like setup subvolumes and raid.
To support Stratis requires moving away from dd based installations.
No. This whole issue is about having Ignition support for reprovisioning the rootfs on boot. For a bare metal install, the flow would be to dd the "default" (ext4 /boot, xfs /) image to disk, and on boot the Ignition config would say to use stratis, and we copy the rootfs to RAM, set up stratis for /, then copy the rootfs back.
See the discussion above.
Yes, it may seem silly on bare metal to copy an image only to rewrite it on the next boot, but the big advantage is total parity between cloud and bare metal; one can just as easily use RAID/stratis/luks/whatever in AWS/GCP/etc. as one can on metal.
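The save/restore dance can be modeled with plain directories; this toy sketch (paths hypothetical; real code would stash into a tmpfs and operate on /sysroot) is only meant to illustrate the ordering:

```shell
#!/bin/sh
set -e
# Toy model of: stash rootfs contents "in RAM", let ignition-disks recreate
# the filesystem, then restore the contents onto the new filesystem.
workdir=$(mktemp -d)
cd "$workdir"
mkdir sysroot ramstash
echo "ostree-repo-data" > sysroot/file

cp -a sysroot/. ramstash/            # 1. copy the rootfs contents away
rm -rf sysroot && mkdir sysroot      # 2. disks stage recreates the fs
cp -a ramstash/. sysroot/            # 3. copy the rootfs back
cat sysroot/file                     # prints "ostree-repo-data"
```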
we copy the rootfs to RAM ... then copy the rootfs back
What does copy mean? cp -a? rsync -a? xfs_copy?
What does copy mean? cp -a? rsync -a? xfs_copy?
I wanted to implement this in rpm-ostree, because ostree admin deploy
is 90% of it. But, rsync
would also be OK probably.
Stratis discussion:
I've been thinking more about this topic because RHCOS+osmet is blocked on LUKS stuff and really full rootfs reprovisioning working from the initramfs is currently blocking on a kernel patch @jlebon is working on so we can read labels from the initramfs. Which puts getting this working much farther out than I'd like.
I think I suggested a variant of this before but I'm increasingly thinking we should pursue it (or at least, consider it):
When we need to do something "heavy" (rootfs reprovisioning or upgrades), add a special coreos-firstboot.target
that runs in the real root.
Now, when we want to do rootfs reprovisioning we still copy the rootfs to RAM, but then we switchroot to it, do what we need to do from there - and at this point we have SELinux policy loaded so we aren't blocking on the kernel patch.
Then, we can switch root again to the new root.
(Or, in the case where we need to apply kernel arguments or upgrade or package layer, we reboot)
I've been thinking more about this topic because RHCOS+osmet is blocked on LUKS stuff and really full rootfs reprovisioning working from the initramfs is currently blocking on a kernel patch @jlebon is working on so we can read labels from the initramfs. Which puts getting this working much farther out than I'd like.
I haven't fully thought through yet your proposal, though specifically re. the above, I wanted to note that I've now submitted the patch for this upstream: https://lore.kernel.org/selinux/20200523195130.409607-1-jlebon@redhat.com/. I've also previously discussed with stakeholders that backporting this patch was going to be a requirement for the rootfs reprovisioning work landing in RHCOS and there's consensus on needing to expedite the process. So it may not be as far out as it seems.
I actually would like to be able to reconfigure everything to btrfs (mainly to make initial provisioning mistakes not be essentially deadly, and also to get out of dealing with overlayfs for container storage). What's stopping us from making that possible?
Wouldn't this be UDisks?
That's a useful point, thanks; UDisks is also a daemon with a DBus API to manage storage that also doesn't currently have an Ignition-like declarative frontend. And you're right that it is well tested and maintained, and even currently supports dm-crypt as well as LVM.
However...UDisks is much more "raw" and less opinionated about the storage stack; Stratis for example tries to tie together various layers in a much more opinionated way and then having tooling that allows admins to not have to think about the "filesystem" versus "lvm" layers.
We definitely don't want opinions about the storage stack at a layer where the system administrators want to be able to set it up the way they need it to. Having the solid, generic abstraction in the form of UDisks is probably the right way to go here.
I think I agree with @Conan-Kudo on this one.
I wanted to note that I've now submitted the patch for this upstream: lore.kernel.org/selinux/20200523195130.409607-1-jlebon@redhat.com
If anyone is playing along or wants to test this on top of the ongoing LUKS work, I did a scratch build with that patch:
https://koji.fedoraproject.org/koji/taskinfo?taskID=45999230
Been doing work in this area in the context of LUKS support, but that really requires fleshing out our root configuration/discovery story first. This is going to be a bit long, though I just want to make sure we're all on the same page (or if I'm missing something, please correct me!). The primary goal here should be to solve the generic case of arbitrarily complex and layered devices in the root hierarchy. To start, I think it'd be useful to look at concrete examples of possible rootfs setups we would like to support and how those are traditionally configured:
- RAID: traditionally configured via mdadm.conf (in the common case, one can rely on auto-activation and default settings, but customizing anything requires an mdadm.conf).
- multipath: traditionally configured via multipath.conf (in the common case, one can use the new rd.multipath=default, but customizing anything requires a multipath.conf).
- LUKS: traditionally configured via /etc/crypttab or kargs. A subset of this is bound LUKS containers using Clevis; configuration for this is stored in the LUKS header itself as a token.
- LVM: configuration lives in /etc, though AFAICT it's more for higher-level management things rather than pointing at what block device it should activate (there is a knob though to filter devices which you don't want auto-activated). Worth noting that Atomic Host used LVM by default and I don't recall the need to regenerate the initrd to include /etc/lvm/lvm.conf ever coming up.

So that's the configuration part. The other piece of the puzzle is having a "rootmap" (to reuse the same terminology as https://github.com/coreos/fedora-coreos-tracker/issues/94#issue-390413431), which is about what subset of complex devices we actually need to activate in the initrd in order to get to the root device.
Now, it's worth looking at what traditional systems do for this. For example, as a test I installed Fedora Server 32 with root on xfs on LUKS on RAID1. How this works is that Anaconda knows which "leaf" block device it's installing to (obviously) and asks blivet what other devices the root device depends on:
For every device which the root device depends on, it gets the kernel argument(s) that are needed to make sure it gets activated on boot:
See e.g. what that looks like for LUKS:
So then the final kernel cmdline in my case was:
```
BOOT_IMAGE=(hd0,msdos2)/vmlinuz-5.6.6-300.fc32.x86_64 root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f ro rd.luks.uuid=luks-186006cd-e5ee-4e2d-988c-d6eb8dd917a2 rd.md.uuid=30b7f702:24b9a929:03434c8d:0ca8559d
```
(The rd.luks.uuid= and rd.md.uuid= kargs are determined and injected by Anaconda itself via GRUB_CMDLINE_LINUX; the root= karg is determined by grub2-mkconfig and injected together with GRUB_CMDLINE_LINUX into the kernelopts grubenv, which is part of the BLS entry; thanks @martinezjavier!)
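In other words, on that traditional system the rootmap pieces originate from something like the following (values copied from the cmdline above; the file contents are an illustrative reconstruction, not copied from the installed system):

```
# /etc/default/grub (illustrative reconstruction)
GRUB_CMDLINE_LINUX="rd.luks.uuid=luks-186006cd-e5ee-4e2d-988c-d6eb8dd917a2 rd.md.uuid=30b7f702:24b9a929:03434c8d:0ca8559d"
```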
This is telling dracut/systemd:
It's important to note that this is just the rootmap, it's not the configuration itself. However, in the common case, either default configs are fine or it can be provided on the kernel command-line as well (for example, systemd-cryptsetup-generator
supports luks.options
).
So here's my proposal:
- We provide the rootmap via the same kernel arguments as on traditional systems (e.g. rd.luks.uuid, rd.md.uuid). This allows us to leverage the same dracut modules/systemd generators.
- We provide a mechanism for cheaply providing initramfs configuration when it is needed, e.g. an rpm-ostree initramfs-cfg command. For example, rpm-ostree initramfs-cfg --track mdadm.conf would create an overlay initrd which is kept in sync with the host mdadm.conf. That would then show up in rpm-ostree status. That said, I see this more as a stage-2 requirement.

Another bit worth mentioning is that in this proposal, we keep the root label as the API for users to tell us what the root device is (this is how https://github.com/coreos/fedora-coreos-config/pull/184 works), but as argued in https://github.com/coreos/fedora-coreos-tracker/issues/465, we always add root=UUID=$uuid. This allows us to work around the issue mentioned in https://github.com/coreos/fedora-coreos-tracker/issues/94#issuecomment-516173085: we don't have to care whether the user is reconfiguring the root filesystem in place (i.e. simply changing /dev/sda4 from xfs to ext4), or moving it to some other block device entirely.
I think one of the tricky bits here will be to create the rootmap in the first place. This was discussed in https://github.com/coreos/fedora-coreos-tracker/issues/94#issuecomment-517045698 as well, but it essentially comes down to whether we want it to be explicit (via users specifying some additional information, such as some FCC sugar, or specific GUIDs to use) or implicit (by us figuring it out). I'm leaning more towards the latter though would like to hear others' thoughts.
Essentially, once /sysroot
is mounted, I think it wouldn't be too hard to go up the mount hierarchy and note down its type and UUID, given that each type is going to be part of the limited set of device types that Ignition supports setting up. We should do this in a real language though, not bash. Thinking of just shoving it in a hidden rpm-ostree coreos-rootmap
command for now. Crucially, that command would then also update the BLS config to append the arguments.
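As a minimal sketch of the type-to-karg mapping such a walk could produce (the emit_rootmap helper and its input format are hypothetical; real code would derive "TYPE UUID" pairs from the /sysroot source device, e.g. via lsblk -s, and handle details like md's colon-separated UUID format):

```shell
#!/bin/sh
# Hypothetical sketch: map each ancestor device type to the karg that
# ensures it gets activated in the initrd. Input: "TYPE UUID" lines,
# leaf first, as a parent walk would produce them.
emit_rootmap() {
  while read -r type uuid; do
    case "$type" in
      crypt)  echo "rd.luks.uuid=$uuid" ;;
      raid*)  echo "rd.md.uuid=$uuid" ;;
      mpath)  echo "rd.multipath=default" ;;
    esac
  done
}

# Example: root filesystem on a LUKS container on a RAID1 array
printf 'part 54687cc9\ncrypt 186006cd\nraid1 30b7f702\n' | emit_rootmap
```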
Thanks for the detailed writeup!
We provide the rootmap via the same kernel arguments as on traditional systems (e.g. rd.luks.uuid, rd.md.uuid). This allows us to leverage the same dracut modules/systemd generators.
Yep.
We provide a mechanism for cheaply providing initramfs configuration when it is needed.
Well we also went down the path in the various multipath PRs of potentially having /boot/etc/multipath.conf
for example (IIRC). That could easily generalize to /boot/etc/mdadm.conf
etc.
I think we should have a bit of a debate about this because, while inventing a new directory (/boot/etc
) is not a trivial thing and imposes maintenance costs on all software we'd need to adapt for it...I hope people agree it's more elegant than appending to the initramfs. It follows the /usr
(read-only vendor provided) versus /etc
separation we use for the main OS.
I think we should have a bit of a debate about this because, while inventing a new directory (
/boot/etc
) is not a trivial thing and imposes maintenance costs on all software we'd need to adapt for it...I hope people agree it's more elegant than appending to the initramfs. It follows the/usr
(read-only vendor provided) versus/etc
separation we use for the main OS.
One thing to consider: once the disk is claimed by mounting a filesystem, activation of RAID or multipath cannot happen; that's why the config ends up in the initramfs. And using a /boot path means iSCSI is out. For some configurations you need an appended initramfs, or to have the config in the initramfs itself. And finally, /boot paths do not work for netbooted configurations like PXE.
And using a boot path means iSCSI is out. For some configuration you need an appended initramfs or having it in the initramfs. And finally, boot paths does not work for netbooted configurations like PXE.
Right, iSCSI and multipath are about an entire physical block device (AFAIK). But here we're only scoping in the root partition, right? To take a concrete example, say I have a server with two disks of the same size and I want to use RAID 1. In this scenario, we'd end up with unused space on the second disk corresponding to the other partitions that aren't root (i.e. /boot
and the ESP and the BIOS boot). We're not supporting RAID (or really, any complex storage) for /boot
- right?
That avoids all circular dependencies around the boot partition.
Note that (as discussed in the multipath case too) it's not just us reading the label=boot
partition - it's also GRUB. Any attempt to use complex storage for other components (most notably /boot
and the ESP) basically doesn't work as far as I know. See e.g. this blog entry.
More links on this all saying the same thing:
So most people really just want "redundant boot partitions" - it probably would make sense to support that explicitly in userspace, basically automating what people are suggesting cobbling together via scripting. One advantage of this would be that people who chose to use e.g. btrfs/stratis for /
with arbitrarily complex setups (such as LUKS/raid0) could still have "raid 1" for "boot data".
Once it's more mature this could be a feature of bootupd perhaps. (But clearly a disadvantage is it requires separate specification of "boot data" versus /
)
And finally, boot paths does not work for netbooted configurations like PXE.
This though is definitely a strong point against /boot/etc
.
EDIT: Wait, no - the flow here would be to write the config to the label=boot partition of the single target disk, including /boot/etc there. That's it - the flow is identical to cloud and not special to PXE.
(BTW somewhat related to this - https://github.com/systemd/systemd/commit/a7e885587949b6793ccf389505f3c436315fa653 - but sadly not in RHEL8 so we can't depend on it yet...)
I think we should have a bit of a debate about this because, while inventing a new directory (
/boot/etc
) is not a trivial thing and imposes maintenance costs on all software we'd need to adapt for it...I hope people agree it's more elegant than appending to the initramfs.
I would disagree. :) First, I think it helps to look at both approaches (/boot/etc
and layered initrds) as mostly equivalent: they both simply boil down to adding stuff in /boot
as a mechanism to provide files to the initrd stage. So they definitely share some of the same constraints.
However, I find a layered initrd much cleaner overall:
- With a /boot/etc config, you have to wait until the boot device becomes available, and by then you would've already run a lot of code (at the very least you'd have to wait for udev, and e.g. that's after dracut's cmdline hook). Anything started before that which might've already peeked into /etc for its config would require hacks/workarounds (this reminds me of https://github.com/coreos/ignition/issues/635, see e.g. https://github.com/systemd/systemd/pull/11903 and https://github.com/coreos/ignition/pull/639). I mentioned this in the other threads, but to re-iterate here: we'd essentially be introducing the same design flaw at the initrd stage which led to Ignition being created in the first place for the real root.
- With /boot/etc, it would fall on us in a trusted boot context to do the verification.

Overall I find layered initrds fit nicely with the rest of the OS design. We repeat the same pattern elsewhere, such as with rpm-ostree (which allows one to overlay packages on top of the OSTree; fully transparently to the rest of the OS), and container runtimes (which are all about writable layers on top of non-writable ones; fully transparently to the rest of the container).
With a /boot/etc config, you have to wait until the boot device becomes available, and by then, you would've already run a lot of code (at the very least you'd have to wait for udev, and e.g. that's after dracut's cmdline hook).
Remember the bootloader has already found the boot partition, mounted it and read the kernel and initramfs off of it. It's actually a bit unfortunate that the only way we've invented to pass data from that stage into the initramfs is via kargs or appending to the initramfs.
Anything started before that which might've already peeked into /etc for its config would require hacks/workarounds
Yes, though those hacks mostly come into play I think because of the dracut cmdline hook causing all sorts of stuff to happen; otherwise we could cleanly order startup of units against a coreos-propagate-bootetc.service
being part of e.g. basic.target
.
We repeat the same pattern elsewhere, such as with rpm-ostree (which allows one to overlay packages on top of the OSTree; fully transparently to the rest of the OS),
Layered initramfs images though are super opaque and don't even have standard userspace tooling to unpack them the same way the kernel does (AFAIK). Last time I looked at that I just ended up booting a linux kernel w/ rd.break
to look around. Clearly by having rpm-ostree "own" and track what's getting put in there helps a lot though.
Remember the bootloader has already found the boot partition, mounted it and read the kernel and initramfs off of it. It's actually a bit unfortunate that the only way we've invented to pass data from that stage into the initramfs is via kargs or appending to the initramfs.
Yeah, agreed. If there was a way to do this from the GRUB config itself, that could be an interesting third alternative.
Layered initramfs images though are super opaque and don't even have standard userspace tooling to unpack them the same way the kernel does (AFAIK).
lsinitrd
works fine on those:
```
# mkdir -p rootfs/etc && cd rootfs
# echo 'foobar' > etc/foobar.conf
# find . -mindepth 1 -print0 | cpio -o -H newc -R root:root --quiet --reproducible --force-local --null -D . | gzip -1 > /boot/overlay.img
# lsinitrd /boot/overlay.img
Image: /boot/overlay.img: 2.0K
========================================================================
Version:
Arguments:
dracut modules:
========================================================================
drwxr-xr-x   2 root root   0 Jun 29 21:36 etc
-rw-r--r--   1 root root   7 Jun 29 21:36 etc/foobar.conf
========================================================================
# mktemp -d
/tmp/tmp.sZcABusIEu
# cd /tmp/tmp.sZcABusIEu
# lsinitrd --unpack /boot/overlay.img
# cat etc/foobar.conf
foobar
```
Ohhh I think there's a slight mismatch here. I think you're talking about appending to the initramfs (right?), while I was thinking more of adding another initrd
key to the BLS entries.
I was initially thinking having them split since it probably won't change nearly as often as the initrd itself and for trusted boot we could eventually ship the initrds signed with our keys. But hmm, appending to the initramfs would certainly be easier implementation-wise. I guess how it's actually implemented could be changed too down the road.
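For illustration, a BLS entry with a second initrd key might look like this (title and paths invented; whether each bootloader honors multiple initrd lines is exactly the open question):

```
title   Fedora CoreOS (hypothetical)
linux   /ostree/fedora-coreos-x/vmlinuz-5.6.6-300.fc32.x86_64
initrd  /ostree/fedora-coreos-x/initramfs.img
initrd  /overlay.img
options root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f rw
```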
You're right that lsinitrd
doesn't handle multiple concatenated initrds well. (Might not be too hard to fix that though.) But yeah, a big part of this is that the changes are handled by rpm-ostree
and reflected in rpm-ostree status
, so ideally you wouldn't normally have to dig around to look at the payload itself.
while I was thinking more of adding another initrd key to the BLS entries.
Ah. I like that idea but I'm uncertain if it's supported by all bootloaders we care about (which boils down to grub and zipl, and of those we'd need to double check the latter). OTOH if it's not supported then maybe the best thing is to make it supported.
Ah. I like that idea but I'm uncertain if it's supported by all bootloaders we care about (which boils down to grub and zipl, and of those we'd need to double check the latter). OTOH if it's not supported then maybe the best thing is to make it supported.
It's not supported by zipl (more info in ibm-s390-tools/s390-tools/issues/49). It's also not supported by Petitboot (ppc64le with OPAL) as far as I know since it uses the kexec command line tool and that only allows to pass a single initrd to the kernel for kexec.
Ah. I like that idea but I'm uncertain if it's supported by all bootloaders we care about (which boils down to grub and zipl, and of those we'd need to double check the latter). OTOH if it's not supported then maybe the best thing is to make it supported.
It's not supported by zipl (more info in ibm-s390-tools/s390-tools/issues/49). It's also not supported by Petitboot (ppc64le with OPAL) as far as I know since it uses the kexec command line tool and that only allows to pass a single initrd to the kernel for kexec.
Thanks. I had come across the zipl issue, but hadn't realized ppc64le would also be an issue. Hmm, Petitboot runs in its own complete Linux environment, right? Could it concatenate the initrds itself and pass that to kexec
?
I mentioned this in https://github.com/coreos/fedora-coreos-config/pull/548#discussion_r466780993 too, but to re-iterate: I think pursuing this path is still worth it overall. Let's try to get it supported on platforms that don't yet, and in the meantime, we can fall back to regenerating the whole initramfs.
FCOS will have a static, server-side generated initramfs. This means the initramfs needs to know how to find the root filesystem to mount it, including if that means starting RAID devices. However, we do NOT want to start other RAID devices (unless Ignition is starting them as part of the creation process) since users might be writing out configuration to things like /etc/mdadm.conf that they want to use.

On CL we require that RAIDed devices be partitions with a special GPT partition type GUID and have a udev rule to start only those.
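For context, the CL-style rule boils down to something like the following (the GUID is a placeholder, not the actual Container Linux type GUID; the rule is a sketch, not a copy of the shipped one):

```
# udev rule sketch: incrementally assemble only RAID members whose GPT
# partition type GUID marks them as containing root
ACTION=="add", SUBSYSTEM=="block", ENV{ID_PART_ENTRY_TYPE}=="00000000-0000-0000-0000-000000000000", RUN+="/sbin/mdadm -I $devnode"
```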
For FCOS I can see two options; one of them is a file on /boot which is a "rootmap". This would describe the "path to root", including what services like mdadm need to be started. Ignition could generate this if we include a way of flagging partitions/RAIDs as "containing root", or it could be user-written. If it is user-written, we should unmount root after Ignition files/umount and then let whatever tooling reads it mount the root itself, to ensure the first boot is not "special" in any way.
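To make that option concrete, a rootmap file might look something like this (format entirely hypothetical; it simply mirrors the kargs a traditional system would use, reusing the example UUIDs from earlier in the thread):

```
# /boot/rootmap (hypothetical format)
# devices the initramfs must start, outermost first, to reach root
rd.md.uuid=30b7f702:24b9a929:03434c8d:0ca8559d
rd.luks.uuid=luks-186006cd-e5ee-4e2d-988c-d6eb8dd917a2
root=UUID=54687cc9-18c1-47a7-9fa8-e3c395d50e5f
```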