Closed jmarrero closed 4 days ago
We might want to increase the /boot partition size to 512MB. I vaguely remember that we already talked about that somewhere but I don't remember where.
A few things here.
I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. /usr/lib/coreos/initramfs-firstboot.img - then on firstboot, we mount the real root, add all that stuff to the initramfs, then continue. (To make this work sanely we'd still need the systemd units in the initramfs, we're just dynamically loading the binaries). A notable benefit of such a change would be that subsequent boots should be faster because we need to decompress a much smaller initramfs.
The FCOS & RHCOS kernel + initrd currently sit at 97M (94M for FCOS). With a 384M /boot, that's enough for 3 (kernel, initrd) pairs only, and by default libostree uses all three (staged, booted, and rollback). Even just one additional pinned deployment will blow it up. So I think even if we don't project the kernel or initrd growing by much, we should still grow the bootfs.
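The capacity argument above can be checked with a few lines of arithmetic. This is a rough sketch using only the 384M and 97M figures from the comment; it ignores filesystem overhead and anything else stored on /boot, so the real limit is somewhat lower. The 512M value is the size proposed at the top of the thread.

```python
# Sizes in MiB, taken from the comment above.
PAIR_SIZE_MIB = 97  # one (kernel, initrd) pair


def pairs_that_fit(boot_mib: int, pair_mib: int) -> int:
    """Upper bound on (kernel, initrd) pairs, ignoring filesystem overhead."""
    return boot_mib // pair_mib


for boot_mib in (384, 512):
    count = pairs_that_fit(boot_mib, PAIR_SIZE_MIB)
    print(f"/boot of {boot_mib}M holds at most {count} pairs")
```

With three pairs already in use by default, a fourth (for example a pinned deployment) does not fit in 384M.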
I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. /usr/lib/coreos/initramfs-firstboot.img - then on firstboot, we mount the real root, add all that stuff to the initramfs, then continue. (To make this work sanely we'd still need the systemd units in the initramfs, we're just dynamically loading the binaries). A notable benefit of such a change would be that subsequent boots should be faster because we need to decompress a much smaller initramfs.
Interesting idea. A simpler approach I think is to make it a layered initrd instead in /boot. I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot. We should calculate first how much we'd actually shave off this way.
Another half-baked idea I've had is to make libostree use an OSTree repo in /boot too so we save on duplicate kernels and initrds (though in practice it wouldn't help too much since almost every FCOS/RHCOS update has a kernel update, but it'd help e.g. Silverblue and other faster moving systems).
I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot.
Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.
Another half-baked idea I've had is to make libostree use an OSTree repo in /boot too so we save on duplicate kernels and initrds (though in practice it wouldn't help too much since almost every FCOS/RHCOS update has a kernel update, but it'd help e.g. Silverblue and other faster moving systems).
Right; if the kernel changes, then the initramfs changes. We can hence only share kernels with different initramfs images. In the case where something in the initramfs changed but not the kernel, in theory at max we could shave (3-1)*kernel = ~24MB which isn't bad, but that's still only 6% of the size of the current /boot.
If we figured out how to move parts of the initramfs to /usr, the space savings in /boot multiplies by 3, not 2.
So looking at e.g.:
19M usr/bin/ignition
which is 6.2M gzip'd all on its own.
Another thing we ship is
4.5M usr/bin/afterburn
Which gzip's to 2.0M, so even just splitting out the ignition and afterburn binaries into "dynamically loaded initramfs from /usr" gets us 3*8.2=~24.6M of savings in the much more common case where ignition and afterburn versions remain the same across all 3 deployments.
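As a quick check of that estimate, here is a minimal sketch that sums the gzip'd sizes quoted above and multiplies by the three deployments. The figures come from the comment; the function name is just for illustration.

```python
# gzip'd sizes in MiB, as quoted above.
GZIPPED_MIB = {"ignition": 6.2, "afterburn": 2.0}
DEPLOYMENTS = 3  # staged, booted, rollback


def boot_savings_mib(gzipped: dict, deployments: int) -> float:
    """Space freed in /boot if these binaries leave every initramfs copy."""
    return sum(gzipped.values()) * deployments


print(f"~{boot_savings_mib(GZIPPED_MIB, DEPLOYMENTS):.1f}M saved in /boot")
```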
Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs.
In the end state our initramfs for example doesn't need to physically contain NetworkManager for example either. Or for that matter, kernel network drivers.
A tricky thing here though is we would obviously continue to need to ship a single initramfs image that contains ignition for the PXE case, meaning we'd need to have two initramfs images, but ideally the single one is just a concatenation of the split.
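The "concatenation of the split" idea leans on the fact that compressed streams can simply be placed back to back; as far as I know the kernel's initramfs buffer format accepts a sequence of cpio archives, each optionally compressed. The sketch below only demonstrates the gzip half of that property with stand-in payloads (not real cpio archives), and the helper name is hypothetical.

```python
import gzip


def concatenate_images(*images: bytes) -> bytes:
    """Join already-compressed images back to back, like `cat a.img b.img`."""
    return b"".join(images)


# Stand-in payloads; a real initramfs member would be a cpio archive.
base = gzip.compress(b"base initramfs payload\n")
firstboot = gzip.compress(b"firstboot-only payload\n")

combined = concatenate_images(base, firstboot)

# gzip.decompress() reads every member of a multi-member stream in order.
print(gzip.decompress(combined).decode(), end="")
```

If that holds for the real images, the single PXE image would not need a separate build: it would be the two split images joined together.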
Another thing to investigate is to compress the initrd with zstd. In a quick test:
[root@cosa-devsh tmp]# /usr/lib/dracut/skipcpio /boot/ostree/fedora-coreos-3a5adc56c49a5e3392e03cb3475e92d466680e5a0b0533768365d964bc758843/initramfs-5.18.9-200.fc36.x86_64.img > /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# gunzip -k /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# zstd -19 -T0 /var/tmp/initrd.img -o /var/tmp/initrd.img.zstd
[root@cosa-devsh tmp]# ls -lh /var/tmp/initrd*
-rw-r--r--. 1 root root 140M Jul 7 20:55 /var/tmp/initrd.img
-rw-r--r--. 1 root root 77M Jul 7 20:55 /var/tmp/initrd.img.gz
-rw-r--r--. 1 root root 68M Jul 7 20:51 /var/tmp/initrd.img.zstd
So that's a 9M * 3 = 27M of savings.
Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.
I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot.
Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.
Hmm, if supporting a soft reset is considered in scope then there's also the fact that the binaries would slowly drift from the base version. Recent discussions about factory reset though have been leaning more towards the harder dd kind.
Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs.
In the end state our initramfs for example doesn't need to physically contain NetworkManager for example either. Or for that matter, kernel network drivers.
I think the idea has a lot of merit. My primary concern would be implementation complexity in an already extremely complex initramfs.
Previous discussion on the topic also in https://github.com/coreos/fedora-coreos-tracker/issues/855
Another thing to investigate is to compress the initrd with zstd. In a quick test:
[root@cosa-devsh tmp]# /usr/lib/dracut/skipcpio /boot/ostree/fedora-coreos-3a5adc56c49a5e3392e03cb3475e92d466680e5a0b0533768365d964bc758843/initramfs-5.18.9-200.fc36.x86_64.img > /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# gunzip -k /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# zstd -19 -T0 /var/tmp/initrd.img -o /var/tmp/initrd.img.zstd
[root@cosa-devsh tmp]# ls -lh /var/tmp/initrd*
-rw-r--r--. 1 root root 140M Jul 7 20:55 /var/tmp/initrd.img
-rw-r--r--. 1 root root 77M Jul 7 20:55 /var/tmp/initrd.img.gz
-rw-r--r--. 1 root root 68M Jul 7 20:51 /var/tmp/initrd.img.zstd
So that's a 9M * 3 = 27M of savings.
Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.
We should also consider xz or a "stronger" compression level for zstd as the decompression time may be negligible considering the size of the initramfs.
But even if we temporarily fix this issue this way, I think that it's worth giving us a little bit of room by figuring out a way for new installs to use 512M/1G boot partitions. I don't see the size of the kernel or the initrd going down in the future.
I don't see the size of the kernel or the initrd going down in the future.
This thread above lists a bunch of stuff we can do to shrink the initramfs. A notable benefit of doing so is booting becomes faster.
I like those ideas too and we need them to solve the situation for existing installations. However I don't know how much time we will need to make them happen or how complex they will end up being. Thus if it's reasonably doable to bump the default sizes, I think we should consider it too.
I think we're agreeing here.
I would say though that we should split this into two issues:
Conflating them is not good because again I think we must make existing installs work.
Evaluating bumping the size for the future
It seems like matching the default from Anaconda would make sense here; it's a huge leap from 384M to 1G, but provides plenty of future proofing.
I don't know how that size would play with the non-x86 arches (though a quick look at a RHEL 8.6 ppc64le system shows it is using 1G for /boot)
it's a huge leap from 384M to 1G
yes, it is.
I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.
I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.
Fair. Maybe it becomes a knob that we can implement, where RHCOS can use a larger size for /boot?
I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.
As a counter-argument, Fedora IoT appears to default to 1G for /boot as well
I ran some tests on my system:
Algorithm | initrd size (MiB) | Approx. decompression time at boot (ms) |
---|---|---|
gzip (-9) | 104 | 850 |
xz (-6) | 88 | 3250 |
zstd (-15) | 94 | 330 |
zstd (-19) | 88 | 400 |
The latter options would also require us to ship zstd, which is 1.5 MiB. But overall it seems advisable to switch to zstd on FCOS and RHCOS 9.
(Sadly, it's never that simple. coreos-installer pxe customize assumes it can parse the entire initrd, so it'd need zstd support. There's no pure Rust zstd implementation, but the zstd crate can optionally statically link with its own bundled copy of libzstd. In RHCOS we bundle Rust dependencies but not C ones, so the coreos-installer binary on mirror.openshift.com might need to gain a libzstd dependency.)
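To make the trade-off in the table easier to compare, the following sketch expresses each option relative to the gzip baseline: space saved across three initrd copies in /boot, and the change in decompression time. The numbers are copied from the table; the assumption that all three deployments carry an initrd of the same size is mine.

```python
# (initrd size in MiB, approx. decompression time in ms), from the table above.
RESULTS = {
    "gzip -9": (104, 850),
    "xz -6": (88, 3250),
    "zstd -15": (94, 330),
    "zstd -19": (88, 400),
}


def compare(baseline: str = "gzip -9", copies: int = 3) -> dict:
    """Map each algorithm to (MiB saved across copies, change in ms)."""
    base_size, base_ms = RESULTS[baseline]
    return {
        name: ((base_size - size) * copies, ms - base_ms)
        for name, (size, ms) in RESULTS.items()
    }


for name, (saved_mib, delta_ms) in compare().items():
    print(f"{name:9s} saves {saved_mib:3d} MiB, boot time {delta_ms:+5d} ms")
```

On these numbers zstd -19 matches xz on size while decompressing faster than the gzip baseline.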
We wouldn't necessarily have to bump the bootfs all the way to 1 GB. Even if we doubled it, that'd only be 768 MiB. Yes, we'd need to update the Butane template as well, but that should be a small change and there isn't a strong dependency between those steps in either direction.
I have serious concerns about the maintainability of our initramfs as is. There are maybe only a couple people who fully understand how it works. The PoC in https://github.com/coreos/fedora-coreos-config/pull/1834 is reasonably small, but I wonder if there will be further consequences for maintenance, such as complicating debugging. That might be worthwhile if it gave us a substantial amount of headroom, but the current PoC saves just 7% of the bootfs, and of course it doesn't reduce our shipping weight at all. We can't further improve this by kicking out NetworkManager and network drivers, since we need those for Tang-bound disk encryption. And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in https://github.com/cgwalters/cargo-vendor-filterer/issues/22 seem relevant here?)
Agree that this is two different issues. From the options we have here, it appears that working on zstd compression and increasing the size of /boot (optionally for RHCOS only) are the lowest risk, lowest effort right now and would not be "just a temporary hack".
For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel.
And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in https://github.com/cgwalters/cargo-vendor-filterer/issues/22 seem relevant here?)
I have a rather strong opinion that we must support in-place upgrades for the next (let's say) two years with our current 384M /boot.
Functional automatic upgrades is a core raison d'être for us - it has to work.
The discussion in that issue is really about a situation almost the polar opposite in many ways; cloud storage capacity is...a lot larger.
Now I'd restate that...I think https://github.com/ostreedev/ostree/issues/2670 will tide us over for quite a while. It has some drawbacks, but...eh. I also think in the medium term (~year) that we should invest in thinning out the initramfs, and in parallel consider bumping the default size of /boot.
For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel.
Oh...right. I had not considered this. So to properly handle this we'll need to use the Cincinnati upgrade graph.
A major advantage then of shrinking the initramfs is it will Just Work without requiring any changes on existing systems.
Of the options:
Switching to zstd sounds most promising, but it sucks that it will make `coreos-installer` pick up a non-"pure Rust" dep.
easiest to see the benefits, but doesn't help existing nodes.
As mentioned by others here, I'm worried here about complexity and would prefer we not do this or at least be VERY careful if we do. I'd prefer not modifying what we ship currently, but rather maybe do some postprocessing on the local node to find identical components in the multiple initramfs images and factor them out into a shared layered initrd. This itself would have risks.
This also worries me, but could probably be done safely in a way that minimizes risk and only if we need it.
Another thing to think about here.. We've got a few statically compiled binaries in our initramfs now. Any improvements in the state of the art for golang/rust in reducing the size of the static binaries that we could investigate?
Switching to zstd sounds most promising, but it sucks that it will make `coreos-installer` pick up a non-"pure Rust" dep.
It turns out that RPM requires libzstd on all of Fedora, RHEL 8, and RHEL 9, so this should be a non-issue.
Filed https://github.com/coreos/fedora-coreos-tracker/issues/1251 to add zstd to the distro.
I found an old thread on devel@lists.fp.o where @keszybz mentioned playing around with zstd compressed initrd images - https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/YYSL4WQ6A337UNAR3AYBWKAROJ42SBOO/#KFIAXNVCODGO2ODOZ53OM5MB3PWUQ3TY
Just tagging him for general interest
initrd zstd enablement PRs:
rather maybe do some postprocessing on the local node to find identical components in the multiple initramfs images and factor them out into a shared layered initrd.
I think that would cut against our "image based updates" principle - more local node processing introduces more risk.
But, I think we could take this to the server side - we know we support multiple initramfs images today, so if we:
/boot
Then we'd get that sharing.
Hmm. Splitting off a separate image/layer with just the kernel modules would in and of itself likely help a lot.
I don't know the limits on how many initramfs images we can have.
(This "chunking" discussion of course heavily overlaps with/parallels https://github.com/ostreedev/ostree-rs-ext/issues/69 - just replace "initramfs" with "container image")
It's worth noting that changing the boot partition size can have user impact. Our docs recommend sizing the root partition to 8 GiB. Increasing the boot partition size will shift the root partition down. This can cause data loss for anyone who regularly reprovisions their systems with newer install media expecting data filesystems on partitions after the root partition to be recreated. So we should announce this change well in advance if we do it.
It might also be good to change our docs to stop using size_mib for the root partition and instead recommend start_mib on any following partition with a healthy margin so it's offset-based instead of size-based.
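The difference between the two approaches can be sketched with some offset arithmetic. Everything here is illustrative: the 129 MiB start of the boot partition and the 10240 MiB data offset are assumptions chosen for the example, not documented FCOS values; only the 8 GiB root size and the 384M/512M boot sizes come from the thread.

```python
def layout(boot_size_mib, *, root_size_mib=None, data_start_mib=None,
           boot_start_mib=129):
    """Return offsets (MiB) for root and a data partition that follows it.

    Pass root_size_mib for a size-based config, or data_start_mib for an
    offset-based one. boot_start_mib is an assumed value for illustration.
    """
    if (root_size_mib is None) == (data_start_mib is None):
        raise ValueError("pass exactly one of root_size_mib or data_start_mib")
    root_start = boot_start_mib + boot_size_mib
    if data_start_mib is None:
        data_start = root_start + root_size_mib  # follows wherever root ends
    else:
        data_start = data_start_mib  # pinned, root absorbs the difference
    return {
        "root_start": root_start,
        "root_size": data_start - root_start,
        "data_start": data_start,
    }


for boot_mib in (384, 512):
    print("size-based  ", boot_mib, layout(boot_mib, root_size_mib=8192))
    print("offset-based", boot_mib, layout(boot_mib, data_start_mib=10240))
```

In the size-based case the data partition's start moves by the same amount /boot grows, which is the reprovisioning hazard described above. In the offset-based case the data partition stays put and the root partition shrinks slightly instead.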
I started reading a couple of these threads because I am thinking about how we can do a developer mode/compliance mode (particularly the factory reset part and the removal of protected 3rd party code). I do like the thought of different initrds for different use cases, as we pay for every byte in an initrd at boot time, whether we use it or not.
Could we also consider that 3rd parties may apply non-free software on top of our images, so there may be a need for:
/usr/lib/coreos/initramfs-firstboot-free.img // Vanilla deployment, silverblue, autosd, etc.
/usr/lib/coreos/initramfs-firstboot-non-free.img // the 3rd party stuff included
So we can remove non-free layers from our deployment when someone switches into developer mode and factory reset. And vice versa when we go back into non-developer mode, reapply the 3rd party non-free layers.
Issue for documenting how to manually increase the size of /boot: https://github.com/coreos/fedora-coreos-docs/issues/410.
Example in https://github.com/coreos/fedora-coreos-tracker/issues/1196#issuecomment-1132428498.
We discussed this in the community meeting today.
The options discussed were:
We decided for now to:
13:32:38 dustymabe | #agreed we discussed the different options for solving
| this general problem here today. Right now we don't
| see a clear path forward for changing the /boot/
| partition size without risking data loss while
| re-provisioning systems. We're going to investigate
| the other options and also brainstorm on how we can
| increase the /boot partition in the future. For now
| we'll try to get at least the
| compression mitigation in place and move forward with
| shipping ppc64le.
But we agreed that we want to revisit changing the size of the /boot/ partition soon. To further this conversation we need to enumerate the various scenarios a user could be in such that we can talk about potential solutions and run them through each scenario to verify we don't end up in a data loss situation.
We discussed this in the community meeting today.
The zstd change didn't give us enough space savings to effectively remove our concerns for ppc64le (see update), so we revisited this today to discuss the short-term solutions again. For now we're going to pursue:
- investigating why ppc64le initramfs is larger than say x86_64 or aarch64
- rpm-ostree to opportunistically clean up the rollback deployment if needed

We also discussed writing a test to warn us when our initramfs files go above a certain size, so that we notice potential out-of-space issues early.
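A size check of that kind could look roughly like the sketch below. It is not the test that was eventually written: the function name and the threshold are hypothetical, and it only assumes that initramfs files are named initramfs-*.img somewhere under the directory being scanned, as in the listings in this thread.

```python
from pathlib import Path


def oversized_initramfs(boot_dir, limit_bytes):
    """Return (path, size) for each initramfs-*.img larger than limit_bytes."""
    hits = []
    for path in sorted(Path(boot_dir).rglob("initramfs-*.img")):
        size = path.stat().st_size
        if size > limit_bytes:
            hits.append((path, size))
    return hits


if __name__ == "__main__":
    LIMIT = 100 * 1024 * 1024  # hypothetical 100 MiB warning threshold
    for path, size in oversized_initramfs("/boot", LIMIT):
        print(f"WARNING: {path} is {size / 2**20:.0f} MiB")
```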
Heads up: https://bugzilla.redhat.com/show_bug.cgi?id=2144113#c10 the GRUB binary might grow in size
- investigating why ppc64le initramfs is larger than say x86_64 or aarch64
Done! Apparently it's not really the initramfs that's larger at all. The compressed file is only slightly larger than on say x86_64. It's actually the kernel on ppc64le that is the difference. From 37.20221215.20.0:
x86_64:
$ ls -lh /boot/ostree/fedora-coreos-cbe65104658d968ba9257535af887e57369292356928a2f0aa19de9183ac9e9e/
total 86M
-rw-r--r--. 1 root root 74M Dec 15 22:20 initramfs-6.0.12-300.fc37.x86_64.img
-rwxr-xr-x. 1 root root 13M Dec 15 22:20 vmlinuz-6.0.12-300.fc37.x86_64
ppc64le:
$ ls -lh /boot/ostree/fedora-coreos-6ccef70b6f4af574fc3b2486258f527111f66dee29e9004b5a18fa97332c04c5/
total 113M
-rw-r--r--. 1 root root 70M Dec 15 23:25 initramfs-6.0.12-300.fc37.ppc64le.img
-rwxr-xr-x. 1 root root 43M Dec 15 23:25 vmlinuz-6.0.12-300.fc37.ppc64le
So the kernel is a whole 30M larger on ppc64le.
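To put the per-arch numbers in context, this sketch adds up the file sizes from the two listings and shows how much of a 384M /boot is left once three deployments are present. The sizes are the rounded values printed by ls -lh, so the sums are approximate (the x86_64 pair sums to 87M here against a reported total of 86M), and filesystem overhead is ignored.

```python
BOOT_MIB = 384
DEPLOYMENTS = 3  # staged, booted, rollback

# Rounded sizes in MiB from the `ls -lh` listings above.
SIZES_MIB = {
    "x86_64": {"initramfs": 74, "vmlinuz": 13},
    "ppc64le": {"initramfs": 70, "vmlinuz": 43},
}


def headroom(arch):
    """Return (pair size, space used by all deployments, space left)."""
    pair = sum(SIZES_MIB[arch].values())
    used = pair * DEPLOYMENTS
    return pair, used, BOOT_MIB - used


for arch in SIZES_MIB:
    pair, used, left = headroom(arch)
    print(f"{arch}: pair={pair}M, three deployments={used}M, left={left}M")
```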
So the kernel is a whole 30M larger on ppc64le.
Opened BZ#2154939 as requested by @sharkcz
We discussed this during the community meeting last week.
12:24:58 dustymabe | #action jmarrero to discuss option 2. (oportunistically
pruning older deployments if low on space) with his team
and get back to us
and
12:28:48 dustymabe | #action dustymabe to verify that booting a compressed kernel
on ppc64le works
So @jmarrero and @dustymabe will try to report back this week.
Chatted with @cgwalters about "option 2" from https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1261286872, which is captured in https://github.com/ostreedev/ostree/issues/2670. At this point we are not considering it a viable option, as it is more complex than originally understood. @cgwalters can expand if needed.
Forgot to mention there's ongoing conversation in https://bugzilla.redhat.com/show_bug.cgi?id=2154939
We discussed this today a bit more. One idea that came up is that we could have Zincati do something similar to what the MCO does: prune the rollback before finalizing. We wouldn't want to do this by default, but Zincati could do some rough space calculations and activate this if it detects that /boot wouldn't have enough space. It would trigger on ppc64le but not on other arches.
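The "rough space calculation" could be as simple as comparing free space on /boot with the size of the incoming kernel and initrd. The sketch below is purely illustrative and is not Zincati's or the MCO's actual logic: the function names, the safety margin, and the free-space figures in the example are all hypothetical, while the 113M and 87M pair sizes are derived from the listings earlier in the thread.

```python
import shutil

MIB = 1024 * 1024


def free_mib(path: str) -> float:
    """Free space on the filesystem containing path, in MiB."""
    return shutil.disk_usage(path).free / MIB


def should_prune_rollback(free_boot_mib: float, new_pair_mib: float,
                          margin_mib: float = 10.0) -> bool:
    """True if the new (kernel, initrd) pair would not fit with some margin."""
    return free_boot_mib < new_pair_mib + margin_mib


# Hypothetical free-space figures paired with pair sizes from the thread.
print(should_prune_rollback(free_boot_mib=100, new_pair_mib=113))
print(should_prune_rollback(free_boot_mib=210, new_pair_mib=87))
```

On a real node the first argument would come from something like free_mib("/boot"), evaluated just before finalization.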
reporting back on https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1407957924
I think the ultimate conclusion was that there wasn't a clear path to us safely shipping a compressed ppc64le kernel. See BZ discussion.
We discussed this in the community meeting today.
12:57:53* dustymabe | #agreed It's becoming obvious that our
| limited /boot partition sizing is causing
| us increased issues over time. In the long
| term we will work towards increasing the
| /boot/ partition size. We will open a
| dedicated ticket for that work.
I opened https://github.com/coreos/fedora-coreos-tracker/issues/1465 to handle our long term "increase /boot partition size" goal.
For our short term goal (and also the long term goal for updating systems where we won't increase the size of /boot) we'll rely on OSTree autopruning.
We'll let this ticket track the OSTree autopruning work and will thus be closed out when that lands.
The PR for OSTree autopruning is https://github.com/ostreedev/ostree/pull/2847
In https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1505964476, we agreed to have this ticket track the auto-pruning work and let https://github.com/coreos/fedora-coreos-tracker/issues/1465 track the longer-term goal of increasing the boot partition. Auto-pruning has long landed now, so let's close this.
We have a couple of cases where users hit a
error: Installing kernel: regfile copy: No space left on device
when installing a new kernel. See: https://discussion.fedoraproject.org/t/installing-custom-kernel-on-fedora-coreos/39138 & https://bugzilla.redhat.com/show_bug.cgi?id=2104619
There are cases where the user can't easily resize /boot. Should we document/require a larger space for /boot?
Maybe a new troubleshooting entry?
Is there anything we should be doing on rpm-ostree to deal with this?