Boot partition can easily run out of space on upgrade

jmarrero commented 2 years ago

We have a couple of cases where users hit "error: Installing kernel: regfile copy: No space left on device" when installing a new kernel. See https://discussion.fedoraproject.org/t/installing-custom-kernel-on-fedora-coreos/39138 and https://bugzilla.redhat.com/show_bug.cgi?id=2104619

There are cases where the user can't easily resize /boot. Should we document/require a larger space for /boot?

Maybe a new troubleshooting entry?

Is there anything we should be doing on rpm-ostree to deal with this?

travier commented 2 years ago

We might want to increase the /boot partition size to 512MB. I vaguely remember that we already talked about that somewhere but I don't remember where.

travier commented 2 years ago

https://github.com/coreos/coreos-assembler/blob/9d55a20a12f54f284ece838e2ab57c703f18a20c/src/create_disk.sh#L198

jmarrero commented 2 years ago

https://github.com/coreos/coreos-assembler/pull/2965

cgwalters commented 2 years ago

A few things here.

Bumping the size isn't going to help anyone with an existing provisioned system; only newly provisioned systems will get the change.
One thing we can and probably should do on the ostree side is detect this case and proactively remove the rollback deployment at least - that on its own would probably get us out of the immediate problem
It may still be the case that we should bump to 512M but I'd like to see at least a little bit of analysis of the size of the kernel/initramfs - did they actually grow, and by how much and what's there? IOW, is the rate of growth high enough that a year from now 512M will seem small and we're in the same boat again? (Related to this, AIUI Anaconda today uses 1G for /boot for this reason)

I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. /usr/lib/coreos/initramfs-firstboot.img - then on firstboot, we mount the real root, add all that stuff to the initramfs, then continue. (To make this work sanely we'd still need the systemd units in the initramfs, we're just dynamically loading the binaries). A notable benefit of such a change would be that subsequent boots should be faster because we need to decompress a much smaller initramfs.

jlebon commented 2 years ago

The FCOS & RHCOS kernel + initrd currently sit at 97M (94M for FCOS). With a 384M /boot, that's enough for only 3 (kernel, initrd) pairs, and by default libostree uses all three (staged, booted, and rollback). Even just one additional pinned deployment will blow it up. So I think even if we don't project the kernel or initrd growing by much, we should still grow the bootfs.
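
As a rough illustration of that arithmetic, here is a minimal sketch. It assumes the 97M per-pair figure quoted above and ignores filesystem overhead, GRUB files, and anything else that lives on /boot; the function name is hypothetical.

```python
def pairs_that_fit(boot_mib: int, pair_mib: int) -> int:
    """Return how many (kernel, initrd) pairs fit, ignoring filesystem overhead."""
    return boot_mib // pair_mib


BOOT_MIB = 384           # current /boot size
PAIR_MIB = 97            # kernel + initrd size quoted above
DEFAULT_DEPLOYMENTS = 3  # staged + booted + rollback

fits = pairs_that_fit(BOOT_MIB, PAIR_MIB)
print(fits)                             # 3
print(fits >= DEFAULT_DEPLOYMENTS + 1)  # False: one extra pinned deployment does not fit
print(pairs_that_fit(512, PAIR_MIB))    # 5
```

Under the same assumptions, a 512M /boot would leave room for two more pairs than the default three.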

jlebon commented 2 years ago

I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. /usr/lib/coreos/initramfs-firstboot.img - then on firstboot, we mount the real root, add all that stuff to the initramfs, then continue. (To make this work sanely we'd still need the systemd units in the initramfs, we're just dynamically loading the binaries). A notable benefit of such a change would be that subsequent boots should be faster because we need to decompress a much smaller initramfs.

Interesting idea. A simpler approach I think is to make it a layered initrd instead in /boot. I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot. We should calculate first how much we'd actually shave off this way.

Another half-baked idea I've had is to make libostree use an OSTree repo in /boot too so we save on duplicate kernels and initrds (though in practice it wouldn't help too much since almost every FCOS/RHCOS update has a kernel update, but it'd help e.g. Silverblue and other faster moving systems).

cgwalters commented 2 years ago

I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot.

Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.

Another half-baked idea I've had is to make libostree use an OSTree repo in /boot too so we save on duplicate kernels and initrds (though in practice it wouldn't help too much since almost every FCOS/RHCOS update has a kernel update, but it'd help e.g. Silverblue and other faster moving systems).

Right; if the kernel changes, then the initramfs changes. We can hence only share kernels with different initramfs images. In the case where something in the initramfs changed but not the kernel, in theory we could shave at most (3-1)*kernel = ~24MB, which isn't bad but is still only ~6% of the size of the current /boot.

If we figured out how to move parts of the initramfs to /usr the space savings in /boot multiplies by 3, not 2.

So looking at e.g.:

19M usr/bin/ignition

which is 6.2M gzip'd all on its own.

Another thing we ship is

4.5M    usr/bin/afterburn

Which gzip's to 2.0M, so even just splitting out the ignition and afterburn binaries into "dynamically loaded initramfs from /usr" gets us 3*8.2=~24.6M of savings in the much more common case where ignition and afterburn versions remain the same across all 3 deployments.
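
The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic: it takes the gzip'd sizes quoted above at face value, assumes three deployments with identical ignition and afterburn versions, and uses a hypothetical helper name.

```python
# gzip'd sizes in M, as quoted above
GZIPPED_M = {"ignition": 6.2, "afterburn": 2.0}
DEPLOYMENTS = 3  # each deployment currently carries its own initramfs copy
BOOT_M = 384     # current /boot size


def boot_savings_m(gzipped_sizes, deployments):
    """Space freed in /boot if these binaries no longer live in any initramfs."""
    return sum(gzipped_sizes.values()) * deployments


savings = boot_savings_m(GZIPPED_M, DEPLOYMENTS)
print(f"{savings:.1f}M saved, {savings / BOOT_M:.1%} of /boot")
```

With these inputs the saving comes to about 24.6M, a bit over 6% of a 384M /boot.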

Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs.

In the end state our initramfs for example doesn't need to physically contain NetworkManager for example either. Or for that matter, kernel network drivers.

cgwalters commented 2 years ago

A tricky thing here though is we would obviously continue to need to ship a single initramfs image that contains ignition for the PXE case, meaning we'd need to have two initramfs images, but ideally the single one is just a concatenation of the split.

jlebon commented 2 years ago

Another thing to investigate is to compress the initrd with zstd. In a quick test:

[root@cosa-devsh tmp]# /usr/lib/dracut/skipcpio /boot/ostree/fedora-coreos-3a5adc56c49a5e3392e03cb3475e92d466680e5a0b0533768365d964bc758843/initramfs-5.18.9-200.fc36.x86_64.img > /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# gunzip -k /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# zstd -19 -T0 /var/tmp/initrd.img -o /var/tmp/initrd.img.zstd
[root@cosa-devsh tmp]# ls -lh /var/tmp/initrd*
-rw-r--r--. 1 root root 140M Jul  7 20:55 /var/tmp/initrd.img
-rw-r--r--. 1 root root  77M Jul  7 20:55 /var/tmp/initrd.img.gz
-rw-r--r--. 1 root root  68M Jul  7 20:51 /var/tmp/initrd.img.zstd

So that's a 9M * 3 = 27M of savings.

Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.

I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot.

Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.

Hmm, if supporting a soft reset is considered in scope then there's also the fact that the binaries would slowly drift from the base version. Recent discussions about factory reset though have been leaning more towards the harder dd kind.

Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs.

In the end state our initramfs for example doesn't need to physically contain NetworkManager for example either. Or for that matter, kernel network drivers.

I think the idea has a lot of merit. My primary concern would be implementation complexity in an already extremely complex initramfs.

travier commented 2 years ago

Previous discussion on the topic also in https://github.com/coreos/fedora-coreos-tracker/issues/855

travier commented 2 years ago

Another thing to investigate is to compress the initrd with zstd. In a quick test:
[root@cosa-devsh tmp]# /usr/lib/dracut/skipcpio /boot/ostree/fedora-coreos-3a5adc56c49a5e3392e03cb3475e92d466680e5a0b0533768365d964bc758843/initramfs-5.18.9-200.fc36.x86_64.img > /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# gunzip -k /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# zstd -19 -T0 /var/tmp/initrd.img -o /var/tmp/initrd.img.zstd
[root@cosa-devsh tmp]# ls -lh /var/tmp/initrd*
-rw-r--r--. 1 root root 140M Jul  7 20:55 /var/tmp/initrd.img
-rw-r--r--. 1 root root  77M Jul  7 20:55 /var/tmp/initrd.img.gz
-rw-r--r--. 1 root root  68M Jul  7 20:51 /var/tmp/initrd.img.zstd
So that's a 9M * 3 = 27M of savings.

Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.

We should also consider xz or a "stronger" compression level for zstd as the decompression time may be negligible considering the size of the initramfs.

travier commented 2 years ago

But even if we temporarily fix this issue this way, I think that it's worth giving us a little bit of room by figuring out a way for new installs to use 512M/1G boot partitions. I don't see the size of the kernel or the initrd going down in the future.

cgwalters commented 2 years ago

I don't see the size of the kernel or the initrd going down in the future.

This thread above lists a bunch of stuff we can do to shrink the initramfs. A notable benefit of doing so is booting becomes faster.

travier commented 2 years ago

I like those ideas too and we need them to solve the situation for existing installations. However, I don't know how much time we will need to make them happen or how complex they will end up being. Thus, if it's reasonably doable to bump the default sizes, I think we should consider it too.

cgwalters commented 2 years ago

I think we're agreeing here.

I would say though that we should split this into two issues:

Approaches to fix the current issue at hand for existing systems (which is likely https://github.com/ostreedev/ostree/issues/2670 to start, but we can hack around it with https://github.com/openshift/machine-config-operator/pull/2302#issuecomment-1179091466 )
Evaluating bumping the size for the future

Conflating them is not good because again I think we must make existing installs work.

miabbott commented 2 years ago

Evaluating bumping the size for the future

It seems like matching the default from Anaconda would make sense here; it's a huge leap from 384M to 1G, but provides plenty of future proofing.

I don't know how that size would play with the non-x86 arches (though a quick look at a RHEL 8.6 ppc64le system shows it is using 1G for /boot)

dustymabe commented 2 years ago

it's a huge leap from 384M to 1G

yes, it is.

I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.

miabbott commented 2 years ago

I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.

Fair. Maybe it becomes a knob that we can implement, where RHCOS can use a larger size for /boot?

miabbott commented 2 years ago

I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.

As a counter-argument, Fedora IoT appears to default to 1G for /boot as well

cgwalters commented 2 years ago

PoC in https://github.com/coreos/fedora-coreos-config/pull/1834

bgilbert commented 2 years ago

initrd compression

I ran some tests on my system:

Algorithm	initrd size (MiB)	Approx. decompression time at boot (ms)
gzip (-9)	104	850
xz (-6)	88	3250
zstd (-15)	94	330
zstd (-19)	88	400

The latter options would also require us to ship zstd, which is 1.5 MiB. But overall it seems advisable to switch to zstd on FCOS and RHCOS 9.

(Sadly, it's never that simple. coreos-installer pxe customize assumes it can parse the entire initrd, so it'd need zstd support. There's no pure Rust zstd implementation, but the zstd crate can optionally statically link with its own bundled copy of libzstd. In RHCOS we bundle Rust dependencies but not C ones, so the coreos-installer binary on mirror.openshift.com might need to gain a libzstd dependency.)
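
To make the trade-off in the table easier to compare, here is a small sketch that expresses each option relative to the gzip baseline. It uses only the numbers measured above on one system, and it assumes three deployments that each carry their own initrd; the function name is hypothetical.

```python
# (algorithm, initrd size in MiB, approx. decompression time at boot in ms)
RESULTS = [
    ("gzip -9", 104, 850),
    ("xz -6", 88, 3250),
    ("zstd -15", 94, 330),
    ("zstd -19", 88, 400),
]
DEPLOYMENTS = 3  # assumption: staged + booted + rollback


def compare_to_baseline(results, baseline="gzip -9", deployments=DEPLOYMENTS):
    """Map algorithm -> (MiB saved across all deployments, change in boot ms)."""
    by_name = {name: (size, ms) for name, size, ms in results}
    base_size, base_ms = by_name[baseline]
    return {
        name: ((base_size - size) * deployments, ms - base_ms)
        for name, (size, ms) in by_name.items()
    }


for name, (saved, delta_ms) in compare_to_baseline(RESULTS).items():
    print(f"{name:9} saves {saved:3d} MiB in /boot, boot time {delta_ms:+5d} ms")
```

On these figures, zstd -19 and xz -6 save the same amount of space, but only zstd also shortens decompression time.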

bootfs size

We wouldn't necessarily have to bump the bootfs all the way to 1 GB. Even if we doubled it, that'd only be 768 MiB. Yes, we'd need to update the Butane template as well, but that should be a small change and there isn't a strong dependency between those steps in either direction.

Moving Ignition out of the initrd

I have serious concerns about the maintainability of our initramfs as is. There are maybe only a couple people who fully understand how it works. The PoC in https://github.com/coreos/fedora-coreos-config/pull/1834 is reasonably small, but I wonder if there will be further consequences for maintenance, such as complicating debugging. That might be worthwhile if it gave us a substantial amount of headroom, but the current PoC saves just 7% of the bootfs, and of course it doesn't reduce our shipping weight at all. We can't further improve this by kicking out NetworkManager and network drivers, since we need those for Tang-bound disk encryption. And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in https://github.com/cgwalters/cargo-vendor-filterer/issues/22 seem relevant here?)

travier commented 1 year ago

Agree that this is two different issues. From the options we have here, it appears that working on zstd compression and increasing the size of /boot (optionally for RHCOS only) are the lowest risk, lowest effort right now and would not be "just a temporary hack".

For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel.

cgwalters commented 1 year ago

And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in https://github.com/cgwalters/cargo-vendor-filterer/issues/22 seem relevant here?)

I have a rather strong opinion that we must support in-place upgrades for the next (let's say) two years with our current 384M /boot.

Functional automatic upgrades is a core raison d'être for us - it has to work.

The discussion in that issue is really about a situation almost the polar opposite in many ways; cloud storage capacity is...a lot larger.

Now I'd restate that...I think https://github.com/ostreedev/ostree/issues/2670 will tide us over for quite a while. It has some drawbacks, but...eh. I also think in the medium term (~year) that we should invest in thinning out the initramfs, and in parallel consider bumping the default size of /boot.

cgwalters commented 1 year ago

For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel.

Oh...right. I had not considered this. So to properly handle this we'll need to use the Cincinnati upgrade graph.

A major advantage then of shrinking the initramfs is it will Just Work without requiring any changes on existing systems.

dustymabe commented 1 year ago

Of the options:

initrd compression

Switching to zstd sounds most promising, but it sucks that it will make `coreos-installer` pick up a non-"pure Rust" dep.

bootfs size

Easiest to see the benefits, but doesn't help existing nodes.

moving Ignition out of the initrd

As mentioned by others here, I'm worried here about complexity and would prefer we not do this or at least be VERY careful if we do. I'd prefer not modifying what we ship currently, but rather maybe do some postprocessing on the local node to find identical components in the multiple initramfs images and factor them out into a shared layered initrd. This itself would have risks.

modifying ostree to aggressively delete rollback deployment

This also worries me, but could probably be done safely in a way that minimizes risk and only if we need it.

dustymabe commented 1 year ago

Another thing to think about here.. We've got a few statically compiled binaries in our initramfs now. Any improvements in the state of the art for golang/rust in reducing the size of the static binaries that we could investigate?

bgilbert commented 1 year ago

Switching to zstd sounds most promising, but it sucks that it will make `coreos-installer` pick up a non-"pure Rust" dep.

It turns out that RPM requires libzstd on all of Fedora, RHEL 8, and RHEL 9, so this should be a non-issue.

bgilbert commented 1 year ago

Filed https://github.com/coreos/fedora-coreos-tracker/issues/1251 to add zstd to the distro.

miabbott commented 1 year ago

I found an old thread on devel@lists.fp.o where @keszybz mentioned playing around with zstd compressed initrd images - https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/YYSL4WQ6A337UNAR3AYBWKAROJ42SBOO/#KFIAXNVCODGO2ODOZ53OM5MB3PWUQ3TY

Just tagging him for general interest

bgilbert commented 1 year ago

initrd zstd enablement PRs:

cgwalters commented 1 year ago

rather maybe do some postprocessing on the local node to find identical components in the multiple initramfs images and factor them out into a shared layered initrd.

I think that would cut against our "image based updates" principle - more local node processing introduces more risk.

But, I think we could take this to the server side - we know we support multiple initramfs images today, so if we:

taught ostree to use content addressed storage/dedup for stuff in /boot
change our build process to reproducibly "chunk" the initramfs (perhaps to start by just splitting off a few large binaries)

Then we'd get that sharing.

Hmm. Splitting off a separate image/layer with just the kernel modules would in and of itself likely help a lot.

I don't know the limits on how many initramfs images we can have.

(This "chunking" discussion of course heavily overlaps with/parallels https://github.com/ostreedev/ostree-rs-ext/issues/69 - just replace "initramfs" with "container image")

jlebon commented 1 year ago

It's worth noting that changing the boot partition size can have user impact. Our docs recommend sizing the root partition to 8 GiB. Increasing the boot partition size will shift the root partition down. This can cause data loss for anyone who regularly reprovisions their systems with newer install media expecting data filesystems on partitions after the root partition to be recreated. So we should announce this change well in advance if we do it.

It might also be good to change our docs to stop using size_mib for the root partition and instead recommend start_mib on any following partition with a healthy margin so it's offset-based instead of size-based.
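
A small sketch of why size-based layouts are fragile here. The sizes of the partitions ahead of /boot (1 MiB and 127 MiB) and the 10240 MiB data offset are assumptions for illustration only, and the function names are hypothetical; check the real layout on your own disk.

```python
# Assumed sizes in MiB for the partitions ahead of /boot; check your own disk.
BIOS_MIB, EFI_MIB = 1, 127
FIRST_START_MIB = 1


def root_start(boot_mib):
    """Start offset of the root partition for a given /boot size."""
    return FIRST_START_MIB + BIOS_MIB + EFI_MIB + boot_mib


def data_start_size_based(boot_mib, root_size_mib=8192):
    """Data partition placed right after a root partition of fixed size."""
    return root_start(boot_mib) + root_size_mib


def root_size_offset_based(boot_mib, data_start_mib):
    """Root fills the gap up to a data partition pinned at data_start_mib."""
    return data_start_mib - root_start(boot_mib)


for boot in (384, 512):
    print(boot, data_start_size_based(boot), root_size_offset_based(boot, 10240))
```

With a fixed root size, growing /boot from 384 to 512 MiB moves the start of the data partition by 128 MiB, so an existing filesystem would no longer be found where the new partition table says it is. With a pinned start offset, only the root partition's size changes.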

ericcurtin commented 1 year ago

I started reading a couple of these threads because I am thinking about how we can do a developer mode/compliance mode (particularly the factory reset part and the removal of protected 3rd party code). I do like the thought of different initrds for different use cases, as we pay for every byte in an initrd at boot time, whether we use it or not.

Could we also consider that 3rd parties may apply non-free software on top of our images, so there may be a need for:

/usr/lib/coreos/initramfs-firstboot-free.img // Vanilla deployment, silverblue, autosd, etc.

/usr/lib/coreos/initramfs-firstboot-non-free.img // the 3rd party stuff included

So we can remove non-free layers from our deployment when someone switches into developer mode and factory reset. And vice versa when we go back into non-developer mode, reapply the 3rd party non-free layers.

travier commented 1 year ago

Issue for documenting how to manually increase the size of /boot: https://github.com/coreos/fedora-coreos-docs/issues/410. Example in https://github.com/coreos/fedora-coreos-tracker/issues/1196#issuecomment-1132428498.

dustymabe commented 1 year ago

We discussed this in the community meeting today.

The options discussed were:

change compression algorithm
- underway in https://github.com/coreos/fedora-coreos-config/pull/1844
change rpm-ostree behavior to opportunistically cleanup rollback deployment if needed
- see https://github.com/ostreedev/ostree/issues/2670
change the size of the /boot/ partition
split initramfs/rootfs binaries
- see https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1178166026

We decided for now to:

13:32:38  dustymabe | #agreed we discussed the different options for solving
                    | this general problem here today. Right now we don't
                    | see a clear path forward for changing the /boot/
                    | partition size without risking data loss while
                    | re-provisioning systems. We're going to investigate
                    | the other options and also brainstorm on how we can
                    | increase the /boot partition in the future. For now   
                    | we'll try to get at least the
                    | compression mitigation in place and move forward with
                    | shipping ppc64le.

But we agreed that we want to revisit changing the size of the /boot/ partition soon. To further this conversation we need to enumerate the various scenarios a user could be in such that we can talk about potential solutions and run them through each scenario to verify we don't end up in a data loss situation.

dustymabe commented 1 year ago

We discussed this in the community meeting today.

The zstd change didn't give us enough space savings to effectively remove our concerns for ppc64le (see update) so we revisited this today to discuss again the short term solutions. For now we're going to pursue:

investigating why ppc64le initramfs is larger than say x86_64 or aarch64
prioritize working on changing rpm-ostree to opportunistically cleanup rollback deployment if needed
brainstorming other low-complexity options for generically reducing the size of the initramfs

We also discussed writing a test to warn us when our initramfs files go above a certain size so as to warn us of potential out of space issues here.
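
One possible shape for such a check, as a minimal sketch only. It assumes the `/boot/ostree/<deployment>/initramfs-*.img` layout seen elsewhere in this thread; the function name and any threshold passed to it are hypothetical, not an agreed value.

```python
from pathlib import Path

MIB = 1024 * 1024


def oversized_initramfs(boot_ostree_dir, limit_mib):
    """Return [(path, size_mib)] for initramfs images larger than limit_mib."""
    hits = []
    for img in sorted(Path(boot_ostree_dir).glob("*/initramfs-*.img")):
        size = img.stat().st_size
        if size > limit_mib * MIB:
            hits.append((str(img), size / MIB))
    return hits
```

A CI job could call something like `oversized_initramfs("/boot/ostree", limit_mib=80)` on a freshly built image and emit a warning when the returned list is non-empty.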

travier commented 1 year ago

Heads up: https://bugzilla.redhat.com/show_bug.cgi?id=2144113#c10 the GRUB binary might grow in size

dustymabe commented 1 year ago

investigating why ppc64le initramfs is larger than say x86_64 or aarch64

Done! Apparently it's not really the initramfs that's larger at all. The compressed file is only slightly larger than on say x86_64. It's actually the kernel on ppc64le that is the difference. From 37.20221215.20.0:

x86_64:

$ ls -lh /boot/ostree/fedora-coreos-cbe65104658d968ba9257535af887e57369292356928a2f0aa19de9183ac9e9e/
total 86M
-rw-r--r--. 1 root root 74M Dec 15 22:20 initramfs-6.0.12-300.fc37.x86_64.img
-rwxr-xr-x. 1 root root 13M Dec 15 22:20 vmlinuz-6.0.12-300.fc37.x86_64

ppc64le:

$ ls -lh /boot/ostree/fedora-coreos-6ccef70b6f4af574fc3b2486258f527111f66dee29e9004b5a18fa97332c04c5/
total 113M
-rw-r--r--. 1 root root 70M Dec 15 23:25 initramfs-6.0.12-300.fc37.ppc64le.img
-rwxr-xr-x. 1 root root 43M Dec 15 23:25 vmlinuz-6.0.12-300.fc37.ppc64le

So the kernel is a whole 30M larger on ppc64le.

dustymabe commented 1 year ago

So the kernel is a whole 30M larger on ppc64le.

Opened BZ#2154939 as requested by @sharkcz

dustymabe commented 1 year ago

We discussed this during the community meeting last week.

12:24:58  dustymabe | #action jmarrero to discuss option 2. (opportunistically 
                      pruning older deployments if low on space) with his team
                      and get back to us

and

12:28:48  dustymabe | #action dustymabe to verify that booting a compressed kernel
                      on ppc64le works

So @jmarrero and @dustymabe will try to report back this week.

jmarrero commented 1 year ago

Chatted with @cgwalters about "option 2" from https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1261286872, which is captured in https://github.com/ostreedev/ostree/issues/2670. At this point we are not considering it a viable option, as it is more complex than originally understood. @cgwalters can expand if needed.

cgwalters commented 1 year ago

Forgot to mention there's ongoing conversation in https://bugzilla.redhat.com/show_bug.cgi?id=2154939

jlebon commented 1 year ago

We discussed this today a bit more. One idea that came up is that we could have Zincati do something similar to what the MCO does: prune the rollback before finalizing. We wouldn't want to do this by default, but Zincati could do some rough space calculations and activate this if it detects that /boot wouldn't have enough space. It would trigger on ppc64le but not on other arches.
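
The "rough space calculation" could look something like the sketch below. This is not Zincati code (Zincati is written in Rust); the function names, the margin parameter, and the example figures are all hypothetical and only illustrate the decision being described.

```python
import shutil

MIB = 1024 * 1024


def boot_free_bytes(boot_path="/boot"):
    """Free space currently available on the filesystem holding boot_path."""
    return shutil.disk_usage(boot_path).free


def should_prune_rollback(free_bytes, new_pair_bytes, margin_bytes=0):
    """True if the new (kernel, initrd) pair would not fit in the free space."""
    return free_bytes < new_pair_bytes + margin_bytes


# Made-up free-space figure, with the 113M ppc64le pair size reported earlier.
print(should_prune_rollback(free_bytes=100 * MIB, new_pair_bytes=113 * MIB))  # True
print(should_prune_rollback(free_bytes=200 * MIB, new_pair_bytes=113 * MIB))  # False
```

In a real agent, the first argument would come from something like `boot_free_bytes()`, and the size of the incoming pair would have to be obtained from the update itself.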

dustymabe commented 1 year ago

reporting back on https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1407957924

I think the ultimate conclusion was that there wasn't a clear path to us safely shipping a compressed ppc64le kernel. See BZ discussion.

dustymabe commented 1 year ago

We discussed this in the community meeting today.

12:57:53*  dustymabe | #agreed It's becoming obvious that our
                     | limited /boot partition sizing is causing 
                     | us increased issues over time. In the long
                     | term we will work towards increasing the
                     | /boot/ partition size. We will open a  
                     | dedicated ticket for that work.

dustymabe commented 1 year ago

I opened https://github.com/coreos/fedora-coreos-tracker/issues/1465 to handle our long term "increase /boot partition size" goal.

For our short term goal (and also the long term goal for updating systems where we won't increase the size of /boot) we'll rely on OSTree autopruning for that.

We'll let this ticket track the OSTree autopruning work, and it will thus be closed out when that lands.

dustymabe commented 1 year ago

The PR for OSTree autopruning is https://github.com/ostreedev/ostree/pull/2847

jlebon commented 4 days ago

In https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1505964476, we agreed to have this ticket track the auto-pruning work and let https://github.com/coreos/fedora-coreos-tracker/issues/1465 track the longer-term work of increasing the boot partition. Auto-pruning has long since landed, so let's close this.
