I mentioned this in matrix, but I'll say it again here: I'm surprised it's been working this whole time (I certainly never test on a machine that small).
Did you happen to find which version in the `testing-devel` stream was the first to stop working?
FWIW I just tried booting a 512M qemu qcow image (`fedora-coreos-38.20230722.3.0-qemu.x86_64.qcow2`) and it booted fine:
```
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Thu 2023-08-10 21:16:25 UTC)
Deployments:
● fedora:fedora/x86_64/coreos/stable
        Version: 38.20230722.3.0 (2023-08-07T18:56:37Z)
         Commit: bf28f852e934b0c0b9eee232a58970e96adb3e691299b02376f8719530e03fb3
   GPGSignature: Valid signature by 6A51BBABBA3D5467B6171221809A8D7CEB10B464
[core@cosa-devsh ~]$
[core@cosa-devsh ~]$ free -m
               total        used        free      shared  buff/cache   available
Mem:             441         152         114           2         173         275
Swap:              0           0           0
```
> Did you happen to find which version in the `testing-devel` stream was the first to stop working?

Yeah, it's `38.20230722.3.0 x86_64` exactly. The previous version works; I did the back and forth. EDIT: OK, I'll check the exact `testing-devel` version.
> FWIW I just tried booting a 512M qemu qcow image (`fedora-coreos-38.20230722.3.0-qemu.x86_64.qcow2`) and it booted fine:
Interesting. Maybe assuming it’s a RAM issue was the wrong idea?
For sure: I can consistently make the boot fail with a `t3a.nano`, while consistently making it work with a `t3a.micro` (notably, the only difference listed by AWS is the RAM amount). I'll run more tests.
> I mentioned this in matrix, but I'll say it again here: I'm surprised it's been working this whole time (I certainly never test on a machine that small).
Yes! I just want to spark a broader discussion, because it's a usage we have that does not work anymore, as well as the other "things to consider" I mentioned.
> Did you happen to find which version in the `testing-devel` stream was the first to stop working?

Yeah, it's `38.20230722.3.0 x86_64` exactly. The previous version works; I did the back and forth.
The `testing-devel` stream is our development stream where we have many builds a week. The artifacts (or in this case the reference to the AMI ID) can be picked up from the unofficial builds browser. The `testing-devel` build numbers look like `XX.YYYYYYYY.20.Z`. When you say `38.20230722.3.0`, it appears you tested a `stable` stream build and not a `testing-devel` stream build.
Here are my findings:

- fedora-coreos-38.20230712.20.0-x86_64 (ami-04b897868a2b0c657): OK
- fedora-coreos-38.20230712.20.1-x86_64 (ami-0e4d81169b2565fcc): FAIL

Just to confirm, it continued to fail for another testing version along the way:

- fedora-coreos-38.20230714.20.0-x86_64 (ami-01424d4ce6c912ceb): FAIL
The difference there was: `ignition 2.15.0-3.fc38.x86_64 → 2.16.2-1.fc38.x86_64`
So maybe the size increased a lot for Ignition. You can investigate further by grabbing the RPMs from koji.
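Following that suggestion, here is a rough sketch of what the comparison could look like with the `koji` CLI, using the two NVRs listed above. This is only an illustration (it degrades to a note where `koji` isn't installed):

```shell
# Fetch both Ignition builds from Koji and list the payload size of the
# shipped binary without installing anything.
if command -v koji >/dev/null 2>&1; then
  koji download-build --arch=x86_64 ignition-2.15.0-3.fc38
  koji download-build --arch=x86_64 ignition-2.16.2-1.fc38
  for rpm in ignition-*.x86_64.rpm; do
    echo "== $rpm =="
    # cpio -tv prints a long listing, including file sizes
    rpm2cpio "$rpm" | cpio -tv 2>/dev/null | grep 'ignition$'
  done
else
  echo "koji CLI not installed; skipping" > koji-note.txt
  cat koji-note.txt
fi
```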
~~What's the disk size for that instance? Could it be https://github.com/coreos/fedora-coreos-tracker/issues/1535?~~
Apparently that's another issue.
We discussed this topic in the community meeting today.
It was pointed out that Fedora does have some documentation on minimum system requirements here. That guidance currently recommends 2G+.
While it would be nice if 512M continued to work, I don't think it's worth us spending time on it. You could use the Fedora Cloud image, but I don't even think that would work, as `dnf` chews through a decent amount of memory when downloading repo metadata.
Even with all of that said, if someone were to find the root cause of the change in behavior and propose a patch it would be considered.
Too bad I couldn’t join the meeting this morning. :(
> It was pointed out that Fedora does have some documentation on minimum system requirements here. That guidance currently recommends 2G+.
I'd like to point out this is misguided. Recommending 2G+ on a user system is a low bar nowadays; that documentation even emphasizes that GUI desktops and services tend to consume a lot.

On the other side, requiring 2GB+ in any cloud environment is unreasonable. In the pursuit of very high availability, engineers in the field tend to scale horizontally rather than vertically, meaning they actually seek low-spec machines but prefer having many of them. I mean it: a big portion of the work is actually making sure that a service can run on the smallest spec possible: very small, very low-footprint containers.

If Fedora CoreOS chooses a 2GB+ RAM minimum, I believe it consequently becomes a bad choice for cloud computing. Imagine if the smallest machine in any horizontally scaled system had to be 2GB minimum: that would be a waste of energy and money.

I'm aware container orchestrators add another layer to circumvent that issue, but still: orchestrators themselves need services outside of them to work properly: key-value stores, secret vaults, VPNs, service meshes…

This is especially a big problem for me because of the lack of guarantee. While I understand FCOS might run fine on a 1GB machine (or lower, if the original "problem" of this ticket is ever "fixed"), deciding on a 2GB minimum spec means I would receive no help or support if I ever hit a problem related to FCOS RAM consumption on a machine under 2GB. (Just to clarify: I do not expect anyone to solve the problem, but there is a difference between recognizing there is a problem at all versus "no problem here".)
> Even with all of that said, if someone were to find the root cause of the change in behavior and propose a patch it would be considered.
Sadly I cannot offer much in terms of debugging besides testing on AWS.
> Too bad I couldn’t join the meeting this morning. :(
Come join us same time next week.
Like Dusty did, I tried running a QEMU image with 512MB memory and it booted:
```
$ cosa run --qemu-image fedora-coreos-38.20230806.1.0-qemu.x86_64.qcow2 --memory 512
[core@cosa-devsh ~]$ free -m
               total        used        free      shared  buff/cache   available
Mem:             442         174          90           2         177         254
Swap:              0           0           0
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Thu 2023-08-17 10:31:03 UTC)
Deployments:
● fedora:fedora/x86_64/coreos/next
        Version: 38.20230806.1.0 (2023-08-07T18:56:40Z)
         Commit: ec10f2df99e1bfd4621022f5d11950cea5395c867ce3e9a4eb2e1f5aee4cf0e5
   GPGSignature: Valid signature by 6A51BBABBA3D5467B6171221809A8D7CEB10B464
```
Anything under 500MB of memory failed to boot for me, likely because the initrd does not have enough space in RAM to be extracted, leading to files missing from the initramfs and the boot process failing. If kernels running on AWS / Xen instances reserve just slightly more memory for themselves during boot, then we end up with AWS systems not booting with 512MB of RAM.
I suspect that with the size of the initrd growing, low memory systems will be less and less supported as time goes on. Related discussions in https://github.com/coreos/fedora-coreos-tracker/issues/1465 & https://github.com/coreos/fedora-coreos-tracker/issues/1247.
Fixing this would require a significant amount of effort, but is not out of reach.
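To find where the threshold sits in a given environment, one could step `--memory` down with `cosa run` (the flag used in the session above). This is only a sketch: each `cosa run` drops into the serial console, so every iteration needs a manual `poweroff`, and the guard makes it a no-op where `cosa` or the image is missing:

```shell
# Hypothetical bisection loop around `cosa run --memory`.
IMAGE=fedora-coreos-38.20230806.1.0-qemu.x86_64.qcow2
if command -v cosa >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
  for mem in 512 500 490 480; do
    echo "=== trying ${mem}M ==="
    # Each run is interactive: log in, check `free -m`, then `sudo poweroff`.
    cosa run --qemu-image "$IMAGE" --memory "$mem" || echo "${mem}M: failed"
  done
else
  echo "cosa or image not available; dry run only" > cosa-note.txt
  cat cosa-note.txt
fi
```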
So while we very much want to support as many configurations and platforms as possible, we have to be honest upfront with our users that systems below a minimum bar might encounter issues at some point. Everyone is free to ignore those recommendations.

There is obviously no "good value" for this, as everyone has a different use case. The "best we can do" values are the ones we run our tests with, because we have fairly good confidence that this configuration will work. If I'm not mistaken, the current default is 1GB.
Hey travier, thanks for the answer!
I'm worried about my second point. From my perspective, this caused some downtime and cost significant manpower on our end: it used to work, and it does not anymore. Hence the discussion about what minimal amount of RAM is supported. I wish you could tell me 512M is the "officially" supported minimum, but I understand the effort required is high, so it's more of an "if it works, it works; if it does not, it does not" stance.

So now I'm left wondering: if I have an issue with 1GB RAM machines in the coming months, is it going to be considered a bug or not? (Maybe now, because it's your test machine size, but that's subject to change.) The answer directly impacts my ability to offer stability in the systems I create and maintain, as well as to provide the correct tool for our end goals.

Of course, I do not expect you to jump in and solve bugs unrelated to Fedora CoreOS itself, but what if it happens? Let's say the afterburn unit leaks a lot of memory but it's hard to figure out why: what would your stance be if the machine has 512M, 1GB, 2GB?

For example, MicroOS is clear: they support 1GB, with some caveats. I didn't test that, but if I had an issue with MicroOS not booting on a 1GB machine, I would assume they will fix it.

The difference here is that I can tell the money holders: "we use a system that officially runs on this type of machine, with these specs" (and therefore, if it stops working, everyone would expect the problem to be fixed).

So, the discussion is: can CoreOS take an official stance that 1GB is the minimum supported, for the time being?
MicroOS docs say:
I think I read that to say: you need 1G for MicroOS and you add whatever memory you need (in addition to the 1G) for your application. I think we typically fit fine within those constraints.

The problem that you are running into right now is that the initramfs won't unpack into 512M on that instance type. However, once the system is booted (gets past the initramfs), it runs fine with no apps in less than 512M of memory. If you don't layer any packages, then 512M of memory would probably continue to update fine.
I think what I'm trying to say is:
> So, the discussion is: can coreOS take an official stance that 1GB is the minimum supported for the time being?
I don't think we are going to make an official stance on this beyond the docs that were already linked. As @travier mentioned, we already run most of our tests in VMs with 1G of memory, for `x86_64` at least (see code). So we'll know if we start to breach that threshold.
Is the `initramfs` file always copied verbatim by `grub`, whatever happens, before launching the kernel? If so, wouldn't using `compress=cat` in `dracut` basically solve the problem? The current compressed `initramfs` takes ~90MB and the uncompressed image ~160MB, which would lower the failing point to ~410MB.
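For reference, the dracut knob in question is a one-line drop-in; per dracut.conf(5), `compress=` accepts `cat` to disable compression entirely (the file path below is just an example):

```
# /etc/dracut.conf.d/99-no-compress.conf  (example path)
compress="cat"
```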
I have also had a look at what takes space: `/usr/bin/ignition` takes a whopping 30MB all by itself, `/usr/bin/afterburn` 7MB, then followed by `/sbin/NetworkManager` and the `/usr/lib64/systemd/*` libs.
Ignition looks bad as I am struggling to believe that 30MB is not something that can be reduced for a program that does little technically (i.e. reading JSON and spawning external programs to do the "hard work"). Same thing, to a lesser extent, for afterburn.
NetworkManager looks pretty bad too: it's 10MB of binaries redundant with the included systemd libraries; adding the `systemd-networkd` binary and removing NM would net an 8.3MB reduction in size.
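For anyone wanting to reproduce this kind of audit, a small sketch. The stand-in files below only simulate an unpacked initramfs tree; on a real system you would unpack the image first, e.g. with `lsinitrd --unpack` or `zstd -dc initramfs.img | cpio -idm`:

```shell
# Simulate an unpacked initramfs tree (stand-in files, fake sizes),
# then list the largest regular files first.
mkdir -p initrd-root/usr/bin initrd-root/usr/lib64/systemd
head -c 300000 /dev/zero > initrd-root/usr/bin/ignition                      # stand-in
head -c 70000  /dev/zero > initrd-root/usr/bin/afterburn                     # stand-in
head -c 50000  /dev/zero > initrd-root/usr/lib64/systemd/libsystemd-core.so  # stand-in
# Biggest files first (GNU du):
find initrd-root -type f -exec du -b {} + | sort -rn | head -n 10
```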
Another way of looking at the problem is at installation time. `coreos-installer` could provide a switch to set up an install-only swap space (either in-file or on a partition), for instance, if the kernel allows at least the compressed `initramfs` to be swapped out. If swapping out works as intended, the fix for AWS would just be adding a 512MB swap file/partition by default in the cloud images.

And/or it could provide a flag to uncompress the `initramfs` on the fly while putting the binary in the EFI partition, and drop a dracut config to set `compress=cat`.
> This is especially a big problem for me because of the lack of guarantee.

> So now I'm left wondering: what if I have an issue with 1GB RAM machines in the next months, is it gonna be considered a bug or not? (Maybe now because it’s your test machine size, but that’s subject to changes.) Because the answer directly impacts my ability to offer stability in the system I create and maintain as well as providing the correct tool for our end goals.

> The difference here is that I can say to money holders that: "we use a system that officially runs on this type of machines, with these specs." (and therefore: if it does not anymore, everyone would expect the problem to be fixed)

> So, the discussion is: can coreOS take an official stance that 1GB is the minimum supported for the time being?
Fedora CoreOS is an open source project. It does not come with any guarantee of support. We try to fix as many issues as we can, but there is no guarantee that any specific issue will be fixed. We're not special here; every open source project is like that, and it's written in the license.
I'm not saying that this will never be fixed or that we won't accept a PR to fix it. As I wrote in https://github.com/coreos/fedora-coreos-tracker/issues/1540#issuecomment-1682064290, fixing this is not easy (otherwise we would likely be doing it).
Instead, we're suggesting workarounds. One of those (lost to chat) is:
As Ignition is the largest binary, we could consider stripping it and removing debug info, as we don't really expect users to debug Ignition in the initramfs: https://gophercoding.com/reduce-go-binary-size/
> As Ignition is the largest binary, we could consider stripping it and removing debug info as we don't really expect user to debug Ignition in the initramfs: https://gophercoding.com/reduce-go-binary-size/
IIUC our binary as delivered by the RPM is already stripped and without debug_info:

```
$ file /usr/lib/dracut/modules.d/30ignition/ignition
/usr/lib/dracut/modules.d/30ignition/ignition: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=c9346043c36089a4161d63c33b96b4d80ee642eb, for GNU/Linux 3.2.0, stripped
```
> is already stripped and without debug_info
Indeed, I have just checked.
The swap "trick" does not work (who would have thunk `initramfs` wasn't swappable?). Neither does uncompressing the initramfs, as the kernel creates a tmpfs and copies the content of the initcpio anyway (so it makes things worse).
Would you consider using `upx` on `ignition` (and a few other binaries) before inclusion in the `initramfs`? This reduces the executable size from 28MB (on main) to 7.5MB.
More generally, compressing ignition, afterburn, nmcli, bash and NetworkManager that way reduces the uncompressed size from 156MB to 131MB, keeping the same features!
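For the record, a sketch of what that experiment could look like, assuming `upx` is installed and using the Ignition path quoted earlier in this thread (the block degrades to a note where either is missing):

```shell
# Compress a copy of the Ignition binary with upx and compare sizes.
BIN=/usr/lib/dracut/modules.d/30ignition/ignition   # path from the `file` output above
if command -v upx >/dev/null 2>&1 && [ -f "$BIN" ]; then
  cp "$BIN" ./ignition
  du -h ./ignition                 # size before (~28M reported above)
  upx --best --lzma ./ignition     # compress in place
  du -h ./ignition                 # size after (~7.5M reported above)
else
  echo "upx or ignition binary not available; nothing to do" > upx-note.txt
  cat upx-note.txt
fi
```

Note that upx-packed binaries decompress themselves into memory at exec time, so the on-disk saving does not necessarily translate into a lower peak-RAM footprint.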
Is this where compression is set for `dracut`? I am asking because this is not consistent with the output of `binwalk`, which does not see any `zstd` in there.
For what it's worth, recompressing with `xz -e9` reduces the (original) initramfs from 85.5MB to 69.0MB (not strictly equivalent, because tar instead of cpio was used for recompression). `zstd -19` still gives 72.7MB.

With the few individually compressed binaries, both the `zstd -19` and `xz -e9` initramfs get 2-3MB bigger (74.8MB and 72.4MB).

That's a compound saving of 35MB at worst.
Tangentially, this is a strong case against Go on constrained systems (and I would argue the `initramfs` is one). Just for reference, a statically linked `python` is 30MB; stripped of its debug info, it is 5.5MB. Once `upx`ed, it is 1.9MB, and it is usable... to read JSON... and launch subprocesses...

There is no clear path to binary reduction in Go. There is no `-Os` or equivalent option, and apparently no interest upstream in controlling binary bloat a bit more. A few more tools like Ignition and the minimum requirement will become much higher than that of any fancy desktop operating system, which will then be a problem.
NOTE: I wrote this response last Friday, but realize just today I never clicked to make the comment (it was in an open tab). I'm submitting it now, but some of the info may be outdated or the conversation could have moved on.
> Is the `initramfs` file always copied verbatim by `grub` whatever happens before launching the kernel? If so, wouldn't using `compress=cat` in `dracut` basically solve the problem? The current compressed `initramfs` takes ~90MB and the uncompressed image ~160MB that would lower the failing point to ~410MB.
Yes. Using `compress=cat` would solve the problem (the problem being the decompression of the compressed initramfs running out of memory). But it can/will lead to other problems, because our `/boot/` filesystem isn't large. See https://github.com/coreos/fedora-coreos-tracker/issues/1247 and https://github.com/coreos/fedora-coreos-tracker/issues/1465
If you were to make the `compress=cat` change locally, I imagine you'd hit some trouble eventually. Though you could experiment with one of the other compression algorithms, which may be less memory intensive during decompression.
> I have also had a look at what takes space: `/usr/bin/ignition` takes a whopping 30MB all by itself and `/usr/bin/afterburn` 7MB then followed by `/sbin/NetworkManager` and `/usr/lib64/systemd/*` libs.
>
> Ignition looks bad as I am struggling to believe that 30MB is not something that can be reduced for a program that does little technically (i.e. reading JSON and spawning external programs to do the "hard work"). Same thing, to a lesser extent, for afterburn.
This is part of the downsides of the Go and Rust programming languages. I would love to make those binaries smaller, but don't have any ideas other than a rewrite of the software, which would represent significant investment.
> Network Manager looks pretty bad too: it's 10MB of binaries redundant with included systemd libraries: adding the `systemd-networkd` binary and removing nm would net a 8.3MB reduction in size.
We chose NM for the networking stack a long time ago. The media that we ship will continue to do so unless something significant changes.
> Another way of looking at the problem is at installation time. `coreos-installer` could provide a switch to setup an install-only swap space (either in-file or on partition) for instance (if the kernel allows at least the compressed `initramfs` to be swapped out). If swapping out works as intended, the fix for AWS would just be adding a 512MB swap file/partition by default in the cloud images.
Honestly this stuff is happening so early in boot I doubt a swap file would matter at all.
> And/or it could provide a flag to uncompress `initramfs` on-the-fly while putting the binary in the EFI partition and drop a dracut config to set `compress=cat`.
> Would you consider using `upx` on `ignition` (and a few other binaries) before inclusion to `initramfs`? This reduces the executable size from 28MB (on main) to 7.5MB.
Interesting. TIL about upx. Honestly I'm not really sure of the drawbacks but I feel like the reward/risk ratio might be pretty low here.
Has anyone else following this thread used it?
> More generally, compressing ignition, afterburn, nmcli, bash and NetworkManager that way reduces the uncompressed size from 156MB to 131MB, keeping the same features!
> Is this where compression is set for `dracut`? I am asking because this is not consistent with the output of `binwalk` that does not see any `zstd` in there.
Yes that should be the place it's controlled. See https://github.com/coreos/fedora-coreos-config/pull/1844 and https://github.com/coreos/fedora-coreos-tracker/issues/1247#issuecomment-1183925321. It reduced the size and reduced the amount of time to decompress.
> For what it's worth, recompressing with `xz -e9` reduces (original) initramfs from 85.5MB to 69.0MB (not strictly equivalent because tar instead of cpio for recompression). `zstd -19` still gives 72.7MB.
>
> With the few individually compressed binaries, both `zstd -19` and `xz -e9` initramfs get 2-3MB bigger (74.8MB and 72.4MB). That's a compound save of 35MB at worse.
I'm not sure exactly what you're advocating for here. The problem we are running into is running out of memory when decompressing and extracting the initramfs. So what we need to do is make sure that the decompression and extraction (both happening in memory) don't step over 512M. It's a combination of things, not just the compressed initramfs size, that dictates whether we fail here.
For example, maybe the `xz` option makes the compressed initrd smaller, but `xz` is memory intensive on decompression, so it doesn't matter and we still run out of memory.
> Yes. Using compress=cat would solve the problem

Apparently, though, the kernel will make a copy whatever happens, so the possibly "cat-compressed" archive will be put in memory first by the bootloader, and the kernel will copy it over to the `tmpfs`-based rootfs.
> but don't have any ideas other than a rewrite of the software, which would represent significant investment.
That's the spirit of my last ("tangent") comment: I know this won't be rewritten and I know there is no trivial nor not-so-trivial way to reduce go binary size. I've had a look: we are in the same boat.
All I am saying is: when a new feature is discussed for implementation in Fedora, if that thing must make its way into the initramfs, I would greatly appreciate it if the issue of binary size were raised with implementers so that they consider the language thoroughly. `fcos` runs under 200MB once fully booted; add a few more Ignition-like binaries to the initramfs, and before you know it you'll demand 2GB+, then 4GB+.
> Yes that should be the place it's controlled.
Thanks!
> maybe the xz option makes the compressed initrd smaller, but xz is memory intensive on the decompress so it doesn't matter and we still run out of memory.
Excellent point: I was just thinking in terms of what is in memory at any given time, which would be {bootloader + compressed kernel/initramfs}, then {kernel + compressed initramfs + tmpfs with uncompressed initramfs}, then {kernel + tmpfs}. At the moment stage 2 seems to be the blocking one (hence my little calculation), and I had assumed decompression occurred on (very) small chunks of memory, but that was a baseless assumption on my part!
I need to run some tests. I would like to see how the kernel handles a multi-layered initramfs (i.e. with multiple cpio archives, like we currently have for the microcode), especially with respect to memory allocation.
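As a starting point for those tests: the kernel already accepts several concatenated cpio archives as a single initrd (that is exactly how early microcode is shipped, an uncompressed cpio followed by the compressed main image). A minimal sketch that builds such a two-layer image, which can then be handed to `qemu -initrd` to observe the memory behavior:

```shell
# Build a two-layer initrd: uncompressed first member, gzipped second
# member, concatenated into one file (degrades to a note without cpio).
if command -v cpio >/dev/null 2>&1; then
  mkdir -p layer1/early layer2/main
  echo microcode-like > layer1/early/blob
  echo main-content   > layer2/main/file
  ( cd layer1 && find . | cpio -o -H newc ) > layer1.cpio 2>/dev/null
  ( cd layer2 && find . | cpio -o -H newc | gzip ) > layer2.cpio.gz 2>/dev/null
  cat layer1.cpio layer2.cpio.gz > combined-initrd.img
  ls -l combined-initrd.img   # pass this to `qemu ... -initrd combined-initrd.img`
else
  echo "cpio not installed" > cpio-note.txt
fi
```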
Alternatively, there could be a way to extract all the first-boot bits into their own image that is brought up by a thinner initramfs on first boot, and potentially removed once first boot succeeds. E.g. ignition and afterburn could literally be dropped into the ESP by `coreos-installer` and deleted on success?
> This is part of the downsides of the Go and Rust programming languages. I would love to make those binaries smaller, but don't have any ideas other than a rewrite of the software, which would represent significant investment.
I wouldn't conflate Go and Rust in this respect. It very much depends, and rewriting (in what?) isn't necessarily going to make things smaller!

One concrete drawback of Go specifically is called out in https://github.com/u-root/u-root/issues/1477#issue-533334548, and Ignition is a heavy user of `reflect`.
> Yes. Using compress=cat would solve the problem
> Apparently though the kernel will make a copy whatever happens so the possibly "cat-compressed" archive will be put in memory first by the bootloader and the kernel will copy them over to the `tmpfs`-based rootfs.
I did do some tests with `compress=cat` last Friday (I was stuck at a car dealership and was bored) and it did seem to help for me. Though, as mentioned in https://github.com/coreos/fedora-coreos-tracker/issues/1540#issuecomment-1687298513, this approach can/will lead to other problems because our `/boot/` filesystem isn't large. See https://github.com/coreos/fedora-coreos-tracker/issues/1247 and https://github.com/coreos/fedora-coreos-tracker/issues/1465
Describe the bug

Hi 👋

Since the last stable version, CoreOS does not boot anymore on the AWS `nano` instance type. These machines have `512M` of RAM.

Reproduction steps

Boot Fedora CoreOS stable on a `t3a.nano`.

Expected behavior

Either fix the problem by lowering the footprint of the first FCOS initialization, or direct me to ways to not shadow things in RAM during initialization, or be clear about the expected specs for CoreOS?

Things to consider:

Actual behavior

Relevant errors in log:

Bigger-spec machines boot with the same configuration.

System details

AWS t3a.nano, Fedora CoreOS stable 38.20230722.3.0 x86_64
Butane or Ignition config
No response
Additional information
No response