makeInitrdNG: malformed squashfs images when filesize exceeds 2GB

jhvst commented 1 year ago

Describe the bug

If you create an initrd file over 2GB (for example, with this configuration) and try to boot it, the bootup fails when mounting the squashfs image in init-1-stage with the error "squashfs error unable to read id index table". As suggested, e.g., by #26230, this is indeed caused by data corruption. However, the corruption does not happen in transport; it happens when creating the initrd. Furthermore, and this is the real bug, the data corruption only happens when the resulting initrd exceeds 2GB in size. The corruption can be verified by comparing the sha256sum of the squashfs image in two places: first on the computer that built the image (i.e., at the Nix store path echoed near the end of the build process), and then on the computer that tries to boot the image, in the initrd root folder from the emergency shell. The files will be identical in size but differ in their hashes. If you use the emergency shell to fetch the squashfs image from the Nix store and manually initialize init-2-stage, the OS boots successfully.
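The comparison described above can be sketched in Python for cases where both copies of the image end up reachable from one machine (for example, after copying the image out of the emergency shell). This is only a sketch: `sha256_of` and `first_difference` are hypothetical helper names, not part of any tool mentioned in this report. `first_difference` reports the byte offset at which two files diverge, which may help tell whether the damage starts near the 2GB mark.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the exact byte inside the differing chunk.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended without any difference
            offset += len(chunk_a)
```

Calling `sha256_of` on both copies reproduces the hash check; if the digests differ, `first_difference` gives a starting point for inspecting the damaged region.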

Steps To Reproduce

Steps to reproduce the behavior:

1. Build an initrd file which exceeds 2GB in size. For example, with my configuration run nix-build -A pix.ipxe nvidia.nix -I home-manager=https://github.com/nix-community/home-manager/archive/master.tar.gz -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/refs/heads/nixos-unstable.zip. Alternatively, you can download a pre-built version of mine.
2. Modify the kernelModules in the Nix config file to include the drivers you need to fetch the original squashfs image. These would be keyboard, filesystem, and/or network drivers. Alternatively, you can download the kernel that I used.
3. Ensure you are running the latest Linux kernel on your current system. We are going to use kexec, which only recently gained support for files over 2GB in size. linux_latest from nixpkgs will do.
4. In the result folder from step 1, you will find a file called kexec-boot. If you modified the Nix configuration file to include kernelModules for your system, you can execute this script. If you decided to use my prebuilt images, run kexec --load phasedKernel --initrd=initrd -c "boot.shell_on_fail". Then, when you are ready to halt the system, run kexec -e.
5. During the bootup, you should see the squashfs error described above. Press f for the emergency shell. You will find the malformed squashfs in the root folder. Check its sha256sum. It should differ from that of your self-built image, or of the file at http://boot.ponkila.com/squashfs.img. Bug reproduced. Done.
6. You can continue the bootup process by acquiring the original squashfs image used in the build process. I store mine here. With the original squashfs acquired, delete the one found in the initrd, and give the downloaded/mounted squashfs the same name as the original file. Then, run ./init or prepare the filesystem manually to eventually launch switch_root. You can refer to this document for more details. Finally, the OS will boot successfully.

Expected behavior

I expect an initrd over 2GB to work. Currently, it does not: I'm unable to boot. Reproducing the failure with kexec shows that this is not a BIOS-related issue.

Screenshots

N/A

Additional context

I wrote more details here. I have triaged this issue a lot, ruling out BusyBox issues, UEFI compatibility, RAM running out, and kernel configuration issues with tmpfs files over 2GB. I have looked at the current implementation of makeInitrdNG, which I believe is at least related to the bug, but I cannot see how this issue could arise from the current Rust code. This bug does not seem to be an issue with any of the userspace or kernel code; hence, it appears to be specific to the way the images are built on Nix.

This issue is probably not very high in priority -- in practice, the 2GB limit can be circumvented by modifying the init-1-stage script to download a rootfs over the Internet, much like what is done in reproducing this bug. However, I don't think this issue should exist unless something is going awry in the initrd build process; hence, it should probably be looked at.

Notify maintainers

@dasJ @elvishjerricco @k900 @lheckemann

Metadata

N/A

K900 commented 1 year ago

The Rust bits don't touch the contents of the initrd at all, they're just copied and then fed to cpio. I wonder if you're hitting some cpio bug.
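For context on why cpio is a plausible suspect: in the "newc" cpio format that the Linux kernel unpacks for initramfs, every header field, including the file size, is stored as 8 hexadecimal ASCII digits. A single member therefore cannot be described beyond 2^32 - 1 bytes, and any tool that holds that value in a signed 32-bit integer would go wrong at 2GB. Whether cpio actually does so here is not established in this thread. The Python sketch below lists the members and recorded sizes of an uncompressed newc archive held in memory; the function names are made up for illustration.

```python
NEWC_MAGIC = b"070701"
HEADER_LEN = 110  # 6-byte magic + 13 fields of 8 hex digits each
FIELDS = ("ino", "mode", "uid", "gid", "nlink", "mtime", "filesize",
          "devmajor", "devminor", "rdevmajor", "rdevminor", "namesize", "check")
MAX_NEWC_FILESIZE = 0xFFFFFFFF  # largest size that 8 hex digits can encode


def parse_newc_header(raw):
    """Decode one 110-byte newc header into a dict of integers."""
    if len(raw) < HEADER_LEN or raw[:6] != NEWC_MAGIC:
        raise ValueError("not a newc cpio header")
    values = {}
    for index, name in enumerate(FIELDS):
        start = 6 + 8 * index
        values[name] = int(raw[start:start + 8].decode("ascii"), 16)
    return values


def align4(n):
    """Round n up to the next multiple of 4 (newc pads names and data)."""
    return (n + 3) & ~3


def list_newc_members(data):
    """Yield (name, filesize) for each member of an uncompressed newc archive."""
    offset = 0
    while offset + HEADER_LEN <= len(data):
        header = parse_newc_header(data[offset:offset + HEADER_LEN])
        name_start = offset + HEADER_LEN
        # namesize counts the trailing NUL byte, which is dropped here.
        name = data[name_start:name_start + header["namesize"] - 1].decode()
        if name == "TRAILER!!!":
            return  # end-of-archive marker
        yield name, header["filesize"]
        data_start = align4(name_start + header["namesize"])
        offset = align4(data_start + header["filesize"])
```

Running this over a decompressed initrd segment would show whether the size recorded for the squashfs member matches the size of the file in the Nix store.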

jhvst commented 1 year ago

Sounds plausible, given that cpio is quite old, to my understanding. A bit more context from my triaging: the Linux kernel code for kexec had a bug/feature regarding the 2GB size due to the use of the int datatype in the read_file code: https://lore.kernel.org/lkml/20220527025535.3953665-2-pasha.tatashin@soleen.com/T/
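To illustrate why 2GB is a recurring threshold: a signed 32-bit int tops out at 2^31 - 1 = 2,147,483,647 bytes, one byte short of 2 GiB. The sketch below uses Python's `ctypes.c_int32` to mimic what happens when a larger size is stored in such a variable. It only demonstrates the arithmetic and is not the kernel code from the linked patch; `as_int32` is a name invented for this example.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def as_int32(size_in_bytes):
    """Return the value a signed 32-bit int would hold for this size."""
    # ctypes integer types do no overflow checking, so the value wraps.
    return ctypes.c_int32(size_in_bytes).value


two_gib = 2 * 1024**3
print(as_int32(INT32_MAX))  # 2147483647: still representable
print(as_int32(two_gib))    # -2147483648: wraps to a negative number
```

A length that turns negative like this is typically rejected or misinterpreted by the code that consumes it, which is one way a hard 2GB ceiling can appear.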

@K900 do you think this issue should be kept open until the issue is found from the dependencies, or closed as unrelated to NixOS?

K900 commented 1 year ago

I think we should at least look into this and maybe fix our cpio build to make it work.

tupakkatapa commented 6 months ago

Found a lot of threads where people discuss this and the magic value of 2GB where problems start to occur. Then I found the following, which actually makes a lot of sense:

"Downloads are placed in the 32-bit address space (i.e., below 4GB). It is very plausible that a large chunk of this address space is allocated for PCI BARs, and so you may have only ~2GB of actual RAM within this address space." https://ipxe-devel.ipxe.narkive.com/YA1ZoMfx/size-limit-in-ipxe

Unfortunately, I think it is game over regarding this; we are stuck with the squashfs method. Or what do you guys think?

lheckemann commented 6 months ago

Yeah, the only solution I see for that is a more complex setup where only a small initrd is fetched at PXE time, and that initrd then obtains the Nix store squashfs from the netboot server some other way (or gets the Nix store by other means: depending on the use case, NFS or similar might make sense too and allow for faster boots). That would require setting up networking and so on in the initrd, though.

tupakkatapa commented 6 months ago

That is exactly what we are currently using as a workaround. It required some changes to the stage-1-init script: https://github.com/majbacka-labs/nixpkgs/commits/patch-init1sh. It should be stated that this does not have anything to do with fixing kexec, which is annoying since I would like to have both functionalities for the same output format.

If you are also convinced that this does not directly relate to nixpkgs, as far as I am concerned, this issue should be closed.

jhvst commented 6 months ago

The 2GB limit has nothing to do with PXE. The limit is most likely a cpio bug.

The PXE forum thread linked above is irrelevant in this context because: 1) the initrd is created post-boot while using 64-bit addressing, and 2) the squashfs is mounted in the init1 stage, which can use 64-bit addressing. Kexec itself supports images of over 4GB, so that is not the problem.

The problem you linked would only make sense before the kernel is started. But the moment the kernel is started, which is always the case when you are handling squashfs, you have full access to RAM.

To fix this problem, the way cpio is packaged, along with the arguments passed to it, has to be reviewed.

Juuso

ElvishJerricco commented 6 months ago

@jhvst Isn't the argument that the initrd itself, which is loaded before the kernel starts, is too big because it contains the squashfs? i.e. The problem is not mounting the squashfs; it's loading the initrd in the first place.

jhvst commented 6 months ago

I do not think so, but I may be wrong: the shell_on_fail boot option drops you into an initrd environment, right? The bootup certainly works up to this stage. Moreover, the squashfs image can be fetched from the network while in the supposed initrd environment, and the boot then continues successfully (we do this quite often with @tupakkatapa, as we daily-drive our patchset which implements the workaround suggested by @lheckemann above). So it seems that the initrd is actually fine, and the problem is with the squashfs file, which exceeds the 2GB limit. It might of course be possible that the initrd is somehow truncated to an extent, but that affecting only one file and not the integrity of anything else seems unlikely to me given the current information.

FWIW, we wish to eventually upstream the changes for the workaround (see: #203750), but I guess an alternative solution would be to troubleshoot this issue. However, both options are low priority for us, while the required effort seems coincidentally high.

My current ETA is that I may take a look at this sometime in June/July, but cannot promise any resolution.

ElvishJerricco commented 6 months ago

@jhvst Look at how this initrd is created: https://github.com/NixOS/nixpkgs/blob/fd2ac5b6f33ef73d6019587eb1e7982db46eab15/nixos/modules/installer/netboot/netboot.nix#L90-L99

The ordinary initrd is used in prepend, meaning this initrd contains two compressed initrds; one is the normal one, and then after it is one containing only the squashfs.

This means that if the initrd is being truncated during load because it is too long, only the squashfs would be affected.

So I think my explanation is still very likely correct.
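A toy model of this argument, in Python: two byte strings stand in for the prepended ordinary initrd and the trailing archive that holds the squashfs, and a hypothetical loader keeps only the first `limit` bytes. The sizes and the `load_with_limit` helper are invented for illustration and do not model any real bootloader.

```python
import hashlib


def load_with_limit(image, limit):
    """Hypothetical loader that silently drops everything past `limit` bytes."""
    return image[:limit]


def digest(data):
    """Hex SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


ordinary_initrd = b"A" * 1000    # stands in for the prepended normal initrd
squashfs_archive = b"B" * 3000   # stands in for the archive holding the squashfs
combined = ordinary_initrd + squashfs_archive

loaded = load_with_limit(combined, limit=2048)
loaded_first = loaded[:len(ordinary_initrd)]
loaded_second = loaded[len(ordinary_initrd):]

print(digest(loaded_first) == digest(ordinary_initrd))    # True: first part intact
print(digest(loaded_second) == digest(squashfs_archive))  # False: the tail was cut
```

Note that the original report observed an unchanged file size with a different hash, which a plain truncation model like this does not reproduce on its own.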

jhvst commented 6 months ago

Now I see, makes sense then. This might be a good test case then. I will most likely start debugging this expression. Thanks!

jhvst commented 1 month ago

It seems that this issue has now resolved itself -- initrds over 2GB boot fine. Thanks to everyone who shared their comments. This has been tested both on kexec and via ipxe netboot.
