NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

vm build via linux-builder vm sometimes fails on macOS when copying files to nix-store-image #292737

Open pitkling opened 7 months ago

pitkling commented 7 months ago

Steps To Reproduce

Steps to reproduce the behavior:

  1. On an M1 mac, have a linux-builder vm running. I have one configured in my system configuration via
    nix.linux-builder = with pkgs; {
      enable = true;
      ephemeral = true;
      maxJobs = 4;
    };
  2. Build the following flake via nix build .#linux-builder.config.system.build.macos-builder-installer

    {
      inputs = {
        home-manager.url     = "github:nix-community/home-manager/release-23.11";
        nixpkgs.url          = "github:NixOS/nixpkgs/nixos-23.11"               ;
        nixpkgs-darwin.url   = "github:NixOS/nixpkgs/nixpkgs-23.11-darwin"      ;
        nixpkgs-unstable.url = "github:NixOS/nixpkgs/nixpkgs-unstable"          ;
      };
    
      outputs = inputs@{ self, nixpkgs, ... }: {
        linux-builder = let
          system = "aarch64";
        in nixpkgs.lib.nixosSystem {
          modules = [
            ({ lib, ... }: {
              imports = [ "${nixpkgs}/nixos/modules/profiles/macos-builder.nix" ];
              nix.registry = lib.mapAttrs (key: value: { flake = value; }) inputs;
              nixpkgs.hostPlatform = "${system}-linux";
              system.stateVersion = "23.11";
              virtualisation.host.pkgs = nixpkgs.legacyPackages."${system}-darwin";
            })
          ];
        };
      };
    }
  3. The build fails during a call to cptofs while building nix-store-image.drv, with the error: ERROR: cptofs failed. diskSize might be too small for closure.

Build log

See this gist for the log.

Additional context

I put together the above flake as a minimal failing example. Note that most of the flake inputs are not even used, but when I comment one of them out, the flake builds again. In intermediate steps I saw even stranger behavior. For example, the actual flake where I first hit this issue lives in a git repository. If I delete the .git directory, it builds.

For now, I would be happy just to know whether other people can reproduce this or whether the problem is on my side.

Notify maintainers

@roberth, @Gabriella439 Since this seemingly happens inside the macos-builder. But I'm not sure whether this is the actual culprit…

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
this path will be fetched (0.01 MiB download, 0.05 MiB unpacked):
  /nix/store/1g904ji4xfqzh475f4jpp4b2v6wrhg5y-stdenv-darwin
copying path '/nix/store/1g904ji4xfqzh475f4jpp4b2v6wrhg5y-stdenv-darwin' from 'https://cache.nixos.org'...
 - system: `"aarch64-darwin"`
 - host os: `Darwin 23.3.0, macOS 14.3.1`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.18.1`
 - channels(root): `"nixpkgs"`
 - nixpkgs: `/nix/store/1w3f098kdy51qnyqfmmvvcqlq5d90jbc-source`

Gabriella439 commented 7 months ago

I'm guessing that your Linux builder ran out of disk space. Running nix-collect-garbage on the builder should free up space. You can also deploy a Linux builder with more disk space. I'm not sure whether nix-darwin makes that configurable, but the macos-builder.nix module has an option for it:

https://github.com/NixOS/nixpkgs/blob/df41961bd4b7e838cb997543ea1297f4cbd7da15/nixos/modules/profiles/macos-builder.nix#L31-L36

pitkling commented 7 months ago

> I'm guessing what is happening is that your Linux builder ran out of disk space. [...]

Ah, I should have mentioned: that's what I thought at first, but I logged into the linux-builder and ran df -h several times during the build. This is the output with the highest disk usage, immediately before the error:

[builder@nixos:~]$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
devtmpfs                      149M     0  149M   0% /dev
tmpfs                         1.5G     0  1.5G   0% /dev/shm
tmpfs                         742M  6.4M  736M   1% /run
tmpfs                         1.5G  2.5M  1.5G   1% /run/wrappers
/dev/disk/by-label/nixos       20G  5.7G   13G  31% /
certs                         461G  438G   24G  95% /etc/ssl/certs
/dev/disk/by-label/nix-store  1.3G  917M  247M  79% /nix/.ro-store
shared                        461G  438G   24G  95% /tmp/shared
xchg                          461G  438G   24G  95% /tmp/xchg
keys                          461G  438G   24G  95% /var/keys
overlay                        20G  5.7G   13G  31% /nix/store
tmpfs                         297M  4.0K  297M   1% /run/user/1000

So it seems there should be plenty of space left.

roberth commented 7 months ago

Looks like nixos/lib/make-disk-image.nix makes an estimate that's too small when producing the image for the nix store (useNixStoreImage, nixos/modules/virtualisation/qemu-vm.nix).

pitkling commented 7 months ago

Thanks @roberth for the hint about nixos/lib/make-disk-image.nix. Unfortunately, after some investigation it seems that's not the case (if I'm not overlooking something…).

To check whether the disk image estimate is too small, I tried to keep the build product in order to inspect it. However, --keep-failed does not seem to work with remote builders. Instead, I adapted the flake slightly, replacing the line virtualisation.host.pkgs = nixpkgs.legacyPackages."${system}-darwin"; with virtualisation.host.pkgs = nixpkgs.legacyPackages."${system}-linux";, and copied the flake via ssh onto my remote builder. I then built a VM image directly on the builder via nix build ./#linux-builder.config.system.build.vm --keep-failed.

This fails with the same error message (ERROR: cptofs failed. diskSize might be too small for closure.). The temporary build directory contains the following:

[builder@nixos:/tmp/nix-build-nix-store-image.drv-0]$ ls -lh
total 2.1G
-rw-r--r-- 1 nixbld1 nixbld 4.1K Mar  3 10:53 env-vars
-rw-r--r-- 1 nixbld1 nixbld 3.7G Mar  3 10:57 nixos.raw
drwxr-xr-x 4 nixbld1 nixbld 4.0K Mar  3 10:53 root
drwxr-xr-x 6 nixbld1 nixbld 4.0K Mar  3 10:53 state

If I understand the code in nixos/lib/make-disk-image.nix correctly, the failing call to cptofs tries to copy the content of root/nix/store onto the disk image nixos.raw, which at that moment should be empty and have size roughly 3.5 GB. Checking the size of root/nix/store via du -hs gives

[builder@nixos:/tmp/nix-build-nix-store-image.drv-0]$ du -hs root/nix/store
2.0G    root/nix/store

So it should easily fit. Just to be sure, I mounted the raw image nixos.raw and checked its free capacity via df -h:

[builder@nixos:/tmp/nix-build-nix-store-image.drv-0]$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
devtmpfs                      149M     0  149M   0% /dev
tmpfs                         1.5G     0  1.5G   0% /dev/shm
tmpfs                         742M  6.4M  736M   1% /run
tmpfs                         1.5G  2.5M  1.5G   1% /run/wrappers
/dev/disk/by-label/nixos       20G  7.8G   11G  42% /
certs                         461G  436G   25G  95% /etc/ssl/certs
/dev/disk/by-label/nix-store  1.3G  917M  247M  79% /nix/.ro-store
shared                        461G  436G   25G  95% /tmp/shared
xchg                          461G  436G   25G  95% /tmp/xchg
keys                          461G  436G   25G  95% /var/keys
overlay                        20G  7.8G   11G  42% /nix/store
tmpfs                         297M  4.0K  297M   1% /run/user/1000
/dev/loop0                    3.5G  2.0G  1.4G  59% /home/builder/mnt-tmp

The last entry is for the nixos.raw image. So at the moment the build fails, the image still has more than 40% free space. This also shows again that the builder's own disk image (5th entry, /dev/disk/by-label/nixos) still has plenty of capacity left.
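Note that df -h reports block usage only; inode usage is a separate limit (df -i shows it). A minimal sketch, assuming a Unix-like system with Python available, that reports both for the filesystem containing a given path. The function name `usage` and the dictionary keys are illustrative and not part of any tool mentioned in this thread.

```python
import os

def usage(path="/"):
    """Return block and inode usage for the filesystem containing `path`."""
    st = os.statvfs(path)
    blocks_used = st.f_blocks - st.f_bfree
    inodes_used = st.f_files - st.f_ffree
    return {
        "blocks_used_pct": 100.0 * blocks_used / st.f_blocks if st.f_blocks else 0.0,
        # Some filesystems report 0 total inodes; no percentage is defined then.
        "inodes_used_pct": 100.0 * inodes_used / st.f_files if st.f_files else None,
        "inodes_free": st.f_ffree,
    }

if __name__ == "__main__":
    # Replace "/" with the mount point of the image under inspection.
    print(usage("/"))
```

A filesystem can show plenty of free blocks and still refuse to create files once inodes_free reaches zero.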

Am I overlooking something? What else could cause the failure? Not sure whether it helps, but here's also the output of tune2fs -l for the nixos.raw image:

[builder@nixos:/tmp/nix-build-nix-store-image.drv-0]$ tune2fs -l nixos.raw 
tune2fs 1.47.0 (5-Feb-2023)
Filesystem volume name:   nix-store
Last mounted on:          /mnt/0000fe00
Filesystem UUID:          287cfdab-a4fe-48fb-9e77-d9dd303c6c37
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         unsigned_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              238560
Block count:              952576
Reserved block count:     47628
Overhead clusters:        35090
Free blocks:              412136
Free inodes:              0
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      465
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         7952
Inode blocks per group:   497
Flex block group size:    16
Filesystem created:       Sun Mar  3 10:53:31 2024
Last mount time:          Sun Mar  3 10:57:11 2024
Last write time:          Sun Mar  3 10:57:11 2024
Mount count:              2
Maximum mount count:      -1
Last checked:             Sun Mar  3 10:53:31 2024
Check interval:           0 (<none>)
Lifetime writes:          2465 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      4b5b9340-0387-45d4-a359-1ee21fc7afb1
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x716951f8

pitkling commented 7 months ago

After another look at the logs above: the problem is not that nixos/lib/make-disk-image.nix underestimates the capacity, but that it does not make sure enough inodes are available (the tune2fs -l output in my last comment shows that the raw image has no free inodes left).
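The mismatch can be read directly off the tune2fs -l fields. A minimal sketch, assuming the output has been captured as text; `parse_tune2fs` is a hypothetical helper, and the sample contains only the relevant lines quoted from the output in this thread.

```python
def parse_tune2fs(text):
    """Parse `tune2fs -l` output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a "name: value" shape
            fields[key.strip()] = value.strip()
    return fields

# Relevant lines copied from the tune2fs -l output for nixos.raw.
SAMPLE = """\
Inode count:              238560
Block count:              952576
Free blocks:              412136
Free inodes:              0
Block size:               4096
"""

info = parse_tune2fs(SAMPLE)
free_bytes = int(info["Free blocks"]) * int(info["Block size"])
free_inodes = int(info["Free inodes"])
print(f"free space: {free_bytes / 2**30:.2f} GiB, free inodes: {free_inodes}")
if free_inodes == 0:
    print("no free inodes: new files cannot be created despite free blocks")
```

About 1.57 GiB of blocks are free, yet every one of the 238560 inodes is in use, so the copy fails on the next file it tries to create.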

I tested this in a local branch by adding -N $numInodes to the mkfs call ($numInodes is already computed by make-disk-image.nix). With this, the flake succeeds.
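For context on why the image runs out of inodes without -N: when no inode count is given, mke2fs derives one from the filesystem size. The sketch below is a simplified model of that derivation, not the actual mke2fs code. It assumes the stock default of one inode per 16384 bytes (inode_ratio in mke2fs.conf) is in effect on the builder; the function name is hypothetical, and the other parameters are taken from the tune2fs -l output above.

```python
import math

def default_ext4_inodes(block_count, block_size=4096, bytes_per_inode=16384,
                        blocks_per_group=32768, inode_size=256):
    """Approximate the inode count mke2fs picks when -N is not given."""
    groups = math.ceil(block_count / blocks_per_group)
    # One inode per `bytes_per_inode` bytes of filesystem size.
    target = block_count * block_size // bytes_per_inode
    per_group = math.ceil(target / groups)
    # Inode tables occupy whole blocks, so round up to a full block of inodes.
    inodes_per_block = block_size // inode_size
    per_group = math.ceil(per_group / inodes_per_block) * inodes_per_block
    return per_group * groups

# Block count 952576 is the value reported for nixos.raw.
print(default_ext4_inodes(952576))
```

Under these assumptions the model yields the same 238560 inodes that tune2fs reports, so a closure with more files, directories, and symlinks than that exhausts the inodes no matter how many blocks remain free.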

I couldn't find who's maintaining make-disk-image.nix, but @samueldr added the inode computation a few years back, so maybe he knows whether it is a good idea to explicitly pass the inode count to the mkfs call or whether there is a better way to handle this?

roberth commented 7 months ago

Looks like the inode computation was only added for the purpose of reserving space in the block device. Not reserving them in the file system seems like an oversight, not something intentional. I'm inclined to just go ahead with -N.

roberth commented 7 months ago

Maybe add some margin? The image may be written to later in some usages of that function, e.g. block level copy on write for a VM.

pitkling commented 6 months ago

Sorry for the late reply, was busy with work. Anyway, thanks for your quick assessment, @roberth. I will prepare a corresponding pull request, incorporating some margin. The space computation also has some margin that corresponds to roughly 5% of the calculated disk usage (which includes the storage reserved for the inodes). So for consistency I'll probably just take the same margin for the number of inodes.
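A minimal sketch of the margin idea, assuming the same relative margin of roughly 5% is applied to the inode count before passing it to mkfs via -N. The helper names and the input of 300000 are hypothetical and only illustrate the arithmetic; they are not taken from make-disk-image.nix or the pull request.

```python
def inodes_with_margin(needed_inodes, margin_percent=5):
    """Return `needed_inodes` plus a relative margin, rounded up (integer math)."""
    return (needed_inodes * (100 + margin_percent) + 99) // 100

def mkfs_inode_args(needed_inodes, margin_percent=5):
    """Build the `-N <count>` arguments for an mkfs.ext4 call as a list."""
    return ["-N", str(inodes_with_margin(needed_inodes, margin_percent))]

# Hypothetical closure needing 300000 inodes.
print(mkfs_inode_args(300_000))
```

Integer arithmetic avoids floating-point rounding surprises when the count is later interpolated into a command line.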

bamhm182 commented 5 months ago

I have a solution that might not work well for all, but it has been working well for me. Hopefully this PR will get merged soon enough to fix it properly.

I have been using the following:

  system = {
    stateVersion = "23.11";
    build.qcow2 = import "${toString modulesPath}/../lib/make-disk-image.nix" {
      inherit lib config pkgs;
      diskSize = "auto";
      additionalSpace = "20G";
      fsType = "ext4";
      format = "qcow2";
      partitionTableType = "hybrid";
    };
  };

Setting the additionalSpace to 20G seems to make it happy, and now the only time I'm getting failures is when my ssd is full. The "downside" to this approach is that the disk will be allowed to grow up to 20GB bigger than it needs to be. On the other hand, I end up using nix shell a lot, so having the extra breathing room for /nix isn't a bad thing, IMO.
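One plausible reason the extra space helps, which the thread does not confirm: if the filesystem is created with the mke2fs default of one inode per 16384 bytes, a larger image also gets proportionally more inodes. A rough sketch of that arithmetic, assuming "20G" is interpreted as 20 GiB; the constant and function names are illustrative.

```python
BYTES_PER_INODE = 16384  # assumed: default inode_ratio in stock mke2fs.conf
GIB = 2**30

def extra_inodes(additional_bytes, bytes_per_inode=BYTES_PER_INODE):
    """Estimate how many additional inodes a larger filesystem receives."""
    return additional_bytes // bytes_per_inode

print(extra_inodes(20 * GIB))
```

Under these assumptions the additional 20 GiB brings on the order of 1.3 million extra inodes, far more than the roughly 240 thousand available in the failing 3.5 GB image.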

pitkling commented 5 months ago

@bamhm182 It's a good workaround, especially since the disk image grows only if necessary. 🙂

I've been using PR #295874 for quite some time now and it works fine for me. Still, as I mentioned in the comments of the PR, I'm not sure whether I should add checks for filesystems other than ext4 (I don't know which other filesystems support setting the inode count) and whether the inode count should take the ext4 default into account (which seems sensible if images are not read-only).

I'm happy to improve the PR if someone more experienced with how make-disk-image.nix is used finds the time to answer my questions in the PR's comments.