
Beat singularity-tools up to shape #177908

Open SomeoneSerge opened 2 years ago

SomeoneSerge commented 2 years ago

Issue description

I intend to start using nixpkgs' singularity-tools for HPC applications. What follows is a list of hindrances and minor annoyances that I've immediately encountered. The list is mostly for myself: I'm opening the issue to make them visible and maybe motivate people to voice ideas and comments. Cf. this read on singularity with Nix for more inspiration.

CC (possibly interested) @ShamrockLee @jbedo

ShamrockLee commented 2 years ago

I would probably have more spare time after June 21.

To speed up the merging of #158486, I would prefer limiting the scope of this PR to:

  1. supporting multiple sources and non-vendored source builds, while
  2. not breaking the existing functionality of singularity, singularity.nix, and singularity-tools.

Further improvements can be done in successive PRs.

jbedo commented 2 years ago

I agree with limiting the scope of the PR, I'll have time to help in a couple of weeks.

ShamrockLee commented 2 years ago

[ ] Annoyance: we can compute diskSize from the built contents instead of choosing an arbitrary constant

Is there a way to compute diskSize from contents at eval time with no IFD?

ShamrockLee commented 2 years ago

Regarding singularity-tools, a significant problem is the closure size being unnecessarily doubled by mkLayer.

singularity-tools.mkLayer generates a new derivation by copying all the files and directories of each package into "$out", and then we use writeReferencesToFile to get the list of derivations in the dependency tree of the generated layer package.

Why don't we get a list of references of all the packages directly?

Here's my implementation, which merges the writeReferencesToFile results of all the packages in the list while removing duplicates. There should be a better implementation than the O(n^2) duplicate removal, but it's still much faster than the O(n) content copying of mkLayer.

{
  # Concatenate the writeReferencesToFile output of every path,
  # skipping store paths that have already been emitted.
  writeMultipleReferencesToFile = paths: runCommand "runtime-deps-multiple" {
    referencesFiles = map writeReferencesToFile paths;
  } ''
    touch "$out"
    declare -a paths=();
    for refFile in $referencesFiles; do
      while read path; do
        # Linear scan over the already-seen paths: this is the O(n^2) part.
        isPathIncluded=0
        for pathIncluded in "''${paths[@]}"; do
          if [[ "$path" == "$pathIncluded" ]]; then
            isPathIncluded=1
            break
          fi
        done
        if (( ! isPathIncluded )); then
          echo "$path" >> "$out"
          paths+=( "$path" )
        fi
      done < "$refFile"
    done
  '';
}
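
For comparison, a shorter sketch of the same idea (my suggestion, not code from the thread) delegates the deduplication to sort -u, at the cost of losing the original ordering of the store paths:

{
  # Same assumptions as above: runCommand and writeReferencesToFile in scope.
  # sort -u deduplicates in O(n log n), but does not preserve insertion order.
  writeMultipleReferencesToFile = paths: runCommand "runtime-deps-multiple" {
    referencesFiles = map writeReferencesToFile paths;
  } ''
    sort -u $referencesFiles > "$out"
  '';
}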
SomeoneSerge commented 2 years ago

Is there a way to compute diskSize from contents at eval time with no IFD?

I cannot say what's a good way to compute it, but the trivial baseline is a derivation that takes, in buildInputs, a buildEnv with the to-be image's contents, and runs du on it. The output times some constant is an upper bound on the diskSize.

EDIT: i.e. we wouldn't know diskSize at nix eval time, but we'd know it at build time, which appears to be sufficient
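
A minimal sketch of that baseline (my illustration, not code from the thread; I use closureInfo instead of a buildEnv so that the whole closure is measured, and the factor of two is an arbitrary safety margin):

{ pkgs, contents }:
pkgs.runCommand "image-disk-size" {
  closureInfo = pkgs.closureInfo { rootPaths = contents; };
} ''
  # Sum the sizes (in KiB) of every store path in the closure of the contents,
  # then pad by a factor of two as a crude upper bound for diskSize.
  size=$(du -sck $(cat $closureInfo/store-paths) | tail -n1 | cut -f1)
  echo $(( size * 2 )) > $out
''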

ShamrockLee commented 2 years ago

EDIT: i.e. we wouldn't know diskSize at nix eval time, but we'd know it at build time, which appears to be sufficient

Then we would no longer be able to use

vmTools.runInLinuxVM (
  # runCommand requires a name argument; "image-builder" here is illustrative.
  runCommand "image-builder" {
    preVM = vmTools.createEmptyImage {
      size = diskSize;
      fullName = "${projectName}-run-disk";
    };
  } ''
    mkdir disk
    mkfs -t ext3 -b 4096 /dev/${vmTools.hd}
    mount /dev/${vmTools.hd} disk
  ''
)
SomeoneSerge commented 2 years ago

I see now. It appears that createEmptyImage never uses size at eval time, so we could rewrite it to relax the constraint https://github.com/NixOS/nixpkgs/blob/4b31cc7551cbc795e30670d09845acdeb0f41651/pkgs/build-support/vm/default.nix#L280
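
To illustrate the idea (a sketch under my own assumptions, not the actual Nixpkgs code): since size is only ever spliced into the builder script, createEmptyImage could just as well accept a file produced by another derivation and read it at build time:

{
  # Hypothetical variant: sizeFile is a store path containing the size in MiB,
  # computed by another derivation, so nothing needs to be known at eval time.
  createEmptyImage = { sizeFile, fullName }: ''
    mkdir $out
    diskImage=$out/disk-image.qcow2
    ${qemu}/bin/qemu-img create -f qcow2 $diskImage "$(cat ${sizeFile})M"
    echo "${fullName}" > $out/disk-name
  '';
}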

ShamrockLee commented 2 years ago

I see now. It appears that createEmptyImage never uses size at eval time, so we could rewrite it to relax the constraint https://github.com/NixOS/nixpkgs/blob/4b31cc7551cbc795e30670d09845acdeb0f41651/pkgs/build-support/vm/default.nix#L280

Great! dockerTools would also benefit from that.

dmadisetti commented 1 year ago

For consistency with dockerTools.buildImage it would also be nice to change contents -> copyToRoot.

@SomeoneSerge any other HPC pain points?

ShamrockLee commented 1 year ago

~I don't mind changing the interface of singularity-tools. (That would be a breaking change.)~ Sorry for not noticing the change to dockerTools.buildImage.

There's another change lining up that builds the image through a Singularity definition (Apptainer recipe) file, making the image more declarative and the build process explainable. It could be a drop-in replacement for the current Singularity-sandbox-based implementation.

I also went ahead and made a generator function that turns a settings-like Nix attrset into a definition string; a sketch of the flavor follows below. The parser, which does the reverse, is still a work in progress.
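
For flavor, a rough sketch of what such a generator might look like (all names and supported keys here are invented for illustration; this is not ShamrockLee's actual implementation):

{ lib }:
# Turn an attrset like { bootstrap = "docker"; from = "alpine"; post = "..."; }
# into an Apptainer/Singularity definition string.
definition:
lib.concatStringsSep "\n" (
  [
    "Bootstrap: ${definition.bootstrap}"
    "From: ${definition.from}"
  ]
  ++ lib.optionals (definition ? post) [ "%post" definition.post ]
  ++ lib.optionals (definition ? runscript) [ "%runscript" definition.runscript ]
)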

SomeoneSerge commented 1 year ago

I'm sorry for the long absence, my priorities had shifted somewhat

@dmadisetti On the high-level I've exactly one pain-point, and that is an unsolved (underinvested) use-case:

  • [ ] I want to use singularity to bind-mount /nix/store on a cluster that doesn't support user namespaces nor overlayfs, but has a setuid singularity binary
  • [ ] I want to ship a pre-built Nix in a singularity image
  • [ ] I want to be able to build that image using Nix, e.g. via singularity-tools.buildImage

I think I might give this a shot again. The issues I had were:

  • [ ] As I said, cluster's singularity installation doesn't come with --overlay enabled, so I have to use --bind
  • [ ] Using --bind /tmp/blah:/nix/store hides the container's /nix/store -> singularity run fails unable to locate the sym-linked sh and such.
  • [ ] Because singularity-tools.buildImage doesn't give user full control over contents, I cannot easily replace the whole thing with static coreutils and a static Nix

Shouldn't be hard to alleviate

SomeoneSerge commented 1 year ago

This suggests another point: maybe we want a buildImage that is extendable and overridable, including the possibility to override the default contents. One direction could be makeOverridable, and I think there's a similar effort being undertaken for dockerTools: https://github.com/NixOS/nixpkgs/pull/208944

Another possibility is the module system, with support for mkMerge/mkForce etc., similar to NixOS. This could also be a viable approach to re-implementing the upstream's "definition" files in pure Nix, so as to achieve a declarative interface to buildImage.
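
A minimal sketch of the module-system direction (the option names are mine and purely illustrative; assumes pkgs from Nixpkgs in scope):

let
  inherit (pkgs) lib;
  # Two "user" modules both set contents; mkForce decides the winner,
  # just as it would in a NixOS configuration.
  eval = lib.evalModules {
    modules = [
      {
        options.contents = lib.mkOption {
          type = lib.types.listOf lib.types.package;
          default = [ ];
        };
      }
      { contents = [ pkgs.hello ]; }
      { contents = lib.mkForce [ pkgs.coreutils ]; }
    ];
  };
in
eval.config.contents  # evaluates to [ pkgs.coreutils ]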

@ShamrockLee, settings approach sounds great, I think this should feel very native in Nixpkgs. Has your work gone into any PRs yet?

ShamrockLee commented 1 year ago

@ShamrockLee, settings approach sounds great, I think this should feel very native in Nixpkgs. Has your work gone into any PRs yet?

Not yet, but I already have the implementation integrated into my HEP analysis workflow.

It's time to also re-think the buildImage interface IMO.

SomeoneSerge commented 1 year ago

@ShamrockLee are you on any of the nixos matrix channels, btw?

dmadisetti commented 1 year ago

Hopefully not adding to the noise. My current workflow is making a Docker tar with Nix, unpacking it, and turning it into a Singularity image. A bit of a hack, but it works?

packages.docker = pkgs.dockerTools.buildNixShellImage {
  name = "pre-sif-container";
  tag = "latest";
  drv = devShells.default;
};
packages.singularity = pkgs.stdenv.mkDerivation {
  name = "container.sif";
  src = ./.;
  # Assumption: singularity has to be on PATH for the build step below.
  nativeBuildInputs = [ pkgs.singularity ];
  installPhase = ''
    mkdir unpack
    tar xzvf ${packages.docker}/image.tgz -C unpack
    # Singularity can't handle .gz
    tar -C unpack/ -cvf layer.tar .
    # TODO: Allow for module of user defined nightly, opposed to using src
    singularity build $out Singularity.nightly
  '';
};

Singularity.nightly containing

Bootstrap: docker-archive
From: layer.tar
....

Big fan of using the Singularity file to define hooks etc..

ShamrockLee commented 1 year ago

@ShamrockLee are you on any of the nixos matrix channels, btw?

I don't have experience using Matrix yet. How do I join one?

SomeoneSerge commented 1 year ago

There's a bunch of options, including a web client. I think you can just follow either of the links

SomeoneSerge commented 1 year ago

By the way, I was meaning to ask, why do we have to runInLinuxVM? I remember seeing @jbedo mention this allowed setting setuid flags, but I'm not sure where we need them. I presume QEMU takes its toll on performance.

It's time to also re-think the buildImage interface IMO @ShamrockLee

Oh, I'll just throw some bait in. Have you noticed https://discourse.nixos.org/t/working-group-member-search-module-system-for-packages/26574/8 and https://github.com/DavHau/drv-parts in particular?

My current workflow is making a Docker tar with Nix @dmadisetti

I guess your post further proves there's a use-case :)

ShamrockLee commented 1 year ago

By the way, I was meaning to ask, why do we have to runInLinuxVM?

It was not until last year that the unprivileged image-building workflow started to be implemented in the Apptainer project. The program used to assert UID == 0 when building the image.

We are close to unprivileged image generation with Apptainer. The remaining obstacle is its use of /var/apptainer/mnt/session as the container mount point.

See https://github.com/apptainer/apptainer/issues/215

Sylabs's Singularity fork seems to have made some progress on unprivileged image builds, but it still expects a bunch of top-level directories, /var/singularity/mnt/{container,final,overlay,session,source}, IIRC.

SomeoneSerge commented 1 year ago

It was not until last year that the unprivileged image-building workflow started to be implemented in the Apptainer project. The program used to assert UID == 0 when building the image.

I see. So, in principle, we could have run everything except ${projectName} build $out ./img outside QEMU?

ShamrockLee commented 1 year ago

I see. So, in principle, we could have run everything except ${projectName} build $out ./img outside QEMU?

It's true when it comes to the definition-based build. It won't help much, though, since generating the definition file from the definition attrset should be trivial in terms of resources.

As for the current, Apptainer-sandbox-based buildImage, I'm not sure if we could run the unshare ... lines for runAsRoot outside QEMU. (Update: currently, runAsRootScript uses a mount --rbind-ed /nix/store, and it simply cannot run without some kind of emulation.)

SomeoneSerge commented 1 year ago

It won't help much

I was rather wondering if we could prepare the file-tree outside qemu and somehow pack the whole batch into an ext3/squashfs image without the mount. But then again, I didn't measure, maybe that too is insignificant

posch commented 1 year ago

It won't help much

I was rather wondering if we could prepare the file-tree outside qemu and somehow pack the whole batch into an ext3/squashfs image without the mount. But then again, I didn't measure, maybe that too is insignificant

I also prefer an approach that doesn't involve creating and running virtual machines. singularity/apptainer can run filesystems in squashfs, and I use this script to create containers:

{ pkgs
, contents
, runscript ? "#!/bin/sh\nexec ${pkgs.hello}/bin/hello"
, startscript ? "#!/bin/sh\nexec ${pkgs.hello}/bin/hello"
}:
pkgs.runCommand "make-container" {} ''
  set -o pipefail
  closureInfo=${pkgs.closureInfo { rootPaths = contents ++ [ pkgs.bashInteractive ]; }}
  mkdir -p $out/r/{bin,etc,dev,proc,sys,usr,.singularity.d/{actions,env,libs}}
  cd $out/r
  # Copy the entire runtime closure of the contents into the image root.
  cp -na --parents $(cat $closureInfo/store-paths) .
  touch etc/{passwd,group}
  ln -s /bin usr/
  ln -s ${pkgs.bashInteractive}/bin/bash bin/sh
  # Expose each package's executables under /bin.
  for p in ${pkgs.lib.concatStringsSep " " contents}; do
    ln -sn $p/bin/* bin/ || true
  done
  echo "${runscript}" >.singularity.d/runscript
  echo "${startscript}" >.singularity.d/startscript
  chmod +x .singularity.d/{runscript,startscript}
  cd $out
  ${pkgs.squashfsTools}/bin/mksquashfs r container.sqfs -no-hardlinks -all-root
''
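
As a usage note (my addition): since Singularity/Apptainer can execute squashfs images directly, the result needs no further conversion, e.g.:

apptainer run ./result/container.sqfs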
ShamrockLee commented 1 year ago

FYI: With https://github.com/apptainer/apptainer/pull/1284, Apptainer images can be built as a derivation without a VM.

The code already works (tested with singularity-tools.buildImageFromDef from #224636 specifying buildImageFlags = [ "--resolv ${pkgs.emptyFile}" "--hosts ${pkgs.emptyFile}" ];).

The upstream maintainer expects something more general (such as --no-mount), so the current change is not likely to be accepted. Nevertheless, it proves that fully-unprivileged Apptainer image builds are possible.

jbedo commented 1 year ago

I'm sorry for the long absence, my priorities had shifted somewhat

@dmadisetti On the high-level I've exactly one pain-point, and that is an unsolved (underinvested) use-case:

  • [ ] I want to use singularity to bind-mount /nix/store on a cluster that doesn't support user namespaces nor overlayfs, but has a setuid singularity binary
  • [ ] I want to ship a pre-built Nix in a singularity image
  • [ ] I want to be able to build that image using Nix, e.g. via singularity-tools.buildImage

I think I might give this a shot again. The issues I had were:

  • [ ] As I said, cluster's singularity installation doesn't come with --overlay enabled, so I have to use --bind
  • [ ] Using --bind /tmp/blah:/nix/store hides the container's /nix/store -> singularity run fails unable to locate the sym-linked sh and such.
  • [ ] Because singularity-tools.buildImage doesn't give user full control over contents, I cannot easily replace the whole thing with static coreutils and a static Nix

Shouldn't be hard to alleviate

It's a bit hacky but I think this achieves your goals:

singularity-tools.buildImage {
  name = "minimal-nix";
  runAsRoot = "${rsync}/bin/rsync -a ${pkgsStatic.nix}/ ./";
}
ShamrockLee commented 1 year ago

@ShamrockLee are you on any of the nixos matrix channels, btw?

@SomeoneSerge I finally got a Matrix account (@shamrocklee:matrix.org) and joined the Nix HPC room, thanks to the Summer of Nix.

PhDyellow commented 7 months ago

I managed to get a CUDA-capable container built by adjusting memSize along with diskSize.

Running it with env vars isn't solved yet.
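
For reference, a hedged sketch of what that looks like (diskSize and memSize are existing singularity-tools.buildImage parameters; the values and contents here are illustrative):

singularity-tools.buildImage {
  name = "cuda-image";
  contents = [ pkgs.cudatoolkit ];  # illustrative payload
  diskSize = 20480;  # MiB of disk for the build VM
  memSize = 4096;    # MiB of RAM for the build VM
}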

posch commented 1 month ago

Apptainer has merged a PR that makes it possible to use apptainer to build containers in the Nix sandbox: https://github.com/apptainer/apptainer/pull/2394

With that change, it's possible to build containers with

$ nix-build

default.nix:

{ pkgs ? import <nixpkgs> {} }:

pkgs.callPackage ./make-container.nix {
  inherit pkgs;
  contents = with pkgs; [
    busybox
    nginx
  ];
}

make-container.nix:

{ apptainer ? pkgs.apptainer, contents, pkgs }:
pkgs.runCommand "make-container" {} ''
  closureInfo=${pkgs.closureInfo { rootPaths = contents ++ [ pkgs.bashInteractive ]; }}
  set -x
  mkdir -p $out/r/{bin,etc,dev,proc,sys,usr,var/log}
  cd $out/r
  cp -na --parents $(cat $closureInfo/store-paths) .
  touch etc/{passwd,group,resolv.conf}
  ln -s /bin usr/
  ln -s ${pkgs.bashInteractive}/bin/bash bin/sh
  for p in ${pkgs.lib.concatStringsSep " " contents}; do
    ln -sn $p/bin/* bin/ || true
  done
  touch $out/apptainer.conf $out/resolv.conf
  export HOME=$out
  find . -ls
  ${apptainer}/bin/apptainer --config $out/apptainer.conf --debug --verbose build -B $out/resolv.conf:/etc/resolv.conf --disable-cache --fakeroot $out/container.sif $out/r
''

This copies the closure of $contents to $out/r, links all bin/* to /bin/, creates dummy apptainer.conf and resolv.conf files, and finally runs apptainer build.

ShamrockLee commented 1 month ago

Let's land #268199 by splitting it into smaller PRs. We could then add the unprivileged Apptainer image build flow as one of its reusable components.

Here's the first one: #332168

ShamrockLee commented 1 month ago

Let's land #268199 by splitting it into smaller PRs. We could then add the unprivileged Apptainer image build flow as one of its reusable components.

#332437 is the second one, containing various fixes and a few deprecations.

SomeoneSerge commented 1 month ago

By the way, maybe we should consider dropping support for choosing between apptainer and singularity for building images. For one thing, I suspect we'll have to introduce a separate attribute (like _apptainer-derandomized or _siftool-derandomized; https://github.com/NixOS/nixpkgs/issues/279250) for a tool patched to leave out all the UUIDs and the timestamps, and it's probably not worth it to maintain patches for both forks...

pbsds commented 1 month ago

If the images built by one can be run by the other, and are expected to stay that way going forward, then I don't see a problem with that.

ShamrockLee commented 1 month ago

For one thing, I suspect we'll have to introduce a separate attribute (like _apptainer-derandomized or _siftool-derandomized; #279250) for a tool patched to leave out all the UUIDs and the timestamps, and it's probably not worth it to maintain patches for both forks...

How does patching Apptainer and SingularityCE (the apptainer and singularity part) make it difficult to choose between Apptainer and SingularityCE for building images (the singularity-tools part)? We could define apptainer and singularity separately if their build flows differ too much, while maintaining a single singularity-tools for the command-line interface they share.

SomeoneSerge commented 1 month ago

How does patching Apptainer and SingularityCE (the apptainer and singularity part) make it difficult to choose between Apptainer and SingularityCE for building images (the singularity-tools part)?

It doesn't; it's just: why would we patch them both separately, if we only really need the patches for singularity-tools, not for the user-facing singularity?

If the images built by one can be run by the other, and are expected to stay that way going forward, then I don't see a problem with that.

We could even package siftool separately, and that could be enough...

ShamrockLee commented 1 month ago

The development would be a lot easier if the reproducible image build functionality could be implemented upstream.

We could even package siftool separately, and that could be enough...

I seem to have lost track of this. What is siftool?