DeiC-HPC / cotainr

cotainr - a user space Apptainer/Singularity container builder.
European Union Public License 1.2

Nvidia container fails to build #35

Closed rloewe closed 1 year ago

rloewe commented 1 year ago

Running the following command fails with an error:

cotainr build --base-image docker://nvcr.io/nvidia/pytorch:23.02-py3 --conda-env accelerate.yml accelerate.sif

The conda environment file accelerate.yml contains:

name: accelerate
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - accelerate=0.18.0=pyhd8ed1ab_0
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2022.12.7=ha878542_0
  - filelock=3.12.0=pyhd8ed1ab_0
  - gmp=6.2.1=h58526e2_0
  - gmpy2=2.1.2=py311h6a5fa03_1
  - icu=72.1=hcb278e6_0
  - jinja2=3.1.2=pyhd8ed1ab_1
  - ld_impl_linux-64=2.40=h41732ed_0
  - libblas=3.9.0=16_linux64_openblas
  - libcblas=3.9.0=16_linux64_openblas
  - libexpat=2.5.0=hcb278e6_1
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.2.0=h65d4601_19
  - libgfortran-ng=12.2.0=h69a702a_19
  - libgfortran5=12.2.0=h337968e_19
  - libhwloc=2.9.1=hd6dc26d_0
  - libiconv=1.17=h166bdaf_0
  - liblapack=3.9.0=16_linux64_openblas
  - libnsl=2.0.0=h7f98852_0
  - libopenblas=0.3.21=pthreads_h78a6416_3
  - libprotobuf=3.21.12=h3eb15da_0
  - libsqlite=3.40.0=h753d276_0
  - libstdcxx-ng=12.2.0=h46fd767_19
  - libuuid=2.38.1=h0b41bf4_0
  - libxml2=2.10.4=hfdac1af_0
  - libzlib=1.2.13=h166bdaf_4
  - llvm-openmp=16.0.1=h417c0b6_0
  - markupsafe=2.1.2=py311h2582759_0
  - mkl=2022.2.1=h84fe81f_16997
  - mpc=1.3.1=hfe3b2da_0
  - mpfr=4.2.0=hb012696_0
  - mpmath=1.3.0=pyhd8ed1ab_0
  - ncurses=6.3=h27087fc_1
  - networkx=3.1=pyhd8ed1ab_0
  - numpy=1.24.2=py311h8e6699e_0
  - openssl=3.1.0=h0b41bf4_0
  - packaging=23.1=pyhd8ed1ab_0
  - pip=23.1=pyhd8ed1ab_0
  - psutil=5.9.5=py311h2582759_0
  - python=3.11.3=h2755cc3_0_cpython
  - python_abi=3.11=3_cp311
  - pytorch=2.0.0=cpu_py311h410fd25_0
  - pyyaml=6.0=py311hd4cff14_5
  - readline=8.2=h8228510_1
  - setuptools=67.6.1=pyhd8ed1ab_0
  - sleef=3.5.1=h9b69904_2
  - sympy=1.11.1=pypyh9d50eac_103
  - tbb=2021.9.0=hf52228f_0
  - tk=8.6.12=h27826a3_0
  - typing_extensions=4.5.0=pyha770c72_0
  - tzdata=2023c=h71feb2d_0
  - wheel=0.40.0=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - yaml=0.2.5=h7f98852_2
  - zstd=1.5.2=h3eb15da_6
prefix: /home/its.aau.dk/we12ec/miniconda3/envs/accelerate

Chroxvi commented 1 year ago

The error message is:

WARNING: 'nodev' mount option set on /tmp, it could be a source of failure during build process
INFO: Starting build...
INFO: Verifying bootstrap image pytorch_23.02-py3.sif
WARNING: integrity: signature not found for object group 1
WARNING: Bootstrap image could not be verified, but build will continue.
ERROR: unpackSIF failed: root filesystem extraction failed: extract command failed: WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount proc [kernel]: /proc doesn't exist in container
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
: signal: killed
FATAL: While performing build: packer failed to pack: root filesystem extraction failed: extract command failed: WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount proc [kernel]: /proc doesn't exist in container
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
: signal: killed
Traceback (most recent call last):
File "/home/other-repo/cotainr/bin/cotainr", line 14, in <module>
sys.exit(main())
File "/home/other-repo/cotainr/cotainr/cli.py", line 390, in main
cli.subcommand.execute()
File "/home/other-repo/cotainr/cotainr/cli.py", line 141, in execute
with container.SingularitySandbox(base_image=self.base_image) as sandbox:
File "/home/other-repo/cotainr/cotainr/container.py", line 73, in __enter__
self._subprocess_runner(
File "/home/other-repo/cotainr/cotainr/container.py", line 225, in _subprocess_runner
return util.stream_subprocess(args=args, **kwargs)
File "/home/other-repo/cotainr/cotainr/util.py", line 113, in stream_subprocess
completed_process.check_returncode()
File "/home//miniconda3/lib/python3.9/subprocess.py", line 460, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '['singularity', 'build', '--force', '--sandbox', PosixPath('/tmp/tmphxw8alnl/singularity_sandbox'), 'pytorch_23.02-py3.sif']' returned non-zero exit status 255.
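
The last frames of the traceback show that the Python exception only reports the exit status of the `singularity build` child process; the real failure is in the Singularity output above it. As a minimal sketch of that mechanism (not cotainr's actual `stream_subprocess` code; the helper name `run_and_check` and the stand-in failing command are assumptions made for illustration), `check_returncode()` from the Python standard library raises `CalledProcessError` whenever the child exits with a non-zero status:

```python
import subprocess
import sys


def run_and_check(args):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    completed = subprocess.run(args, capture_output=True, text=True)
    # check_returncode() raises CalledProcessError when returncode != 0.
    completed.check_returncode()
    return completed


# Hypothetical stand-in for the failing `singularity build` call:
# a child Python process that simply exits with status 255.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(255)"]

exit_status = None
try:
    run_and_check(failing_cmd)
except subprocess.CalledProcessError as err:
    exit_status = err.returncode
    print(f"Command {err.cmd} returned non-zero exit status {exit_status}.")
```

The exception carries the command and its exit status but not the reason for the failure, so the cause has to be read from the Singularity log lines.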

Chroxvi commented 1 year ago

I am not able to reproduce this problem. I have tried on LUMI using cotainr/2023.01.1 and singularity-ce/3.11.1 and on my laptop using cotainr/main (https://github.com/DeiC-HPC/cotainr/commit/4c81aa53a8760c184a925038b34fe0be18ce4277). In both cases the container builds without problems.

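Looking at the error message, I notice the two Singularity errors (the "ERROR: unpackSIF failed" line and the "FATAL: While performing build: packer failed to pack" line).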

These look like problems in older versions of Singularity, e.g. https://github.com/apptainer/singularity/issues/5666 or https://github.com/apptainer/singularity/issues/5690.

@ThomasA Are you still able to reproduce this problem? If so, what versions of cotainr and apptainer/singularity are you running?
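
One way to collect the Singularity/Apptainer version asked about above is a small script like the following. This is only a sketch: the helper name `tool_version` is made up for illustration, it assumes the executables are on `PATH` and accept the `--version` flag, and it does not cover the cotainr version.

```python
import shutil
import subprocess


def tool_version(executable):
    """Return the output of `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        # The executable is not on PATH.
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()


# Report whichever container runtime is installed on this system.
for name in ("singularity", "apptainer"):
    print(f"{name}: {tool_version(name) or 'not found on PATH'}")
```

Pasting the printed lines into the issue makes it easier to compare against the versions on which the build is known to succeed.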

ThomasA commented 1 year ago

I was trying with Cotainr 2023.01.0. I am checking now if I can still reproduce it. Afterwards I will try 2023.02.0. I suspect that the base image I am trying somehow does not support what Cotainr/Singularity is trying to do with it?

ThomasA commented 1 year ago

I can actually build the container now with 2023.01.0. I cannot rule out entirely that I may have been using an earlier version of Cotainr initially. In any case, it does not seem to be a problem now.

Chroxvi commented 1 year ago

Good to hear that it works for you now! I will close this issue.