DeiC-HPC / cotainr

cotainr - a user space Apptainer/Singularity container builder.
European Union Public License 1.2

Cotainr fails to build container when /tmp runs out of space #51

Open Chroxvi opened 10 months ago

Chroxvi commented 10 months ago

If you try to build a container using cotainr on a system that does not have sufficient space on /tmp to store the entire container, e.g. with

cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g --conda-env py311_rocm542_pytorch.yml

you will encounter an error like:

...
INFO:    Creating sandbox directory...
INFO:    Build complete: /tmp/tmpobd2znt4/singularity_sandbox
Pip subprocess error:
ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device
...

Cotainr does not provide a CLI option, environment variable, or similar mechanism for moving the temporary sandbox directory, created during the build phase, to a location other than /tmp. This might be a problem if:

Chroxvi commented 10 months ago

If this happens on a login node on the LUMI supercomputer, you can work around it by building your container on another node, either another login node or a compute node. To use a LUMI-C compute node, submit a job using srun, e.g. something along the lines of:

srun --account=project_<your_project_id> --time=00:15:00 --mem=64G --cpus-per-task=32 --partition=small cotainr build lumi_pytorch_rocm_demo.sif --system=lumi-g --conda-env py311_rocm542_pytorch.yml

Note that you have to request enough memory, since /tmp is mounted as an in-memory filesystem on LUMI compute nodes. Also note that building on a compute node will consume CPUh/GPUh resources from your LUMI project.
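
Before starting a build on any node, it may help to check how much space /tmp has available. The following is a minimal shell sketch, assuming a POSIX-compatible df and awk. The free_kib helper and the 20 GiB threshold are hypothetical and not part of cotainr; the space actually needed depends on the container being built.

```shell
#!/bin/sh
# Hypothetical helper (not part of cotainr): print the free space, in KiB,
# of the filesystem that holds the given directory.
free_kib() {
    # -P forces the POSIX one-line-per-filesystem layout, -k reports
    # 1024-byte blocks; column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Assumed threshold of 20 GiB; adjust to the expected size of your container.
required_kib=$((20 * 1024 * 1024))

avail_kib=$(free_kib /tmp)
if [ "$avail_kib" -lt "$required_kib" ]; then
    echo "Only ${avail_kib} KiB free on /tmp; the build may fail with Errno 28." >&2
else
    echo "/tmp has ${avail_kib} KiB free."
fi
```

On a LUMI compute node, where /tmp is an in-memory filesystem, the reported figure depends on the memory requested for the job, so the check is only meaningful when run inside the same allocation as the build.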