apptainer / singularity

Singularity has been renamed to Apptainer as part of the project's move to the Linux Foundation. This repo is preserved as a snapshot taken just before the rename.
https://github.com/apptainer/apptainer

Building a SIF File From a Docker Image Is Incredibly Slow #6055

Closed (adeandrade closed this issue 2 years ago)

adeandrade commented 3 years ago

Version of Singularity:

3.7.3

Expected behavior

Building a SIF file from a Docker image should take less time.

Actual behavior

It takes around 4 hours to build an image with 2 CPUs and 6 GB of RAM. The image is large (6 GB). The following warning is raised:

WARNING: 'nodev' mount option set on /scratch, it could be a source of failure during build process

Steps to reproduce this behavior

Run:

singularity \
  run \
  --containall \
  --cleanenv \
  --nv \
  --bind "${PROJECT_DIR}":/mnt \
  --workdir "${{SLURM_TMPDIR}}" \
  --home "${{SLURM_TMPDIR}}" \
  "docker://registry.hub.docker.com/adeandrade/research:mutual-information_training"

What OS/distro are you running

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

How did you install Singularity

Provided by a Slurm cluster via Lmod.

frankier commented 3 years ago

See also: https://github.com/hpcng/singularity/issues/5861

frankier commented 3 years ago

In my experience, most of the time is usually taken by mksquashfs. You may also be getting rate-limited by Docker Hub; you could try switching to the GitHub or GitLab container registry, as sketched below.
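
For example, assuming the same image has also been pushed to the GitHub container registry (the ghcr.io path and output filename here are illustrative, not from this thread), the pull would look like:

  singularity pull research.sif \
    docker://ghcr.io/adeandrade/research:mutual-information_training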

I have got mksquashfs to run a bit faster by downloading and building a newer version and making a wrapper script that sets the amount of memory and the number of CPUs. mksquashfs's autodetection of memory and CPUs is particularly bad if you are running inside a SLURM allocation, since it will try to use the whole machine rather than the allocation and can e.g. end up swapping. See https://github.com/frankier/csc-tricks/#and-fixing-mksquashfs-too and the sketch below.
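
A minimal sketch of such a wrapper, assuming a newer squashfs-tools has been built under $SCRATCH/squashfs-tools (the path and the limits are illustrative, not taken from the linked repo):

  #!/bin/bash
  # Forward all arguments to a locally built mksquashfs, but cap
  # memory and CPU count to fit the SLURM allocation instead of
  # letting mksquashfs autodetect the whole machine.
  exec "$SCRATCH/squashfs-tools/squashfs-tools/mksquashfs" "$@" \
    -mem 1G \
    -processors "${SLURM_CPUS_PER_TASK:-4}"

Put this script on your PATH ahead of the system mksquashfs so Singularity picks it up.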

Just in case you are rebuilding your container to test every change: you can usually avoid this with bind mounts (see the example below): https://frankie.robertson.name/research/effective-cluster-computing/#use-binds
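
For instance, rather than baking your code into the image on every change, you can bind-mount the working tree into a container you built once (the image name and script here are hypothetical):

  singularity exec \
    --bind "${PROJECT_DIR}":/mnt \
    research.sif \
    python /mnt/train.py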

ifelsefi commented 3 years ago

Experiencing this problem on an all-flash file system; local /tmp is 3x faster.

mksquashfs isn't the bottleneck for us, since it uses all CPUs on the node. The step before it, which seems to involve hashing, runs at about 20% of a single CPU core.

frankier commented 3 years ago

The step before is extracting the OCI layers, I think. You should absolutely set SINGULARITY_TMPDIR to fast local scratch storage if you would otherwise be using a network filesystem. The extraction creates many small files, which is very slow on typical HPC network file systems; a sketch follows.
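
For example, in an sbatch script you might redirect both the temporary and cache directories to node-local scratch before pulling (the directory layout and output filename are illustrative):

  export SINGULARITY_TMPDIR="${SLURM_TMPDIR}"
  export SINGULARITY_CACHEDIR="${SLURM_TMPDIR}/singularity-cache"
  mkdir -p "${SINGULARITY_CACHEDIR}"
  singularity pull research.sif \
    "docker://registry.hub.docker.com/adeandrade/research:mutual-information_training"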

kmuriki commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in over 60 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

fxmarty commented 1 year ago

@frankier mksquashfs still uses only one CPU core for me on a LUMI compute node, despite pointing a custom wrapper script at the latest squashfs-tools (exec $SCRATCH/squashfs-tools/squashfs-tools/mksquashfs "$@" -mem 1G -processors 16) and putting it on my PATH. Am I doing something wrong?

frankier commented 1 year ago

It has worked for me, but as ever with something a bit hacky like this, it could break. You need to make sure your wrapper script is found on the PATH before the real mksquashfs. You can add e.g. echo 'using wrapper script' to the beginning of your wrapper script to verify what's happening. Another thing to try is echoing $PATH inside the sbatch script just before calling singularity pull. A more or less foolproof way is to set the PATH on the same line as the singularity pull call, as below.
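
Something like the following, assuming the wrapper lives in $HOME/bin (the directory is illustrative):

  PATH="$HOME/bin:$PATH" singularity pull research.sif \
    "docker://registry.hub.docker.com/adeandrade/research:mutual-information_training"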

CSC is now using Apptainer, so I'm not sure whether there have been major architectural changes since the fork. I'd be interested to hear whether you can get it working. Perhaps this issue would also gain some traction on Apptainer, so the functionality could be added directly (assuming it hasn't been already).

niniack commented 10 months ago

@ifelsefi I'm experiencing the same bottleneck; the "storing signatures" step is the culprit. Did you ever find a workaround to speed it up?

DrDaveD commented 10 months ago

This repository is closed. If you'd like a development team member to be involved, please run singularity --version: if it says singularity-ce, submit a new issue to https://github.com/sylabs/singularity; otherwise, submit a new issue to https://github.com/apptainer/apptainer.