crinavar closed this issue 2 years ago
Sorry about the confusion, you need to use ./cont.sqsh
in order to have pyxis interpret the argument as a squashfs file instead of an image to pull from a registry.
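To make the distinction concrete, here is a sketch of the two invocation forms (the CUDA image tag is illustrative, not from this thread):

```shell
# Argument looks like an image reference: pyxis pulls it from a registry
srun --container-image=nvcr.io#nvidia/cuda:11.8.0-devel-ubuntu22.04 nvcc --version

# Leading ./ makes pyxis treat the argument as a local squashfs file instead
srun --container-image=./cont.sqsh nvcc --version
```

These are command sketches that need a SLURM cluster with pyxis installed; they are not runnable standalone.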
Many thanks, that was the issue! Sorry for creating a post for this small problem. I was testing just now and wanted to ask: is it normal for it to take around 30 seconds to start the sqsh CUDA container (around 4.9 GB)?
➜ 03-GPU-single git:(main) ✗ time srun --container-image=./cont.sqsh nvcc
srun: job 15720 queued and waiting for resources
srun: job 15720 has been allocated resources
nvcc fatal : No input files specified; use option --help for more information
srun: error: nodeGPU01: task 0: Exited with exit code 1
srun --container-image=./cont.sqsh nvcc 0.01s user 0.02s system 0% cpu 37.178 total
Here, a group of researchers wants to launch different SLURM jobs while sharing the same container through a single sqsh file. The enroot config is the following:
ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
ENROOT_CACHE_PATH /home/enroot-cache/group-$(id -g)
ENROOT_DATA_PATH /home/enroot-data/user-$(id -u)
#ENROOT_TEMP_PATH /home/enroot-tmp/user-$(id -u)
ENROOT_TEMP_PATH /tmp
ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
ENROOT_MOUNT_HOME y
ENROOT_RESTRICT_DEV y
ENROOT_ROOTFS_WRITABLE y
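For context, a hedged sketch of how such a shared sqsh image would typically be produced (the output path and image tag are assumptions, not from this thread):

```shell
# Create the shared image once; every user's srun can then point at this file.
# -o sets the output path of the generated squashfs image.
enroot import -o /home/shared/cont.sqsh docker://nvcr.io#nvidia/cuda:11.8.0-devel-ubuntu22.04

# The ENROOT_SQUASH_OPTIONS above (-noI -noD -noF -noX) are passed to
# mksquashfs and disable compression of inodes, data blocks, fragments,
# and extended attributes: the .sqsh ends up larger, but both creating it
# and unpacking it at job start are faster.
```

This is a command sketch requiring enroot on the node; it is not runnable standalone.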
Would this be a config that favors speed ? Again many thanks
You probably want to have ENROOT_DATA_PATH point to a local filesystem; in one of our deployments we use:
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)
And /tmp is a tmpfs mount there, but that is only doable if you have sufficient RAM available on each node.
Also, since you are storing the image in the HOME folder, it will depend on how fast that filesystem is. If you have a parallel filesystem that you can use instead (e.g. Lustre, GPFS, ...), it might be a better choice.
Yes, indeed. We are using NFS and /home is physically located on the storage node, accessed through InfiniBand, so we might be saturating that. We will have to re-think some of the design, but indeed moving to /tmp would improve speed. Thanks for all the help and fast replies.
Note that /tmp might not be a tmpfs; depending on your system setup it might be part of the root partition, in which case it could be a slowish filesystem (e.g. RAID 1).
I see, I did a quick check and in fact /tmp is on an ext4 filesystem, not tmpfs. It is a DGX A100 node (we have just one).
➜ ~ df -T /tmp
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md0 ext4 1844244028 27449744 1723042068 2% /
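Since / is backed by /dev/md0 (a Linux software-RAID device), a quick way to confirm its RAID level and the filesystem behind /tmp is sketched below (output is system-dependent):

```shell
# /proc/mdstat lists each md array with its RAID level and member disks
cat /proc/mdstat

# findmnt shows the source device and filesystem type serving /tmp
findmnt -T /tmp -o SOURCE,FSTYPE,TARGET
```

If /proc/mdstat reports raid1, writes to /tmp go through the mirrored root array, consistent with the earlier note about RAID 1 being slowish.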
Hi community, we are currently having an issue when running sqsh format containers with SLURM srun. For testing, I was generating an example sqsh CUDA container from NGC. Then, when trying to use the container from the sqsh file, we get the following error. As we can see, it is not failing in the authentication phase, but in the "Fetching image manifest" part. I have already tried, with no success, the following changes:
the --no-container-entrypoint option
chmod of the cont.sqsh file to 777
The .credentials file is configured with the NGC part only, and doing srun with containers from NGC works without problem. Is there some additional authentication required when opening sqsh files? Any help is really welcome.