NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

SLURM creating container not responding #75

Closed (aviaisr closed this issue 2 years ago)

aviaisr commented 2 years ago

Hi, I'm trying to run a sqsh container with srun:

$ srun -G 1 --mem=8G --cpus-per-gpu=8 --container-image /home/tf_cuda_qat.sqsh --container-mounts /home/avi:/home/avi --pty bash

All I get is the following message, with no further activity. The container is not initialized.

pyxis: creating container filesystem ...

What can be the problem?

flx42 commented 2 years ago

Please run the following:

$ pgrep -a -f 'enroot|unsquashfs'

That should tell us if enroot is still active, e.g.:

$ pgrep -a -f 'enroot|unsquashfs'
354476 bash /usr/bin/enroot create --name pyxis_9035.13 /home/fabecassis/dcse-appsys+mlperf_v1.0+language_model.sqsh
354509 unsquashfs -no-progress -user-xattrs -d /tmp/enroot-data/user-11838/pyxis_9035.13 /home/fabecassis/dcse-appsys+mlperf_v1.0+language_model.sqsh

If I look at the filesystem being created by unsquashfs, I can see it is making forward progress, so everything is fine on my setup:

$ du -sh /tmp/enroot-data/user-11838/pyxis_9035.14
8.9G    /tmp/enroot-data/user-11838/pyxis_9035.14

$ du -sh /tmp/enroot-data/user-11838/pyxis_9035.14
12G     /tmp/enroot-data/user-11838/pyxis_9035.14

$ du -sh /tmp/enroot-data/user-11838/pyxis_9035.14
14G     /tmp/enroot-data/user-11838/pyxis_9035.14
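
If you would rather not rerun du by hand, the repeated check above can be sketched as a small shell function. This is only a sketch: check_growth is a hypothetical helper, not part of pyxis or enroot, and the directory you pass in (for example something under /tmp/enroot-data/ as in my output) depends on how enroot is configured on your cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of pyxis or enroot): sample a directory's
# size several times and report whether it grew between the first and the
# last sample.
# Usage: check_growth DIR [SAMPLES] [INTERVAL_SECONDS]
check_growth() {
    dir=$1
    samples=${2:-3}      # number of du samples to take (assumed default)
    interval=${3:-10}    # seconds to wait between samples (assumed default)

    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 2
    fi

    # du -sk prints "<size in KiB><TAB><path>"; keep only the size.
    first=$(du -sk "$dir" | cut -f1)
    last=$first
    echo "sample 1: ${first} KiB"

    i=2
    while [ "$i" -le "$samples" ]; do
        sleep "$interval"
        last=$(du -sk "$dir" | cut -f1)
        echo "sample $i: ${last} KiB"
        i=$((i + 1))
    done

    if [ "$last" -gt "$first" ]; then
        echo "growing: extraction appears to be making progress"
        return 0
    fi
    echo "not growing: extraction may be stalled"
    return 1
}
```

For example, `check_growth /tmp/enroot-data/user-11838/pyxis_9035.14 3 10` would take three samples ten seconds apart. An exit status of 0 means the directory grew, 1 means it did not, and 2 means the directory does not exist. A directory that is not growing is only a hint that something is stuck; it does not tell you why.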