NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

srun fails #53

Closed moonsooyoung closed 3 years ago

moonsooyoung commented 3 years ago

srun fails with a "No space left on device" error involving the /run/pyxis directory. Why does Enroot create the squashfs file under /run/pyxis, and how can I move it to a location with enough space? The following is an example of the error:

$ srun --container-image=nvcr.io/nvidia/pytorch:20.12-py3 nvidia-smi

pyxis: importing docker image ...
slurmstepd: error: pyxis: child 52574 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user: $oauthtoken
slurmstepd: error: pyxis: [INFO] Using credentials from file: /shared/home/cycleadmin/.config/enroot/.credentials
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [INFO] Found all layers in cache
slurmstepd: error: pyxis: [INFO] Extracting image layers...
slurmstepd: error: pyxis: [INFO] Converting whiteouts...
slurmstepd: error: pyxis: [INFO] Creating squashfs filesystem...
slurmstepd: error: pyxis: Write failed because No space left on device
slurmstepd: error: pyxis: FATAL ERROR:Failed to write to output filesystem
slurmstepd: error: pyxis: Parallel mksquashfs: Using 1 processor
slurmstepd: error: pyxis: Creating 4.0 filesystem on /run/pyxis/20003/35.4.squashfs, block size 131072.
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: hpc-pg0-1: task 0: Exited with exit code 1

The runtime directories are defined in /etc/enroot/enroot.conf as follows:

ENROOT_RUNTIME_PATH /mnt/enroot/$(id -u)/run
ENROOT_CONFIG_PATH $HOME/.config/enroot
ENROOT_CACHE_PATH /mnt/enroot/$(id -u)/.cache
ENROOT_DATA_PATH /mnt/enroot/$(id -u)/.data
ENROOT_TEMP_PATH /mnt/enroot

flx42 commented 3 years ago

Transferred this issue from the enroot repository. Pyxis leverages enroot, but it needs a location to store the squashfs image when it does the equivalent of this sequence of commands:

$ enroot import -o ???/debian.sqsh docker://debian
$ enroot create -n ctr ???/debian.sqsh

By default, pyxis uses /run/pyxis for ???. If your tmpfs on /run is too small, you can increase its size:

$ sudo mount -o remount,size=50% /run

Or you can point pyxis to another directory using the plugstack option runtime_path=; see https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration
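
Before choosing between the two options, it may help to compare the free space on the current location and on a candidate directory. The following is a minimal sketch, not part of pyxis or enroot: the helper names avail_kib and has_space are hypothetical, and /tmp stands in for whatever alternative directory your site would actually use.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not provided by pyxis or enroot.

# Print the available space, in KiB, on the filesystem that holds directory $1.
# "df -P" forces the POSIX single-line format, whose 4th column is "Available".
avail_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem holding directory $1 has at least $2 KiB available.
has_space() {
    [ "$(avail_kib "$1")" -ge "$2" ]
}

# Compare the default pyxis location (/run) with a stand-in alternative (/tmp).
for dir in /run /tmp; do
    if [ -d "$dir" ]; then
        echo "$dir: $(avail_kib "$dir") KiB available"
    fi
done
```

If the alternative has enough room, the plugstack entry would look roughly like "required /usr/local/lib/slurm/spank_pyxis.so runtime_path=/mnt/pyxis", where both the library path and /mnt/pyxis are assumptions that depend on your installation. Check the wiki page linked above for the exact syntax.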

moonsooyoung commented 3 years ago

Thank you for the prompt reply. Let me retry following your recommendations.

moonsooyoung commented 3 years ago

It works perfectly! Thank you again for your support.

Another question: apart from the import/create commands, is ENROOT_RUNTIME_PATH still used for Slurm jobs?

flx42 commented 3 years ago

apart from the import/create commands, is ENROOT_RUNTIME_PATH still used for Slurm jobs?

Yes.

moonsooyoung commented 3 years ago

Great. Thank you for the clarification. :-)