NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

unable to run container image #8

Closed · skatragadda-nygc closed this issue 4 years ago

skatragadda-nygc commented 4 years ago

The /run/pyxis directory was not created.

I tried installing enroot and pyxis as root. Based on the output, the import command actually ran on my local system rather than in a container:

```
/usr/local/bin/enroot import docker://ubuntu
[INFO] Querying registry for permission grant
[INFO] Authenticating with user:
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 1 missing digests...

100% 1:0=0s 775349758637aff77bf85e2ff0597e86e3e859183ef0baba8b3e8fc8d3cba51c

[INFO] Validating digest checksums...

775349758637aff77bf85e2ff0597e86e3e859183ef0baba8b3e8fc8d3cba51c: OK

[INFO] Extracting image layers...

100% 4:0=0s 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c

[INFO] Converting whiteouts...

  0% 0:4=0s 8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1
/bin/bash: enroot-aufs2ovlfs: command not found
 25% 1:3=0s c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff
/bin/bash: enroot-aufs2ovlfs: command not found
 50% 2:2=0s 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c
/bin/bash: enroot-aufs2ovlfs: command not found
 75% 3:1=0s 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c
/bin/bash: enroot-aufs2ovlfs: command not found
100% 4:0=0s 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c
```

Creating the container then fails:

```
/usr/local/bin/enroot create ubuntu.sqsh
[ERROR] No such file or directory: /tmp/pyxis/ubuntu.sqsh
```

And srun with --container-image still runs on the host:

```
srun -p dev --error=%j.log --container-image=ubuntu grep NAME /etc/os-release
NAME="CentOS Linux"
PRETTY_NAME="CentOS Linux 7 (Core)"
CPE_NAME="cpe:/o:centos:centos:7"
```

flx42 commented 4 years ago
> `enroot-aufs2ovlfs: command not found`

How did you install enroot?

skatragadda-nygc commented 4 years ago

Doh! I thought the pull was successful.

I ran the commands below to install; I think enroot is not available in the PATH:

```
yum install -y git gcc make libcap libtool
git clone --recurse-submodules https://github.com/NVIDIA/enroot.git
cd enroot
sudo make install
```

Edit: installed from the RPM packages instead, and enroot works now:

```
enroot import docker://ubuntu
[INFO] Querying registry for permission grant
[INFO] Authenticating with user:
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Found all digests in cache
[INFO] Extracting image layers...

100% 4:0=0s 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c

[INFO] Converting whiteouts...

100% 4:0=0s 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c

[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 2 processors
Creating 4.0 filesystem on /path/ubuntu.sqsh, block size 131072.
[========================================================================/] 2639/2639 100%

Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
    uncompressed data, compressed metadata, compressed fragments, compressed xattrs
    duplicates are removed
Filesystem size 46823.32 Kbytes (45.73 Mbytes)
    74.47% of uncompressed filesystem size (62877.76 Kbytes)
Inode table size 38855 bytes (37.94 Kbytes)
    37.35% of uncompressed inode table size (104027 bytes)
Directory table size 32380 bytes (31.62 Kbytes)
    51.64% of uncompressed directory table size (62701 bytes)
Number of duplicate files found 99
Number of inodes 3163
Number of files 2410
Number of fragments 251
Number of symbolic links 180
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 573
Number of ids (unique uids + gids) 1
Number of uids 1
    root (0)
Number of gids 1
    root (0)
```

flx42 commented 4 years ago

When compiling from source, you also need to run `sudo make setcap`.
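For reference, a complete from-source install on CentOS 7 would then look something like this (a sketch combining the commands above with the missing step):

```
yum install -y git gcc make libcap libtool
git clone --recurse-submodules https://github.com/NVIDIA/enroot.git
cd enroot
sudo make install
# Grants the file capabilities that enroot's helper binaries
# (e.g. enroot-aufs2ovlfs, enroot-mksquashovlfs) need to run unprivileged.
sudo make setcap
```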

skatragadda-nygc commented 4 years ago

```
enroot start ubuntu
enroot-unshare: failed to unshare user namespace: Invalid argument
```

I installed through the RPM this time.

flx42 commented 4 years ago

Check the enroot requirements here: https://github.com/NVIDIA/enroot/blob/master/doc/requirements.md

On RHEL/CentOS, I believe user namespaces are not enabled by default.

skatragadda-nygc commented 4 years ago

Gotcha! I ran the check:

```
Kernel configuration:

CONFIG_NAMESPACES              : OK
CONFIG_USER_NS                 : OK
CONFIG_SECCOMP_FILTER          : OK
CONFIG_OVERLAY_FS              : OK (module)
CONFIG_X86_VSYSCALL_EMULATION  : KO (required if glibc <= 2.13)
CONFIG_VSYSCALL_EMULATE        : KO (required if glibc <= 2.13)
CONFIG_VSYSCALL_NATIVE         : KO (required if glibc <= 2.13)

Kernel command line:

namespace.unpriv_enable=1      : KO
user_namespace.enable=1        : KO
vsyscall=native                : KO (required if glibc <= 2.13)
vsyscall=emulate               : KO (required if glibc <= 2.13)

Kernel parameters:

user.max_user_namespaces       : OK
user.max_mnt_namespaces        : OK

Extra packages:

nvidia-container-cli           : KO (required for GPU support)
pv                             : OK
```

skatragadda-nygc commented 4 years ago

I have enabled user namespaces.
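For anyone hitting the same KO entries: one way to enable them on CentOS 7 is via grubby (an assumed approach; the exact commands used here weren't posted):

```
# Add the two flags the check reported as KO to the default kernel's
# command line; a reboot is required for them to take effect.
sudo grubby --args="namespace.unpriv_enable=1 user_namespace.enable=1" \
            --update-kernel="$(grubby --default-kernel)"
sudo reboot
```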

flx42 commented 4 years ago

Reboot and then check if it works now. Thanks!

skatragadda-nygc commented 4 years ago

```
enroot start ubuntu
host:/# grep NAME /etc/os-release
NAME="Ubuntu"
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
```

enroot works fine now.

The pyxis plugin still doesn't work, though :/

```
srun -p dev --container-image=ubuntu grep NAME /etc/os-release
NAME="CentOS Linux"
PRETTY_NAME="CentOS Linux 7 (Core)"
CPE_NAME="cpe:/o:centos:centos:7"
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
```

flx42 commented 4 years ago

@stingbe do you have more logs for the srun command? Or try running with `SLURM_DEBUG=2 srun ...`

Thanks

skatragadda-nygc commented 4 years ago

```
SLURM_DEBUG=2 srun -p dev --container-image=ubuntu grep NAME /etc/os-release
srun: Linear node selection plugin loaded with argument 20
srun: Consumable Resources (CR) Node Selection plugin loaded with argument 20
srun: select/cons_tres loaded with argument 20
srun: Cray/Aries node selection plugin loaded
srun: debug: switch Cray/Aries plugin loaded.
srun: debug: switch NONE plugin loaded
srun: debug: switch generic plugin loaded
srun: debug: spank: opening plugin stack /etc/slurm/plugstack.conf
srun: debug: /etc/slurm/plugstack.conf: 2: include "/etc/slurm/plugstack.conf.d/*"
srun: debug: spank: opening plugin stack /etc/slurm/plugstack.conf.d/pyxis.conf
srun: debug: spank: /etc/slurm/plugstack.conf.d/pyxis.conf:1: Loaded plugin spank_pyxis.so
srun: debug: SPANK: appending plugin option "container-image"
srun: debug: SPANK: appending plugin option "container-mounts"
srun: debug: SPANK: appending plugin option "container-workdir"
srun: debug: SPANK: appending plugin option "container-name"
srun: debug: SPANK: appending plugin option "container-mount-home"
srun: debug: SPANK: appending plugin option "no-container-mount-home"
srun: launch Slurm plugin loaded
srun: debug: mpi type = none
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=14987
srun: debug: propagating RLIMIT_NOFILE=262144
srun: debug: propagating RLIMIT_MEMLOCK=65536
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0002
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 32952
srun: debug: Entering _msg_thr_internal
srun: debug: Munge authentication plugin loaded
srun: jobid 149: nodes(1):`host01', cpu counts: 1(x1)
srun: debug: requesting job 149, user 0, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name grep, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 42230
srun: debug: Started IO server thread (139900488541952)
srun: debug: Entering _launch_tasks
srun: launching 149.0 on host host01, 1 tasks: 0
srun: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node host01, 1 tasks started
NAME="CentOS Linux"
PRETTY_NAME="CentOS Linux 7 (Core)"
CPE_NAME="cpe:/o:centos:centos:7"
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
srun: Received task exit notification for 1 task of step 149.0 (status=0x0000).
srun: host01: task 0: Completed
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: Leaving _msg_thr_internal
```

flx42 commented 4 years ago

OK, so srun can see the pyxis plugin just fine, but I don't see any pyxis log coming from slurmstepd. There could be two options:
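One thing worth double-checking (an assumption, not a confirmed diagnosis) is that the compute nodes carry the same SPANK configuration that the submit host shows in the log above:

```
# /etc/slurm/plugstack.conf (line 2 in the srun log above):
include /etc/slurm/plugstack.conf.d/*

# /etc/slurm/plugstack.conf.d/pyxis.conf; the plugin path below is an
# assumption, point it at wherever spank_pyxis.so was actually installed:
required /usr/local/lib/slurm/spank_pyxis.so
```

slurmstepd runs on the compute node, so if that node's plugstack.conf doesn't load pyxis, the step could run uncontainerized exactly as seen here.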

skatragadda-nygc commented 4 years ago

do you have a timeline for sbatch or salloc support?

flx42 commented 4 years ago

> do you have a timeline for sbatch or salloc support?

Not really! We had some users asking for it, but for their use cases, doing `srun --container-image` inside the sbatch or salloc was sufficient. What is your use case?
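For example, a batch script along these lines (a sketch reusing the partition and image from this thread) covers that workflow today:

```
#!/bin/bash
#SBATCH --partition=dev
#SBATCH --output=%j.log

# Each containerized step is launched with srun inside the batch allocation;
# pyxis imports the image when the step starts.
srun --container-image=ubuntu grep NAME /etc/os-release
```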

Also, were you able to solve your problem above?

Thanks for testing!

skatragadda-nygc commented 4 years ago

I was able to solve the above problem.

Is it possible to modify my config to pull from the GCR container registry instead of Docker Hub?

flx42 commented 4 years ago

Yes, for instance:

```
srun --container-image=gcr.io#deeplearning-platform-release/tf-gpu ls
```

You need the `#` here since this is a public container registry that is not DockerHub, so enroot needs the `#` as a separator to interpret the URI correctly.
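For reference, enroot's image reference syntax (from the enroot import documentation) is `[USER@][REGISTRY#]IMAGE[:TAG]`, so for example:

```
srun --container-image=ubuntu ls                                        # Docker Hub, :latest implied
srun --container-image=gcr.io#deeplearning-platform-release/tf-gpu ls   # other registry, '#' separator
```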

skatragadda-nygc commented 4 years ago

Thank you!

How would I go about connecting to a private GCR?

I have gcloud auth configured. When I tried to pull, it authenticated as the anonymous user, and when I provided the username it did not let me enter a password.

```
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user:
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
```



flx42 commented 4 years ago

You need to set up your credentials in the format enroot expects: https://github.com/NVIDIA/enroot/blob/master/doc/cmd/import.md#description

By default the credentials file will likely be ~/.config/enroot/.credentials.
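The file uses netrc syntax; a GCR entry would look like this (a sketch following the example in the enroot import docs, which resolve the token via gcloud):

```
# ~/.config/enroot/.credentials
machine gcr.io login oauth2accesstoken password $(gcloud auth print-access-token)
```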

skatragadda-nygc commented 4 years ago

I am going to start experimenting with public GCR images first. I tried importing tf-gpu with no success:

```
SLURM_DEBUG=2 srun -p dev --container-image=gcr.io#deeplearning-platform-release/tf-gpu ls
srun: Linear node selection plugin loaded with argument 20
srun: Consumable Resources (CR) Node Selection plugin loaded with argument 20
srun: select/cons_tres loaded with argument 20
srun: Cray/Aries node selection plugin loaded
srun: debug: switch Cray/Aries plugin loaded.
srun: debug: switch NONE plugin loaded
srun: debug: switch generic plugin loaded
srun: debug: spank: opening plugin stack /etc/slurm/plugstack.conf
srun: debug: /etc/slurm/plugstack.conf: 2: include "/etc/slurm/plugstack.conf.d/*"
srun: debug: spank: opening plugin stack /etc/slurm/plugstack.conf.d/pyxis.conf
srun: debug: spank: /etc/slurm/plugstack.conf.d/pyxis.conf:1: Loaded plugin spank_pyxis.so
srun: debug: SPANK: appending plugin option "container-image"
srun: debug: SPANK: appending plugin option "container-mounts"
srun: debug: SPANK: appending plugin option "container-workdir"
srun: debug: SPANK: appending plugin option "container-name"
srun: debug: SPANK: appending plugin option "container-mount-home"
srun: debug: SPANK: appending plugin option "no-container-mount-home"
srun: launch Slurm plugin loaded
srun: debug: mpi type = none
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=4096
srun: debug: propagating RLIMIT_NOFILE=262144
srun: debug: propagating RLIMIT_MEMLOCK=65536
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0002
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 34103
srun: debug: Entering _msg_thr_internal
srun: debug: Munge authentication plugin loaded
srun: jobid 310: nodes(1):`nodename', cpu counts: 1(x1)
srun: debug: requesting job 310, user 1331, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name ls, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 36986
srun: debug: Started IO server thread (140668176021248)
srun: debug: Entering _launch_tasks
srun: launching 310.0 on host nodename, 1 tasks: 0
srun: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
slurmstepd: pyxis: importing docker image ...
slurmstepd: error: pyxis: child 40164 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Found valid credentials in cache
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://gcr.io/v2/deeplearning-platform-release/tf-gpu/manifests/latest returned error code: 401 Unauthorized
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init_privileged() failed with rc=-1
slurmstepd: error: spank_task_init_privileged failed
srun: Node nodename, 1 tasks started
srun: Received task exit notification for 1 task of step 310.0 (status=0x0100).
srun: error: nodename: task 0: Exited with exit code 1
srun: Terminating job step 310.0
srun: debug: task 0 done
srun: debug: IO thread exiting
srun: debug: Leaving _msg_thr_internal
```

flx42 commented 4 years ago

It's able to find your credentials in the file, but it seems they are not accepted. Also, this appears to be a public image, so you shouldn't need credentials for it at all.

Do you need both public and private images from gcr.io? You should test with just enroot, and file an issue against it if you believe there is a problem with how this registry is handled.

skatragadda-nygc commented 4 years ago

Our use case requires both public and private images. I stored the credentials in the credentials file as:

```
machine gcr.io login oauth2accesstoken password $token
```

with $token replaced by the actual value.

enroot threw the same error:

```
enroot import docker://gcr.io#deeplearning-platform-release/tf-gpu
[INFO] Querying registry for permission grant
[INFO] Found valid credentials in cache
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] URL https://gcr.io/v2/deeplearning-platform-release/tf-gpu/manifests/latest returned error code: 401 Unauthorized
```