NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

pyxis with enroot using local docker registry images failing #112

Closed: karanveersingh5623 closed this issue 1 year ago

karanveersingh5623 commented 1 year ago

I am not able to run pyxis with enroot using images from a local Docker registry.

[root@master88 ~]# SLURM_DEBUG=2 srun --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 grep PRETTY /etc/os-release
srun: select/cons_res: common_init: select/cons_res loaded
srun: select/cons_tres: common_init: select/cons_tres loaded
srun: select/linear: init: Linear node selection plugin loaded with argument 4
srun: debug:  switch/none: init: switch NONE plugin loaded
srun: debug:  spank: opening plugin stack /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf
srun: debug:  /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf: 1: include "/cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/*"
srun: debug:  spank: opening plugin stack /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/pyxis.conf
srun: debug:  spank: /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/pyxis.conf:1: Loaded plugin spank_pyxis.so
srun: debug:  SPANK: appending plugin option "container-image"
srun: debug:  SPANK: appending plugin option "container-mounts"
srun: debug:  SPANK: appending plugin option "container-workdir"
srun: debug:  SPANK: appending plugin option "container-name"
srun: debug:  SPANK: appending plugin option "container-save"
srun: debug:  SPANK: appending plugin option "container-mount-home"
srun: debug:  SPANK: appending plugin option "no-container-mount-home"
srun: debug:  SPANK: appending plugin option "container-remap-root"
srun: debug:  SPANK: appending plugin option "no-container-remap-root"
srun: debug:  SPANK: appending plugin option "container-entrypoint"
srun: debug:  SPANK: appending plugin option "no-container-entrypoint"
srun: debug:  SPANK: appending plugin option "container-writable"
srun: debug:  SPANK: appending plugin option "container-readonly"
srun: launch/slurm: init: launch Slurm plugin loaded
srun: debug:  mpi type = none
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=255101
srun: debug:  propagating RLIMIT_NOFILE=131072
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 33839
srun: debug:  Entering _msg_thr_internal
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes node003 are ready for job
srun: jobid 11: nodes(1):`node003', cpu counts: 1(x1)
srun: debug:  requesting job 11, user 0, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name grep, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 35789
srun: debug:  Started IO server thread (23456203736832)
srun: debug:  Entering _launch_tasks
srun: launching StepId=11.0 on host node003, 1 tasks: 0
srun: route/default: init: route default plugin loaded
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: launch/slurm: _task_start: Node node003, 1 tasks started
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: pyxis: child 1202377 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=11.0 (status=0x0100).
srun: error: node003: task 0: Exited with exit code 1
srun: debug:  task 0 done
srun: debug:  IO thread exiting
srun: debug:  Leaving _msg_thr_internal
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      3   idle node[002-004]
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]# find / -name pyxis.conf
/cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/pyxis.conf
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]# cat /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/pyxis.conf
required /cm/shared/apps/slurm/21.08.8/lib64/slurm/spank_pyxis.so runtime_path=/tmp execute_entrypoint=0 container_scope=global sbatch_support=1
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]# ll /cm/shared/apps/slurm/21.08.8/lib64/slurm/spank_pyxis.so
-rwxrwxrwx 1 root root 52064 May 10 10:31 /cm/shared/apps/slurm/21.08.8/lib64/slurm/spank_pyxis.so
karanveersingh5623 commented 1 year ago

@flx42, please let me know if I am missing something.

flx42 commented 1 year ago

Can you try with enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4?

karanveersingh5623 commented 1 year ago

@flx42, this is NVIDIA Bright Cluster Manager, and I can only find enroot on the master node. I installed pyxis + enroot using cm-wlm-setup; I have used it before and it worked fine, so I don't know what I am missing.

[root@master88 enroot]# find / -name enroot
/run/user/0/enroot
/root/.cache/enroot
/root/.local/share/enroot
/cm/images/default-image-node002/etc/enroot
/cm/images/default-image-node002/usr/share/enroot
/cm/images/default-image-node002/usr/bin/enroot
/cm/images/default-image-node002/usr/lib/enroot
[root@master88 enroot]#
[root@master88 enroot]#
[root@master88 enroot]# ll /cm/images/default-image-node002/usr/bin/enroot*
-rwxr-xr-x 1 root root 17439 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot
-rwxr-xr-x 1 root root 34728 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot-aufs2ovlfs
-rwxr-xr-x 1 root root 20146 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot-makeself
-rwxr-xr-x 1 root root 34728 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot-mksquashovlfs
-rwxr-xr-x 1 root root 63400 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot-mount
-rwxr-xr-x 1 root root 43080 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot-nsenter
-rwxr-xr-x 1 root root 75696 Nov 13  2021 /cm/images/default-image-node002/usr/bin/enroot-switchroot
[root@master88 enroot]#
[root@master88 enroot]#
[root@master88 enroot]# /cm/images/default-image-node002/usr/bin/enroot --help
/cm/images/default-image-node002/usr/bin/enroot: line 111: /usr/lib/enroot/common.sh: No such file or directory

This looks like the issue, though I am not sure: /usr/lib/enroot/common.sh does not exist on the master, and the library files live under a different path.

[root@master88 enroot]# ll /cm/images/default-image-node002/usr/lib/enroot/
total 60
-rw-r--r-- 1 root root 12128 Nov 13  2021 bundle.sh
-rw-r--r-- 1 root root  7222 Nov 13  2021 common.sh
-rw-r--r-- 1 root root 16070 Nov 13  2021 docker.sh
-rw-r--r-- 1 root root 22711 Nov 13  2021 runtime.sh
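A possible workaround sketch, assuming the helper scripts shipped in the node image are compatible with the master: enroot locates common.sh and the other library scripts through ENROOT_LIBRARY_PATH (which defaults to /usr/lib/enroot, as the commented-out entry in the enroot.conf further down also indicates), and enroot configuration parameters can generally be overridden via environment variables, so the variable could be pointed at the node image's copy instead of copying files around:

# hypothetical workaround, not the BCM-supported layout: reuse the library
# scripts that ship inside the node image
export ENROOT_LIBRARY_PATH=/cm/images/default-image-node002/usr/lib/enroot
/cm/images/default-image-node002/usr/bin/enroot --help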
karanveersingh5623 commented 1 year ago

Below is what I see on my worker node:

[root@node004 ~]# find / -name enroot
[root@node004 ~]# find / -name enroot*
/cm/shared/apps/slurm/var/etc/enroot-sysctl.conf
/cm/shared/apps/slurm/var/etc/enroot.conf
[root@node004 ~]#
[root@node004 ~]#
[root@node004 ~]#
[root@node004 ~]# cat /cm/shared/apps/slurm/var/etc/enroot.conf
# Working directory for enroot
ENROOT_RUNTIME_PATH         /run/enroot/runtime/$(id -u)

# Directory where container layers are stored
ENROOT_CACHE_PATH           /run/enroot/cache/$(id -u)

# Directory where the filesystems of running containers are stored
ENROOT_DATA_PATH            /var/lib/enroot/data/$(id -u)

# Options passed to mksquashfs to produce container images.
ENROOT_SQUASH_OPTIONS      -noI -noD -noF -noX -no-duplicates

# Mount the current user's home directory by default.
ENROOT_MOUNT_HOME          yes

# Path to user configuration files
#ENROOT_CONFIG_PATH         ${HOME}/.config/enroot

# Restrict /dev inside the container to a minimal set of devices.
ENROOT_RESTRICT_DEV        yes

# Make the container root filesystem writable by default.
ENROOT_ROOTFS_WRITABLE     yes

# Options passed to zstd to compress digest layers.
ENROOT_ZSTD_OPTIONS        -1

# Number of times network operations should be retried.
ENROOT_TRANSFER_RETRIES    5

# Maximum time in seconds to wait for connections establishment (0 means unlimited).
ENROOT_CONNECT_TIMEOUT     60

# Maximum time in seconds to wait for network operations to complete (0 means unlimited).
ENROOT_TRANSFER_TIMEOUT    1200

# Maximum number of concurrent connections (0 means unlimited).
ENROOT_MAX_CONNECTIONS     10

# Path to library sources
#ENROOT_LIBRARY_PATH        /usr/lib/enroot

# Path to system configuration file
#ENROOT_SYSCONF_PATH        /etc/enroot

# Path to temporary directory
#ENROOT_TEMP_PATH           ${TMPDIR:-/tmp}

# Gzip program used to uncompress digest layers.
#ENROOT_GZIP_PROGRAM        gzip

# Remap the current user to root inside containers by default.
#ENROOT_REMAP_ROOT          no

# Maximum number of processors to use for parallel tasks (0 means unlimited).
#ENROOT_MAX_PROCESSORS      $(nproc)

# Use a login shell to run the container initialization.
#ENROOT_LOGIN_SHELL         yes

# Allow root to retain his superuser privileges inside containers.
#ENROOT_ALLOW_SUPERUSER     no

# Use HTTP for outgoing requests instead of HTTPS (UNSECURE!).
ENROOT_ALLOW_HTTP          yes

# Include user-specific configuration inside bundles by default.
#ENROOT_BUNDLE_ALL          no

# Generate an embedded checksum inside bundles by default.
#ENROOT_BUNDLE_CHECKSUM     no

# Always use --force on command invocations.
#ENROOT_FORCE_OVERRIDE      no

# SSL certificates settings
#SSL_CERT_DIR
#SSL_CERT_FILE

# Proxy settings
#all_proxy
#no_proxy
#http_proxy
#https_proxy
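Two things stand out in this file: ENROOT_ALLOW_HTTP yes is what lets enroot talk to a plain-HTTP registry, and the commented-out ENROOT_SYSCONF_PATH shows that a stock enroot installation would read its configuration from /etc/enroot rather than from this BCM-managed path. A minimal sketch, assuming the registry at 192.168.61.4:5000 is HTTP-only and that a manually installed enroot should reuse these settings:

# assumption: a stock enroot reads /etc/enroot/enroot.conf (its default
# ENROOT_SYSCONF_PATH), so copy the BCM-generated configuration there,
# including the ENROOT_ALLOW_HTTP yes setting
cp /cm/shared/apps/slurm/var/etc/enroot.conf /etc/enroot/enroot.conf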
karanveersingh5623 commented 1 year ago
[root@master88 ~]# cp -r /cm/images/default-image-node002/usr/lib/enroot /usr/lib/
[root@master88 ~]# /cm/images/default-image-node002/usr/bin/enroot --help
Usage: enroot COMMAND [ARG...]

Command line utility for manipulating container sandboxes.

 Commands:
   batch  [options] [--] CONFIG [COMMAND] [ARG...]
   bundle [options] [--] IMAGE
   create [options] [--] IMAGE
   exec   [options] [--] PID COMMAND [ARG...]
   export [options] [--] NAME
   import [options] [--] URI
   list   [options]
   remove [options] [--] NAME...
   start  [options] [--] NAME|IMAGE [COMMAND] [ARG...]
   version
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]# /cm/images/default-image-node002/usr/bin/enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[ERROR] Command not found: jq
infokng commented 1 year ago

I installed enroot manually on one of the worker nodes, but it is still not working.

[root@node003 ~]# dnf install -y https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot-3.4.1-1.el8.x86_64.rpm
Last metadata expiration check: 0:00:12 ago on Thu 11 May 2023 02:16:17 PM KST.
enroot-3.4.1-1.el8.x86_64.rpm                                                                                                                                                    96 kB/s | 110 kB     00:01
Dependencies resolved.
================================================================================================================================================================================================================
 Package                                          Architecture                                  Version                                               Repository                                           Size
================================================================================================================================================================================================================
Installing:
 enroot                                           x86_64                                        3.4.1-1.el8                                           @commandline                                        110 k
Installing dependencies:
 jq                                               x86_64                                        1.6-3.el8                                             appstream                                           201 k
 oniguruma                                        x86_64                                        6.8.2-2.el8                                           appstream                                           186 k
 parallel                                         noarch                                        20190922-1.el8                                        epel                                                351 k

Transaction Summary
================================================================================================================================================================================================================
Install  4 Packages

Total size: 849 k
Total download size: 738 k
Installed size: 2.6 M
Downloading Packages:
(1/3): oniguruma-6.8.2-2.el8.x86_64.rpm                                                                                                                                         597 kB/s | 186 kB     00:00
(2/3): jq-1.6-3.el8.x86_64.rpm                                                                                                                                                  571 kB/s | 201 kB     00:00
(3/3): parallel-20190922-1.el8.noarch.rpm                                                                                                                                       868 kB/s | 351 kB     00:00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                                                                           413 kB/s | 738 kB     00:01
Extra Packages for Enterprise Linux 8 - x86_64                                                                                                                                  1.6 MB/s | 1.6 kB     00:00
Importing GPG key 0x2F86D6A1:
 Userid     : "Fedora EPEL (8) <epel@fedoraproject.org>"
 Fingerprint: 94E2 79EB 8D8F 25B2 1810 ADF1 21EA 45AB 2F86 D6A1
 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-8
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                                                                                        1/1
  Installing       : parallel-20190922-1.el8.noarch                                                                                                                                                         1/4
  Installing       : oniguruma-6.8.2-2.el8.x86_64                                                                                                                                                           2/4
  Running scriptlet: oniguruma-6.8.2-2.el8.x86_64                                                                                                                                                           2/4
  Installing       : jq-1.6-3.el8.x86_64                                                                                                                                                                    3/4
  Installing       : enroot-3.4.1-1.el8.x86_64                                                                                                                                                              4/4
  Running scriptlet: enroot-3.4.1-1.el8.x86_64                                                                                                                                                              4/4
  Verifying        : jq-1.6-3.el8.x86_64                                                                                                                                                                    1/4
  Verifying        : oniguruma-6.8.2-2.el8.x86_64                                                                                                                                                           2/4
  Verifying        : parallel-20190922-1.el8.noarch                                                                                                                                                         3/4
  Verifying        : enroot-3.4.1-1.el8.x86_64                                                                                                                                                              4/4

Installed:
  enroot-3.4.1-1.el8.x86_64                          jq-1.6-3.el8.x86_64                          oniguruma-6.8.2-2.el8.x86_64                          parallel-20190922-1.el8.noarch

Complete!
[root@node003 ~]# enroot
enroot                enroot-aufs2ovlfs     enroot-makeself       enroot-mksquashovlfs  enroot-mount          enroot-nsenter        enroot-switchroot
[root@node003 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
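This curl error typically means a TLS handshake was attempted against an endpoint that only speaks plain HTTP. The BCM-generated enroot.conf shown above sets ENROOT_ALLOW_HTTP yes, but the manually installed enroot on node003 does not read that file, so it defaults to HTTPS. A debugging sketch, assuming the registry serves plain HTTP on port 5000 and that enroot settings can also be passed as environment variables:

# is the registry reachable over plain HTTP?
curl http://192.168.61.4:5000/v2/_catalog
# retry the import with HTTP explicitly allowed (assumption: enroot honours
# its configuration parameters when set in the environment)
ENROOT_ALLOW_HTTP=yes enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4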
infokng commented 1 year ago

I reinstalled the Slurm cluster on BCM (Bright Cluster Manager), but srun still fails with the same error. I don't know what is going on :(

flx42 commented 1 year ago
/cm/images/default-image-node002/usr/bin/enroot: line 111: /usr/lib/enroot/common.sh: No such file or directory

Looks like a problem with the enroot installation from BCM. This should fail with all container images.

[root@node003 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number

This looks like a problem with the local Docker registry. Is enroot working with Docker Hub images in this setup?

karanveersingh5623 commented 1 year ago
[root@master88 ~]# cp -r /cm/images/default-image-node002/usr/lib/enroot /usr/lib/
[root@master88 ~]# /cm/images/default-image-node002/usr/bin/enroot --help
Usage: enroot COMMAND [ARG...]

Command line utility for manipulating container sandboxes.

 Commands:
   batch  [options] [--] CONFIG [COMMAND] [ARG...]
   bundle [options] [--] IMAGE
   create [options] [--] IMAGE
   exec   [options] [--] PID COMMAND [ARG...]
   export [options] [--] NAME
   import [options] [--] URI
   list   [options]
   remove [options] [--] NAME...
   start  [options] [--] NAME|IMAGE [COMMAND] [ARG...]
   version
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]#
[root@master88 ~]# /cm/images/default-image-node002/usr/bin/enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[ERROR] Command not found: jq

I copied the files to the location mentioned above.

infokng commented 1 year ago
/cm/images/default-image-node002/usr/bin/enroot: line 111: /usr/lib/enroot/common.sh: No such file or directory

Looks like a problem with the enroot installation from BCM. This should fail with all container images.

[root@node003 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number

This looks like a problem with the local Docker registry. Is enroot working with Docker Hub images in this setup?

@flx42, I tried with the Docker Hub ubuntu image and it worked. My Docker registry works fine with docker pull, so why is it not working with enroot? Do I have to add any credentials somewhere?

[root@node003 ~]# enroot import docker://ubuntu
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 2 missing layers...

100% 2:0=0s dbf6a9befcdeecbb8813406afbd62ce81394e3869d84599f19f941aa5c74f3d1

[INFO] Extracting image layers...

100% 1:0=0s dbf6a9befcdeecbb8813406afbd62ce81394e3869d84599f19f941aa5c74f3d1

[INFO] Converting whiteouts...

100% 1:0=0s dbf6a9befcdeecbb8813406afbd62ce81394e3869d84599f19f941aa5c74f3d1

[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 88 processors
Creating 4.0 filesystem on /root/ubuntu.sqsh, block size 131072.
[=============================================================================================================================================================================================|] 2931/2931 100%

Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
        uncompressed data, compressed metadata, compressed fragments, compressed xattrs
        duplicates are removed
Filesystem size 56606.98 Kbytes (55.28 Mbytes)
        74.29% of uncompressed filesystem size (76199.39 Kbytes)
Inode table size 42820 bytes (41.82 Kbytes)
        36.73% of uncompressed inode table size (116577 bytes)
Directory table size 35563 bytes (34.73 Kbytes)
        50.03% of uncompressed directory table size (71090 bytes)
Number of duplicate files found 130
Number of inodes 3519
Number of files 2628
Number of fragments 273
Number of symbolic links  212
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 679
Number of ids (unique uids + gids) 1
Number of uids 1
        root (0)
Number of gids 1
        root (0)
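For completeness, since the question above mentions credentials: if the local registry did require authentication, enroot reads registry credentials from a netrc-style file at $ENROOT_CONFIG_PATH/.credentials. That is likely not the problem here, because the failure happens during the TLS handshake rather than during authentication, but a hypothetical entry (user name and password invented for illustration) would look like:

# hypothetical ~/.config/enroot/.credentials entry for the local registry
machine 192.168.61.4 login myuser password mypassword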
infokng commented 1 year ago

This is a very strange issue. BCM deploys the Slurm cluster with pyxis support successfully, so where is the problem? I am stuck in my project and have exhausted almost all my options.

infokng commented 1 year ago

@flx42, BCM's enroot worked :)

[root@node002 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
[INFO] Permission granted
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 63 missing layers...

100% 63:0=0s e1471ee6a09e6b27237602eb8eda43e616af5f0ef9261aaab8241b33c545dfbf

[INFO] Extracting image layers...

100% 62:0=0s 35807b77a593c1147d13dc926a91dcc3015616ff7307cc30442c5a8e07546283

[INFO] Converting whiteouts...

100% 62:0=0s 35807b77a593c1147d13dc926a91dcc3015616ff7307cc30442c5a8e07546283

[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 128 processors
Creating 4.0 filesystem on /root/+cosmoflow-nvidia+0.4.sqsh, block size 131072.
[=========================================================================================================================================================================================\] 180702/180702 100%

Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
        uncompressed data, uncompressed metadata, uncompressed fragments, uncompressed xattrs
        duplicates are not removed
Filesystem size 12786712.34 Kbytes (12487.02 Mbytes)
        99.92% of uncompressed filesystem size (12796732.07 Kbytes)
Inode table size 3817162 bytes (3727.70 Kbytes)
        100.00% of uncompressed inode table size (3817162 bytes)
Directory table size 2836366 bytes (2769.89 Kbytes)
        100.00% of uncompressed directory table size (2836366 bytes)
No duplicate files removed
Number of inodes 105469
Number of files 88688
Number of fragments 6878
Number of symbolic links  1493
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 15288
Number of ids (unique uids + gids) 1
Number of uids 1
        root (0)
Number of gids 1
        root (0)
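With the image imported to a squashfs file, the original job can also be launched from that file directly, since pyxis accepts a local squashfs path for --container-image. A sketch, assuming /root/+cosmoflow-nvidia+0.4.sqsh is visible from the compute node:

srun --container-image=/root/+cosmoflow-nvidia+0.4.sqsh grep PRETTY /etc/os-release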
infokng commented 1 year ago

I will keep this open for one day and then close it.