Closed — karanveersingh5623 closed this issue 1 year ago
@flx42, please let me know if I am missing something.
Can you try with enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4 ?
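The suggested command uses enroot's docker URI syntax, where the `#` separates the registry host:port from the image path. A minimal sketch assembling that URI from its parts (the registry, image, and tag values are the ones from this thread):

```shell
# enroot accepts docker://[USER@][REGISTRY#]IMAGE[:TAG];
# the '#' splits the registry host:port from the image path.
# Values below match the local registry used in this thread.
registry="192.168.61.4:5000"
image="cosmoflow-nvidia"
tag="0.4"
uri="docker://${registry}#/${image}:${tag}"
echo "$uri"
```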
@flx42, this is an NVIDIA Bright Cluster Manager deployment, and I can only find enroot on the master node. I installed pyxis + enroot using cm-wlm-setup; I have used it before and it worked fine, so I don't know what I am missing.
[root@master88 enroot]# find / -name enroot
/run/user/0/enroot
/root/.cache/enroot
/root/.local/share/enroot
/cm/images/default-image-node002/etc/enroot
/cm/images/default-image-node002/usr/share/enroot
/cm/images/default-image-node002/usr/bin/enroot
/cm/images/default-image-node002/usr/lib/enroot
[root@master88 enroot]#
[root@master88 enroot]# ll /cm/images/default-image-node002/usr/bin/enroot*
-rwxr-xr-x 1 root root 17439 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot
-rwxr-xr-x 1 root root 34728 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot-aufs2ovlfs
-rwxr-xr-x 1 root root 20146 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot-makeself
-rwxr-xr-x 1 root root 34728 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot-mksquashovlfs
-rwxr-xr-x 1 root root 63400 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot-mount
-rwxr-xr-x 1 root root 43080 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot-nsenter
-rwxr-xr-x 1 root root 75696 Nov 13 2021 /cm/images/default-image-node002/usr/bin/enroot-switchroot
[root@master88 enroot]#
[root@master88 enroot]# /cm/images/default-image-node002/usr/bin/enroot --help
/cm/images/default-image-node002/usr/bin/enroot: line 111: /usr/lib/enroot/common.sh: No such file or directory
Looks like this is the issue, but I am not sure: /usr/lib/enroot/common.sh does not exist on the master; the library files live under a different path.
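The enroot binary is a shell script that sources its library files from a default path, which explains the error above: on the master the libraries only exist inside the node image. A sketch of the lookup, assuming the ENROOT_LIBRARY_PATH variable (it appears in the enroot.conf shown further down) also works as an environment override before the script sources common.sh — whether the override reaches that sourcing line is an assumption here:

```shell
# Mimic enroot's library lookup: default to /usr/lib/enroot unless
# ENROOT_LIBRARY_PATH is already set (assumption: the script honors
# the env var before sourcing common.sh).
ENROOT_LIBRARY_PATH="${ENROOT_LIBRARY_PATH:-/usr/lib/enroot}"
# On the master, pointing it at the node image's copy should let the
# binary find common.sh:
#   ENROOT_LIBRARY_PATH=/cm/images/default-image-node002/usr/lib/enroot \
#     /cm/images/default-image-node002/usr/bin/enroot --help
echo "library path: ${ENROOT_LIBRARY_PATH}"
```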
[root@master88 enroot]# ll /cm/images/default-image-node002/usr/lib/enroot/
total 60
-rw-r--r-- 1 root root 12128 Nov 13 2021 bundle.sh
-rw-r--r-- 1 root root 7222 Nov 13 2021 common.sh
-rw-r--r-- 1 root root 16070 Nov 13 2021 docker.sh
-rw-r--r-- 1 root root 22711 Nov 13 2021 runtime.sh
Below is my worker node:
[root@node004 ~]# find / -name enroot
[root@node004 ~]# find / -name enroot*
/cm/shared/apps/slurm/var/etc/enroot-sysctl.conf
/cm/shared/apps/slurm/var/etc/enroot.conf
[root@node004 ~]#
[root@node004 ~]# cat /cm/shared/apps/slurm/var/etc/enroot.conf
# Working directory for enroot
ENROOT_RUNTIME_PATH /run/enroot/runtime/$(id -u)
# Directory where container layers are stored
ENROOT_CACHE_PATH /run/enroot/cache/$(id -u)
# Directory where the filesystems of running containers are stored
ENROOT_DATA_PATH /var/lib/enroot/data/$(id -u)
# Options passed to mksquashfs to produce container images.
ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
# Mount the current user's home directory by default.
ENROOT_MOUNT_HOME yes
# Path to user configuration files
#ENROOT_CONFIG_PATH ${HOME}/.config/enroot
# Restrict /dev inside the container to a minimal set of devices.
ENROOT_RESTRICT_DEV yes
# Make the container root filesystem writable by default.
ENROOT_ROOTFS_WRITABLE yes
# Options passed to zstd to compress digest layers.
ENROOT_ZSTD_OPTIONS -1
# Number of times network operations should be retried.
ENROOT_TRANSFER_RETRIES 5
# Maximum time in seconds to wait for connections establishment (0 means unlimited).
ENROOT_CONNECT_TIMEOUT 60
# Maximum time in seconds to wait for network operations to complete (0 means unlimited).
ENROOT_TRANSFER_TIMEOUT 1200
# Maximum number of concurrent connections (0 means unlimited).
ENROOT_MAX_CONNECTIONS 10
# Path to library sources
#ENROOT_LIBRARY_PATH /usr/lib/enroot
# Path to system configuration file
#ENROOT_SYSCONF_PATH /etc/enroot
# Path to temporary directory
#ENROOT_TEMP_PATH ${TMPDIR:-/tmp}
# Gzip program used to uncompress digest layers.
#ENROOT_GZIP_PROGRAM gzip
# Remap the current user to root inside containers by default.
#ENROOT_REMAP_ROOT no
# Maximum number of processors to use for parallel tasks (0 means unlimited).
#ENROOT_MAX_PROCESSORS $(nproc)
# Use a login shell to run the container initialization.
#ENROOT_LOGIN_SHELL yes
# Allow root to retain his superuser privileges inside containers.
#ENROOT_ALLOW_SUPERUSER no
# Use HTTP for outgoing requests instead of HTTPS (UNSECURE!).
ENROOT_ALLOW_HTTP yes
# Include user-specific configuration inside bundles by default.
#ENROOT_BUNDLE_ALL no
# Generate an embedded checksum inside bundles by default.
#ENROOT_BUNDLE_CHECKSUM no
# Always use --force on command invocations.
#ENROOT_FORCE_OVERRIDE no
# SSL certificates settings
#SSL_CERT_DIR
#SSL_CERT_FILE
# Proxy settings
#all_proxy
#no_proxy
#http_proxy
#https_proxy
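Several path entries in this config embed `$(id -u)`, which enroot expands at runtime so that each user gets an isolated runtime/cache/data tree. A small sketch of that expansion (the paths are the ones from the config above):

```shell
# enroot.conf expands $(id -u) when evaluating path entries, giving
# every user a private tree under the shared directories, e.g.:
uid="$(id -u)"
runtime_path="/run/enroot/runtime/${uid}"
cache_path="/run/enroot/cache/${uid}"
echo "runtime: ${runtime_path}"
echo "cache:   ${cache_path}"
```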
[root@master88 ~]# cp -r /cm/images/default-image-node002/usr/lib/enroot /usr/lib/
[root@master88 ~]# /cm/images/default-image-node002/usr/bin/enroot --help
Usage: enroot COMMAND [ARG...]
Command line utility for manipulating container sandboxes.
Commands:
batch [options] [--] CONFIG [COMMAND] [ARG...]
bundle [options] [--] IMAGE
create [options] [--] IMAGE
exec [options] [--] PID COMMAND [ARG...]
export [options] [--] NAME
import [options] [--] URI
list [options]
remove [options] [--] NAME...
start [options] [--] NAME|IMAGE [COMMAND] [ARG...]
version
[root@master88 ~]#
[root@master88 ~]# /cm/images/default-image-node002/usr/bin/enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[ERROR] Command not found: jq
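The `Command not found: jq` error shows that copying only /usr/lib/enroot is not enough: enroot shells out to several external tools that the master does not have. A quick illustrative check — the tool list here is drawn from the RPM dependencies installed later in this thread (jq, parallel) plus curl and mksquashfs, which the transcript shows enroot using; adjust for your version:

```shell
# Illustrative dependency check for tools enroot shells out to.
# The list is an assumption based on this thread's RPM transaction
# and console output, not an exhaustive requirements list.
report=""
for cmd in jq curl parallel mksquashfs; do
    if command -v "$cmd" >/dev/null 2>&1; then
        report="${report}${cmd}: found\n"
    else
        report="${report}${cmd}: MISSING\n"
    fi
done
printf '%b' "$report"
```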
I installed enroot manually on one of the worker nodes, but it is still not working:
[root@node003 ~]# dnf install -y https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot-3.4.1-1.el8.x86_64.rpm
Last metadata expiration check: 0:00:12 ago on Thu 11 May 2023 02:16:17 PM KST.
enroot-3.4.1-1.el8.x86_64.rpm 96 kB/s | 110 kB 00:01
Dependencies resolved.
================================================================================================================================================================================================================
Package Architecture Version Repository Size
================================================================================================================================================================================================================
Installing:
enroot x86_64 3.4.1-1.el8 @commandline 110 k
Installing dependencies:
jq x86_64 1.6-3.el8 appstream 201 k
oniguruma x86_64 6.8.2-2.el8 appstream 186 k
parallel noarch 20190922-1.el8 epel 351 k
Transaction Summary
================================================================================================================================================================================================================
Install 4 Packages
Total size: 849 k
Total download size: 738 k
Installed size: 2.6 M
Downloading Packages:
(1/3): oniguruma-6.8.2-2.el8.x86_64.rpm 597 kB/s | 186 kB 00:00
(2/3): jq-1.6-3.el8.x86_64.rpm 571 kB/s | 201 kB 00:00
(3/3): parallel-20190922-1.el8.noarch.rpm 868 kB/s | 351 kB 00:00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 413 kB/s | 738 kB 00:01
Extra Packages for Enterprise Linux 8 - x86_64 1.6 MB/s | 1.6 kB 00:00
Importing GPG key 0x2F86D6A1:
Userid : "Fedora EPEL (8) <epel@fedoraproject.org>"
Fingerprint: 94E2 79EB 8D8F 25B2 1810 ADF1 21EA 45AB 2F86 D6A1
From : /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-8
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : parallel-20190922-1.el8.noarch 1/4
Installing : oniguruma-6.8.2-2.el8.x86_64 2/4
Running scriptlet: oniguruma-6.8.2-2.el8.x86_64 2/4
Installing : jq-1.6-3.el8.x86_64 3/4
Installing : enroot-3.4.1-1.el8.x86_64 4/4
Running scriptlet: enroot-3.4.1-1.el8.x86_64 4/4
Verifying : jq-1.6-3.el8.x86_64 1/4
Verifying : oniguruma-6.8.2-2.el8.x86_64 2/4
Verifying : parallel-20190922-1.el8.noarch 3/4
Verifying : enroot-3.4.1-1.el8.x86_64 4/4
Installed:
enroot-3.4.1-1.el8.x86_64 jq-1.6-3.el8.x86_64 oniguruma-6.8.2-2.el8.x86_64 parallel-20190922-1.el8.noarch
Complete!
[root@node003 ~]# enroot
enroot enroot-aufs2ovlfs enroot-makeself enroot-mksquashovlfs enroot-mount enroot-nsenter enroot-switchroot
[root@node003 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
I reinstalled the Slurm cluster on BCM (Bright Cluster Manager), but srun still fails with the same error; I don't know what's going on :(
/cm/images/default-image-node002/usr/bin/enroot: line 111: /usr/lib/enroot/common.sh: No such file or directory
Looks like a problem with the enroot installation from BCM. This should fail with all container images.
[root@node003 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
This looks like a problem with the local docker registry, is enroot working with DockerHub images in this setup?
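One likely reading of `curl: (35) ... wrong version number`: the client spoke TLS to a server answering in plain HTTP, which fits a local registry on port 5000. The BCM-generated enroot.conf above sets `ENROOT_ALLOW_HTTP yes`, but a manually installed enroot uses default settings, which are HTTPS-only. A hedged sketch of the override — the environment variable mirrors the config key shown earlier, and the commented import command is the one from this thread:

```shell
# "wrong version number" from curl usually means TLS was attempted
# against a plain-HTTP endpoint. enroot only falls back to HTTP when
# ENROOT_ALLOW_HTTP is enabled (BCM's enroot.conf sets it; a manual
# install does not). Export it to mirror that config entry:
export ENROOT_ALLOW_HTTP=yes
# Then retry the import against the insecure local registry:
#   enroot import 'docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4'
echo "ENROOT_ALLOW_HTTP=${ENROOT_ALLOW_HTTP}"
```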
I copied the files to the mentioned location.
@flx42, I tried with the Docker Hub ubuntu image and it worked. My docker registry works fine with docker pull, so why doesn't it work with enroot? Do I have to add credentials somewhere?
[root@node003 ~]# enroot import docker://ubuntu
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 2 missing layers...
100% 2:0=0s dbf6a9befcdeecbb8813406afbd62ce81394e3869d84599f19f941aa5c74f3d1
[INFO] Extracting image layers...
100% 1:0=0s dbf6a9befcdeecbb8813406afbd62ce81394e3869d84599f19f941aa5c74f3d1
[INFO] Converting whiteouts...
100% 1:0=0s dbf6a9befcdeecbb8813406afbd62ce81394e3869d84599f19f941aa5c74f3d1
[INFO] Creating squashfs filesystem...
Parallel mksquashfs: Using 88 processors
Creating 4.0 filesystem on /root/ubuntu.sqsh, block size 131072.
[=============================================================================================================================================================================================|] 2931/2931 100%
Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
uncompressed data, compressed metadata, compressed fragments, compressed xattrs
duplicates are removed
Filesystem size 56606.98 Kbytes (55.28 Mbytes)
74.29% of uncompressed filesystem size (76199.39 Kbytes)
Inode table size 42820 bytes (41.82 Kbytes)
36.73% of uncompressed inode table size (116577 bytes)
Directory table size 35563 bytes (34.73 Kbytes)
50.03% of uncompressed directory table size (71090 bytes)
Number of duplicate files found 130
Number of inodes 3519
Number of files 2628
Number of fragments 273
Number of symbolic links 212
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 679
Number of ids (unique uids + gids) 1
Number of uids 1
root (0)
Number of gids 1
root (0)
This is a very strange issue: BCM deploys the Slurm cluster with pyxis support successfully, so where is the problem? I am stuck on my project and have exhausted almost all my options.
@flx42, BCM enroot worked :)
[root@node002 ~]# enroot import docker://192.168.61.4:5000#/cosmoflow-nvidia:0.4
[INFO] Querying registry for permission grant
[INFO] Permission granted
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 63 missing layers...
100% 63:0=0s e1471ee6a09e6b27237602eb8eda43e616af5f0ef9261aaab8241b33c545dfbf
[INFO] Extracting image layers...
100% 62:0=0s 35807b77a593c1147d13dc926a91dcc3015616ff7307cc30442c5a8e07546283
[INFO] Converting whiteouts...
100% 62:0=0s 35807b77a593c1147d13dc926a91dcc3015616ff7307cc30442c5a8e07546283
[INFO] Creating squashfs filesystem...
Parallel mksquashfs: Using 128 processors
Creating 4.0 filesystem on /root/+cosmoflow-nvidia+0.4.sqsh, block size 131072.
[=========================================================================================================================================================================================\] 180702/180702 100%
Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
uncompressed data, uncompressed metadata, uncompressed fragments, uncompressed xattrs
duplicates are not removed
Filesystem size 12786712.34 Kbytes (12487.02 Mbytes)
99.92% of uncompressed filesystem size (12796732.07 Kbytes)
Inode table size 3817162 bytes (3727.70 Kbytes)
100.00% of uncompressed inode table size (3817162 bytes)
Directory table size 2836366 bytes (2769.89 Kbytes)
100.00% of uncompressed directory table size (2836366 bytes)
No duplicate files removed
Number of inodes 105469
Number of files 88688
Number of fragments 6878
Number of symbolic links 1493
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 15288
Number of ids (unique uids + gids) 1
Number of uids 1
root (0)
Number of gids 1
root (0)
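With the image imported to a squashfs file, pyxis can start it directly through srun's --container-image flag. The path below is the file produced by the import above; the command to launch is a placeholder, not something from this thread:

```shell
# The import above wrote /root/+cosmoflow-nvidia+0.4.sqsh; pyxis
# accepts a squashfs path via --container-image. The launched
# command is a placeholder -- substitute your own workload.
sqsh="/root/+cosmoflow-nvidia+0.4.sqsh"
#   srun --container-image="$sqsh" nvidia-smi
echo "would launch with image: $sqsh"
```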
Just keep it open for one day and then I will close it.
Not able to run pyxis with enroot using local docker registry images.