NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Pyxis failing on Bright Cluster Manager: failed to create opaque ovlfs whiteout #87

Closed karanveersingh5623 closed 2 years ago

karanveersingh5623 commented 2 years ago

Hi, I am enabling pyxis + enroot on Bright Cluster Manager, following this knowledge-base article:

https://kb.brightcomputing.com/knowledge-base/using-enroot-and-pyxis-in-bright-cluster-manager/

My runtime is in the local home folder. Below are the details of my enroot.conf:

[root@bright88 burst-buffer]# cat /etc/enroot/enroot.conf
#ENROOT_LIBRARY_PATH        /usr/lib/enroot
#ENROOT_SYSCONF_PATH        /etc/enroot
ENROOT_RUNTIME_PATH        /home/enroot/runtime
#ENROOT_CONFIG_PATH         ${XDG_CONFIG_HOME}/enroot
#ENROOT_CACHE_PATH          ${XDG_CACHE_HOME}/enroot
ENROOT_DATA_PATH           /home/enroot/data
ENROOT_TEMP_PATH           /home/enroot/tmp

# Gzip program used to uncompress digest layers.
#ENROOT_GZIP_PROGRAM        gzip

# Options passed to zstd to compress digest layers.
#ENROOT_ZSTD_OPTIONS        -1

# Options passed to mksquashfs to produce container images.
#ENROOT_SQUASH_OPTIONS      -comp lzo -noD

# Make the container root filesystem writable by default.
#ENROOT_ROOTFS_WRITABLE     no

# Remap the current user to root inside containers by default.
#ENROOT_REMAP_ROOT          no

# Maximum number of processors to use for parallel tasks (0 means unlimited).
#ENROOT_MAX_PROCESSORS      $(nproc)

# Maximum number of concurrent connections (0 means unlimited).
#ENROOT_MAX_CONNECTIONS     10

# Maximum time in seconds to wait for connections establishment (0 means unlimited).
#ENROOT_CONNECT_TIMEOUT     30

# Maximum time in seconds to wait for network operations to complete (0 means unlimited).
#ENROOT_TRANSFER_TIMEOUT    0

# Number of times network operations should be retried.
#ENROOT_TRANSFER_RETRIES    0

# Use a login shell to run the container initialization.
#ENROOT_LOGIN_SHELL         yes

# Allow root to retain his superuser privileges inside containers.
#ENROOT_ALLOW_SUPERUSER     no

# Use HTTP for outgoing requests instead of HTTPS (UNSECURE!).
ENROOT_ALLOW_HTTP          yes

# Include user-specific configuration inside bundles by default.
#ENROOT_BUNDLE_ALL          no

# Generate an embedded checksum inside bundles by default.
#ENROOT_BUNDLE_CHECKSUM     no

# Mount the current user's home directory by default.
#ENROOT_MOUNT_HOME          no

# Restrict /dev inside the container to a minimal set of devices.
#ENROOT_RESTRICT_DEV        no

# Always use --force on command invocations.
#ENROOT_FORCE_OVERRIDE      no

# SSL certificates settings:
#SSL_CERT_DIR
#SSL_CERT_FILE

# Proxy settings:
#all_proxy
#no_proxy
#http_proxy
#https_proxy
#ENROOT_RUNTIME_PATH        ${HOME}/enroot

Below is the command I am running:

[root@bright88 burst-buffer]# ENROOT_ALLOW_HTTP=yes srun -N 1 -G 1 --ntasks=1 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh
pyxis: importing docker image ...

slurmstepd: error: pyxis: child 46128 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     [INFO] Querying registry for permission grant
slurmstepd: error: pyxis:     [INFO] Permission granted
slurmstepd: error: pyxis:     [INFO] Fetching image manifest list
slurmstepd: error: pyxis:     [INFO] Fetching image manifest
slurmstepd: error: pyxis:     [INFO] Found all layers in cache
slurmstepd: error: pyxis:     [INFO] Extracting image layers...
slurmstepd: error: pyxis:     [INFO] Converting whiteouts...
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/1/tmp/ompi.14ff54ba6d26.0/: Not supported
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/2/usr/lib/firmware/: Not supported
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/3/root/.cache/pip/http/0/2/: Not supported
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/4/workspace/cosmoflow/configs/: Not supported
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/5/root/.cache/pip/http/1/b/: Not supported
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/7/workspace/cosmoflow/: Not supported
slurmstepd: error: pyxis:     enroot-aufs2ovlfs: failed to create opaque ovlfs whiteout: /home/enroot/tmp/enroot.qrMDQzRJ5Z/8/etc/infiniband-diags/: Not supported
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node001: task 0: Exited with exit code 1
[root@bright88 burst-buffer]# sudo make setcap
make: *** No rule to make target `setcap'.  Stop.
[root@bright88 burst-buffer]# arch=$(uname -m)
[root@bright88 burst-buffer]# yum install -y https://github.com/NVIDIA/enroot/releases/download/v3.4.0/enroot+caps-3.4.0-2.el7.${arch}.rpm
Loaded plugins: fastestmirror, priorities
enroot+caps-3.4.0-2.el7.x86_64.rpm                                                                                                                             | 3.5 kB  00:00:00
Examining /var/tmp/yum-root-ODvr9o/enroot+caps-3.4.0-2.el7.x86_64.rpm: enroot+caps-3.4.0-2.el7.x86_64
/var/tmp/yum-root-ODvr9o/enroot+caps-3.4.0-2.el7.x86_64.rpm: does not update installed package.
Error: Nothing to do
[root@bright88 burst-buffer]# getcap /usr/bin/enroot-aufs2ovlfs
/usr/bin/enroot-aufs2ovlfs = cap_sys_admin,cap_mknod+ep

flx42 commented 2 years ago

Is /home/enroot on an NFS filesystem? Extended attributes might not be supported there. I think you need an NFS server that supports extended attributes, NFSv4, and a recent Linux kernel for this to have a chance of working. Maybe @3XX0 can document the filesystem requirements in the enroot documentation.
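
One way to check this directly is to try setting an overlayfs-style opaque marker on a scratch directory under the enroot temp path. The following is a minimal sketch, not something taken from enroot itself. It assumes the failing step sets the `trusted.overlay.opaque` extended attribute (the attribute overlayfs uses for opaque directories; the thread does not confirm this is the exact call), that `setfattr` from the attr package is installed, and that it runs as root, since `trusted.*` attributes require CAP_SYS_ADMIN. The function name `probe_opaque_xattr` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: test whether a directory's filesystem accepts the xattr that
# overlayfs uses to mark opaque directories (assumed: trusted.overlay.opaque).
# Needs root (CAP_SYS_ADMIN) and setfattr from the "attr" package.
#
# Return codes (hypothetical convention for this sketch):
#   0  the attribute was set successfully
#   1  the attribute could not be set (unsupported filesystem or missing privileges)
#   2  the probe itself could not run (no setfattr, or no scratch directory)

probe_opaque_xattr() {
    local dir="$1" scratch rc
    if ! command -v setfattr >/dev/null 2>&1; then
        echo "setfattr not found" >&2
        return 2
    fi
    # Create the scratch directory on the same filesystem as the path under test.
    scratch=$(mktemp -d -p "$dir" xattr-probe.XXXXXX) || return 2
    if setfattr -n trusted.overlay.opaque -v y "$scratch" 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    rmdir "$scratch"
    return "$rc"
}
```

For the configuration in this issue, it could be called as `probe_opaque_xattr /home/enroot/tmp` on the compute node. A return code of 1 there would be consistent with the "Not supported" errors in the log, although it does not distinguish an unsupported filesystem from a permissions problem.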

Also, I don't recommend using an NFS directory for enroot, for performance reasons.
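
To see which of the configured enroot paths sit on a network filesystem, something like the sketch below may help. It relies on GNU `stat -f -c %T` to print the filesystem type. The helper names (`fs_type`, `is_network_fs`, `report_enroot_paths`) are hypothetical, and the pattern list only covers the two filesystems mentioned in this thread, so treat it as illustrative rather than exhaustive.

```shell
#!/usr/bin/env bash
# Sketch: print the filesystem type behind each given path and flag
# network filesystems. The pattern list is an assumption, not exhaustive.

fs_type() {
    # GNU stat: -f inspects the filesystem, %T prints its type name.
    stat -f -c %T "$1"
}

is_network_fs() {
    case "$1" in
        nfs*|lustre) return 0 ;;
        *) return 1 ;;
    esac
}

report_enroot_paths() {
    # Returns 1 if any path is missing or on a network filesystem, else 0.
    local path fstype rc=0
    for path in "$@"; do
        if ! fstype=$(fs_type "$path" 2>/dev/null); then
            echo "$path: cannot stat"
            rc=1
        elif is_network_fs "$fstype"; then
            echo "$path: $fstype (network filesystem)"
            rc=1
        else
            echo "$path: $fstype"
        fi
    done
    return "$rc"
}
```

With the enroot.conf from this issue, the call would be `report_enroot_paths /home/enroot/runtime /home/enroot/data /home/enroot/tmp`, run on the compute node where the job lands, since mounts can differ between the head node and the compute nodes.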

karanveersingh5623 commented 2 years ago

@flx42, thanks Felix. Yes, I forgot that NFS and Lustre are not good options for enroot runtimes. Thanks for the info.