NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

slurmstepd: error: pyxis: enroot-nsenter: failed to create user namespace: Permission denied #37

Closed pvaldria closed 3 years ago

pvaldria commented 3 years ago

Hello Pyxis team,

We are using Pyxis and Enroot on Oracle Linux 7.9 and we are getting an error.

Environment:
Hardware: bare metal, 8x A100
OS: PRETTY_NAME="Oracle Linux Server 7.9"
Kernel: 5.4.17-2036.100.6.1.el7uek.x86_64
GPU driver: NVIDIA-SMI 450.51.06, Driver Version: 450.51.06, CUDA Version: 11.0
Slurm: slurm-20.11.2-1.el7.x86_64

Error: After installing Pyxis, we ran a test job, and it failed with a user namespace permission error.

Problem state:

$ srun --container-image=/nfs/cluster/pyxis_test/enroot_image/centos.sqsh grep PRETTY /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     enroot-nsenter: failed to create user namespace: Permission denied
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: inst-foysz-vocal-drake: task 0: Exited with exit code 1

Check the current settings

$ ./enroot-check_*.run --verify
Kernel version:
Linux version 5.4.17-2036.100.6.1.el7uek.x86_64 (mockbuild@jenkins-172-17-0-2-980b5770-aa59-43ca-ab82-ac28f1d4376e) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39.0.3) (GCC)) #2 SMP Thu Oct 29 17:04:48 PDT 2020
Kernel configuration:
CONFIG_NAMESPACES                 : OK
CONFIG_USER_NS                    : OK
CONFIG_SECCOMP_FILTER             : OK
CONFIG_OVERLAY_FS                 : OK (module)
CONFIG_X86_VSYSCALL_EMULATION     : OK
CONFIG_VSYSCALL_EMULATE           : KO (required if glibc <= 2.13)
CONFIG_VSYSCALL_NATIVE            : KO (required if glibc <= 2.13)

Kernel command line:

vsyscall=native                   : KO (required if glibc <= 2.13)
vsyscall=emulate                  : KO (required if glibc <= 2.13)

Kernel parameters:

user.max_user_namespaces          : OK
user.max_mnt_namespaces           : OK

Extra packages:

nvidia-container-cli              : OK
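
The "Kernel parameters" section of the report above can be re-checked by hand. The following is a minimal sketch, not part of enroot-check: the function names are hypothetical, and it only assumes that the two limits are exposed as integer files under /proc/sys/user, as they are on kernels with user namespace support.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from enroot or pyxis.

# Return 0 if the given file holds an integer greater than zero.
sysctl_positive() {
    local value
    value=$(cat "$1" 2> /dev/null) || return 1
    [[ "$value" =~ ^[0-9]+$ ]] && [ "$value" -gt 0 ]
}

# Report the two limits listed under "Kernel parameters".
# The base directory is a parameter (default: /proc/sys/user) so the
# function can also be pointed at a copy of the files.
check_namespace_limits() {
    local base="${1:-/proc/sys/user}" name status=0
    for name in max_user_namespaces max_mnt_namespaces; do
        if sysctl_positive "$base/$name"; then
            echo "$name: OK"
        else
            echo "$name: KO"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_namespace_limits` once from a login shell and once inside a job step (for example through `srun` without `--container-image`) may help show whether the two environments see the same limits. Where util-linux provides it, `unshare --user true` goes one step further and actually attempts to create a user namespace.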

We updated the kernel parameters and rebooted. After the reboot:

[root@inst-foysz-vocal-drake ~]# sudo grubby --info /boot/vmlinuz-5.4.17-2036.100.6.1.el7uek.x86_64
index=1
kernel=/boot/vmlinuz-5.4.17-2036.100.6.1.el7uek.x86_64
args="ro crashkernel=auto LANG=en_US.UTF-8 console=tty0 console=ttyS0,9600 rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 rd.iscsi.bypass=1 rd.net.timeout.carrier=5 rd.net.timeout.dhcp=10 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi iscsi_param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 ipmi_si.trydefaults=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 ip=dhcp nouveau.modeset=0 rd.driver.blacklist=nouveau namespace.unpriv_enable=1 user_namespace.enable=1"
root=UUID=a46bac1b-7608-4d7c-b8c6-a3abca37be9f
initrd=/boot/initramfs-5.4.17-2036.100.6.1.el7uek.x86_64.img
title=Oracle Linux Server 7.9, with Unbreakable Enterprise Kernel 5.4.17-2036.100.6.1.el7uek.x86_64
[root@inst-foysz-vocal-drake ~]#

Investigation: I started looking at the enroot source code and found bundle.sh, which has logic to validate whether the kernel parameters were set.
https://github.com/NVIDIA/enroot/blob/master/src/bundle.sh#L145:L155
You will see it only checks for “centos7|rhel7”. Should this be updated to add Oracle Linux, like this: “centos7|rhel7|ol7”?

On OCI, the same check returns ol7.9:
$ source /etc/os-release 2> /dev/null; echo "${ID-}${VERSION_ID-}"
ol7.9

I ran a sample “enroot” test, shown below, once with the default bundle.sh and once with bundle.sh modified to include “ol7*”. It worked both times.

The sample is standalone, so it helps to isolate an enroot issue from Slurm or from Pyxis invoking enroot.
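
To illustrate why a check keyed on “centos7|rhel7” would skip Oracle Linux, here is a minimal sketch of that kind of pattern match. It is not the actual bundle.sh code: the function names are hypothetical, and the glob form (`centos7*`, `rhel7*`, `ol7*`) is an assumption based on the “ol7*” pattern mentioned in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not copied from enroot's bundle.sh.

# Return 0 if the ID+VERSION_ID string matches the CentOS/RHEL 7 patterns.
is_el7_like() {
    case "$1" in
        centos7*|rhel7*) return 0 ;;
        *) return 1 ;;
    esac
}

# Same check with the proposed Oracle Linux pattern added.
is_el7_like_with_ol() {
    case "$1" in
        centos7*|rhel7*|ol7*) return 0 ;;
        *) return 1 ;;
    esac
}

# Build the distribution string the same way as the os-release command
# in this report. The file path is a parameter (default: /etc/os-release);
# the subshell keeps the sourced variables out of the caller's environment.
distro_string() {
    ( . "${1:-/etc/os-release}" 2> /dev/null; echo "${ID-}${VERSION_ID-}" )
}
```

With these definitions, `is_el7_like "$(distro_string)"` fails on a host that reports ol7.9, while `is_el7_like_with_ol "$(distro_string)"` succeeds.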

Usage example:

# Import and start an Ubuntu image from DockerHub
$ enroot import docker://ubuntu
$ enroot create ubuntu.sqsh
$ enroot start ubuntu

I am not an enroot or pyxis expert, but since standalone enroot works, I assume the issue is either in how pyxis is configured to use enroot or in pyxis following a code path that differs from the simple example above.

How was pyxis installed and configured?

Since an RPM package is not currently provided, we installed from source on Oracle Linux: we cloned the source from the Pyxis GitHub repository and built it.

$ git clone https://github.com/NVIDIA/pyxis.git
$ cd pyxis
$ sudo make install
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic -D_GNU_SOURCE
cc -shared -Wl,-znoexecstack -Wl,-zrelro -Wl,-znow -o spank_pyxis.so spank_pyxis.lds common.strip --strip-unneeded -R .comment spank_pyxis.so
install -d -m 755 /usr/local/lib/slurm
install -m 644 spank_pyxis.so /usr/local/lib/slurm
install -d -m 755 /usr/local/share/pyxis
echo 'required /usr/local/lib/slurm/spank_pyxis.so' | install -m 644 /dev/stdin /usr/local/
$ sudo mkdir /etc/slurm/plugstack.conf.d
$ ln -s /usr/local/share/pyxis/pyxis.conf /etc/slurm/plugstack.conf.d/pyxis.conf
# systemctl restart slurmd

The installation guide on GitHub registers the SPANK plugin with Slurm by putting the path of the built spank_pyxis.so library in /etc/slurm/plugstack.conf.d/pyxis.conf. Although this path was set correctly, Slurm did not seem to recognize the plugin. As an alternative, we created a plugstack.conf file in the Slurm config directory that explicitly specifies the library path of spank_pyxis.so.

Install guide on GitHub (has a problem):

# ln -s /usr/local/share/pyxis/pyxis.conf /etc/slurm/plugstack.conf.d/pyxis.conf
# cat /usr/local/share/pyxis/pyxis.conf
required /usr/local/lib/slurm/spank_pyxis.so
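
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Slurm reads plugstack.conf itself, and files under plugstack.conf.d are only picked up when plugstack.conf contains an `include` line that points at that directory. The sketch below adds such a line if it is missing. The function name is hypothetical, and /etc/slurm is assumed to be the Slurm config directory, as elsewhere in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of pyxis or Slurm.
# Append an include line for plugstack.conf.d to plugstack.conf,
# unless that exact line is already present.
ensure_plugstack_include() {
    local conf_dir="${1:-/etc/slurm}"
    local line="include ${conf_dir}/plugstack.conf.d/*"
    touch "${conf_dir}/plugstack.conf" || return 1
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "$line" "${conf_dir}/plugstack.conf" ||
        echo "$line" >> "${conf_dir}/plugstack.conf"
}
```

Run as root, `ensure_plugstack_include /etc/slurm` followed by a restart of slurmd would keep the symlinked pyxis.conf from the guide in use, instead of replacing it with a hand-written plugstack.conf.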

Our modified install steps (work well):

Master node

# vi /etc/slurm/plugstack.conf
required /usr/local/lib/slurm/spank_pyxis.so
# systemctl restart slurmctld && systemctl status slurmctld
Worker Nodes
# for i in {1..2}; do ssh node$i "mkdir /usr/local/lib/slurm"; done
# for i in {1..2}; do scp /usr/local/lib/slurm/spank_pyxis.so node$i:/usr/local/lib/slurm/; done
# for i in {1..2}; do scp /etc/slurm/plugstack.conf node$i:/etc/slurm/; done
# for i in {1..4}; do ssh node$i "systemctl restart slurmd && systemctl status slurmd"; done
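
The worker-node loops above use two different ranges ({1..2} and {1..4}). As a minimal sketch, the per-node steps can be generated from a single node count so that every node receives all four steps. The function name is hypothetical, the node names node1..nodeN follow this report, and `mkdir -p` replaces `mkdir` so that a rerun does not fail on an existing directory. The function only prints the commands and does not run them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the deployment commands for node1..nodeN.
# Nothing is executed; the output is meant to be reviewed first.
plan_worker_deploy() {
    local count="$1" i
    for ((i = 1; i <= count; i++)); do
        echo "ssh node$i 'mkdir -p /usr/local/lib/slurm'"
        echo "scp /usr/local/lib/slurm/spank_pyxis.so node$i:/usr/local/lib/slurm/"
        echo "scp /etc/slurm/plugstack.conf node$i:/etc/slurm/"
        echo "ssh node$i 'systemctl restart slurmd && systemctl status slurmd'"
    done
}
```

For example, `plan_worker_deploy 2` lists the steps for node1 and node2. Once reviewed, the output can be piped to `bash` from a root shell on the master node.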

$ cat /etc/enroot/enroot.conf 
#ENROOT_LIBRARY_PATH        /usr/lib/enroot
#ENROOT_SYSCONF_PATH        /etc/enroot
ENROOT_RUNTIME_PATH        /tmp/enroot/user-$(id -u)
#ENROOT_CONFIG_PATH         ${XDG_CONFIG_HOME}/enroot
#ENROOT_CACHE_PATH          ${XDG_CACHE_HOME}/enroot
ENROOT_DATA_PATH           /tmp/enroot-data/user-$(id -u)
#ENROOT_TEMP_PATH           ${TMPDIR:-/tmp}

We had to change the enroot paths from /run to /tmp; otherwise it fails with:
slurmstepd: error: pyxis: mkdir: cannot create directory ‘/run/enroot’: Permission denied
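
Whether a given base directory would work for the current user can be tested before editing enroot.conf. The following is a minimal sketch with a hypothetical function name. It mirrors the two path layouts from the enroot.conf shown above (`<base>/enroot/user-<uid>` and `<base>/enroot-data/user-<uid>`) and is not part of enroot.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to create the per-user runtime and data
# directories under a base directory and report the result.
prepare_enroot_dirs() {
    local base="${1:-/tmp}" uid dir
    uid=$(id -u)
    for dir in "$base/enroot/user-$uid" "$base/enroot-data/user-$uid"; do
        # -m 700 applies to the final directory only; parents use the umask
        if mkdir -p -m 700 "$dir" 2> /dev/null; then
            echo "OK: $dir"
        else
            echo "FAILED: $dir"
            return 1
        fi
    done
}
```

Run as an unprivileged user, `prepare_enroot_dirs /run` is expected to fail in the same way as the pyxis error above when /run is not writable for that user, while `prepare_enroot_dirs /tmp` should succeed.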

flx42 commented 3 years ago

Hi @pvaldria. This is a duplicate of https://github.com/NVIDIA/enroot/issues/61 (which was initially filed against pyxis, but I moved it to enroot).

Closing this one.