NVIDIA / pyxis

Container plugin for Slurm Workload Manager

pyxis: couldn't chdir to container cwd: Stale file handle #18

Closed coolinger closed 3 years ago

coolinger commented 4 years ago

Hi,

We are running with an NFS-mounted $HOME. Since Slurm is running with cgroups and without logind, I had to change the ENROOT_* paths in /etc/enroot/enroot.conf:

ENROOT_RUNTIME_PATH ${HOME}/enroot/runtime
ENROOT_CONFIG_PATH ${HOME}/enroot/config
ENROOT_CACHE_PATH ${HOME}/enroot/cache
ENROOT_DATA_PATH ${HOME}/enroot/data

enroot alone works:

srun --pty --reservation=enroot_installation bash -l
i2dl@node10:~$ enroot start python36 df
Filesystem                                          1K-blocks       Used  Available Use% Mounted on
nfs:/storage/local/i2dl/enroot/data/python36       9561422848 2408323072 6671160320  27% /
tmpfs                                               197432036          0  197432036   0% /sys/fs/cgroup
devtmpfs                                            197397600          0  197397600   0% /dev
tmpfs                                               197432036          0  197432036   0% /dev/shm
tmpfs                                               197432036          0  197432036   0% /tmp
tmpfs                                               197432036          0  197432036   0% /run
tmpfs                                               197432036          0  197432036   0% /run/lock
overlay                                             197432036     554028  196878008   1% /etc/hosts
overlay                                             197432036     554028  196878008   1% /etc/hostname
tmpfs                                               197432036    3621192  193810844   2% /etc/resolv.conf
nfs:/storage/local/i2dl/enroot/data/python36/.lock 9561422848 2408323072 6671160320  27% /.lock

pyxis does not work:

srun --container-image=python:3.6 --container-name=python36 --reservation=enroot_installation df
slurmstepd-node10: pyxis: reusing existing container filesystem
slurmstepd-node10: pyxis: starting container ...
slurmstepd-node10: error: pyxis: couldn't chdir to container cwd: Stale file handle
slurmstepd-node10: error: spank: required plugin spank_pyxis.so: task_init_privileged() failed with rc=-1
slurmstepd-node10: error: spank_task_init_privileged failed
srun: error: node10: task 0: Exited with exit code 1

If I move ${HOME}/enroot/data to a non-NFS path and symlink it, pyxis works.

But since enroot itself works inside the NFS path, I suspect this is an issue with pyxis, and I hope a solution is possible.

Thanks Quirin

flx42 commented 4 years ago

Hi Quirin,

It's possible that the approach we use in pyxis won't work with an NFS file handle. I will try to test it when I have time.

While ENROOT_CACHE_PATH can probably live on NFS (to share pulled container layers), the other paths are a bit risky. For instance, sharing ENROOT_DATA_PATH will create conflicts when a multi-node job tries to create a container named "foo" on all of its nodes at the same time, as sketched below.
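
For illustration, a two-node job like the following (the image and container name are just placeholders) would have both nodes race to create the same container directory under the shared ENROOT_DATA_PATH at job start:

# Hypothetical reproducer: with ENROOT_DATA_PATH on a shared NFS, both
# nodes try to create the named container "foo" at the same path at once.
srun --nodes=2 --container-image=ubuntu:20.04 --container-name=foo true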

In our case, we use a custom enroot config in /etc/enroot/enroot.conf:

ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
ENROOT_CACHE_PATH /raid/enroot-cache/group-$(id -g)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)

And then, in a Slurm prolog, we mkdir/chown these directories (to work around the fact that /run/user/ is not mounted by logind), for instance:

# Resolve the per-user runtime path as the job user would expand it.
runtime_path="$(sudo -u "$SLURM_JOB_USER" sh -c 'echo "/run/enroot/user-$(id -u)"')"
# Create it and hand ownership to the job user, private to them.
mkdir -p "$runtime_path"
chown "$SLURM_JOB_UID:$(id -g "$SLURM_JOB_UID")" "$runtime_path"
chmod 0700 "$runtime_path"

This comes from the Slurm enroot DeepOps role: https://github.com/NVIDIA/deepops/blob/6b57cd1ccc80b74d671039b7609525ae70a9f8e8/roles/slurm-perf/templates/etc/slurm/prolog.d/50-all-enroot-dirs
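
The same recipe extends to the other directories. A sketch (not taken from the role, just following the same pattern) for the ENROOT_DATA_PATH from the config above:

# Resolve and create the per-user data path, mirroring the runtime_path stanza.
data_path="$(sudo -u "$SLURM_JOB_USER" sh -c 'echo "/tmp/enroot-data/user-$(id -u)"')"
mkdir -p "$data_path"
chown "$SLURM_JOB_UID:$(id -g "$SLURM_JOB_UID")" "$data_path"
chmod 0700 "$data_path"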

On one of our clusters we have plenty of RAM, so /tmp is a tmpfs, which makes all container operations very fast. In other cases, we store the container filesystems (ENROOT_DATA_PATH) on the local RAID array; it is slower but doesn't consume RAM.

coolinger commented 4 years ago

Thanks for the reply; the prolog scripts are a very welcome inspiration. Our cluster is set up with netboot nodes that have plenty of RAM, combining their local SSDs into a Ceph cluster and using CephFS. I will now set up enroot so that it uses the prolog-created runtime path, keeps the container data in a per-user CephFS directory, and shares the cache over NFS (a sketch follows below). I will strongly suggest that users create their containers with enroot before running on Slurm, and only use --container-name in srun.
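
A sketch of the enroot.conf this plan implies (the CephFS and NFS paths are hypothetical; the $(id -u)/$(id -g) substitutions follow the config shown above):

ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
ENROOT_DATA_PATH /cephfs/user-$(id -u)/enroot/data
ENROOT_CACHE_PATH /nfs/enroot-cache/group-$(id -g)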

flx42 commented 4 years ago

I will strongly suggest that users create their containers with enroot before running on Slurm, and only use --container-name in srun.

Why? Do you want to have a single container filesystem across all nodes?

coolinger commented 4 years ago

My idea was to use enroot/pyxis to provide different runtime environments for users' code when needed: for example, old code needing Ubuntu 16.04 libraries and new code needing 20.04, while the Slurm nodes all run 18.04 right now. The code itself lives in the user's $HOME, which is mounted into the container. As I initially understood enroot, the containers would be chroot-like systems, and with them being read-only, there should be no problem running the same container from multiple nodes? I somehow assumed the containers themselves would use an overlayfs and not persist changes to the rootfs...

flx42 commented 4 years ago

As I initially understood enroot, the containers would be chroot-like systems, and with them being read-only, there should be no problem running the same container from multiple nodes?

It was not exactly designed for this use case. The enroot "hooks" might become confused here.

In our case, we have one copy of the container filesystem per node. It is also possible to store a squashfs image on a shared filesystem and pass it to pyxis, but it will still be extracted locally on each node, as shown below. Our use case does require writing to the container filesystem.
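
For instance (the image path here is hypothetical):

# The squashfs image lives on a shared filesystem, but pyxis still
# unpacks it into the node-local ENROOT_DATA_PATH before starting.
srun --container-image=/shared/images/python36.sqsh df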

I somehow assumed the containers themselves would use an overlayfs and not persist changes to the rootfs...

This is possible with enroot by running enroot start on a squashfs file directly, without an enroot create: https://github.com/NVIDIA/enroot/blob/master/doc/cmd/start.md#starting-container-images This is not exposed in pyxis today, so using enroot directly might be a better fit for your use case as of today. See the sketch below.
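
A minimal sketch of that workflow (the paths are hypothetical):

# Import the image once, e.g. onto a shared filesystem:
enroot import -o /shared/python36.sqsh docker://python:3.6
# Start it straight from the squashfs file, with no enroot create;
# changes to the rootfs are not persisted across runs:
enroot start /shared/python36.sqsh df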

coolinger commented 4 years ago

It was not exactly designed for this use case. The enroot "hooks" might become confused here.

Yep. That happened just now, when usage started to rise and a second job was started on the same node.

This is possible with enroot by running enroot start on a squashfs file directly, without an enroot create: https://github.com/NVIDIA/enroot/blob/master/doc/cmd/start.md#starting-container-images This is not exposed in pyxis today, so using enroot directly might be a better fit for your use case as of today.

I have switched to doing that now.

Thanks for your time, and keep up the good work. Should I open an issue as a feature request for my use case?

flx42 commented 4 years ago

Thanks for your time, and keep up the good work. Should I open an issue as a feature request for my use case?

Sure, why not! I can't make any guarantees about when (if ever!) we'll get to it; we'll see how our needs evolve internally.

Thanks