NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

failed to create user runtime path #4

Closed biocyberman closed 4 years ago

biocyberman commented 4 years ago

After deviating slightly from the installation guide, pyxis seems to work:


srun --help|grep container
      --container-image=[USER@][REGISTRY#]IMAGE[:TAG]
      --container-mounts=SRC:DST[,SRC:DST...]
                              [pyxis] bind mount[s] inside the container
      --container-workdir=PATH
                              [pyxis] working directory inside the container
      --container-name=NAME   [pyxis] name to use for saving and loading the
                              container on the host. Unnamed containers are
                              containers are not. If a container with this name
                              already exists, the existing container is used and

However, this command failed:


srun --container-image=centos grep NAME /etc/os-release
srun: job xxxx queued and waiting for resources
srun: job xxxx has been allocated resources
srun: error: task 0 launch failed: Plugin initialization failed

Slurm log in /var/log/slurmd.log on the compute node has these lines:

[2019-09-24T10:10:06.901] [321.0] error: pyxis: couldn't mkdir /run/pyxis/100005: No such file or directory
[2019-09-24T10:10:06.901] [321.0] error: spank: required plugin spank_pyxis.so: init_post_opt() failed with rc=-1
[2019-09-24T10:10:06.901] [321.0] error: Plugin stack initialization failed.

/run is owned by root and not writable by normal users. So, should there be a configuration option to set the RUNTIME_PATH to somewhere else?

https://github.com/NVIDIA/pyxis/blob/a521ac06839660cebe482d4cf1a16792e8815f8f/common.h#L7

biocyberman commented 4 years ago

I changed two lines in common.h to these:

#define PYXIS_RUNTIME_PATH "/tmp/pyxis"
#define PYXIS_USER_RUNTIME_PATH "/tmp/pyxis/%d"

After running make install again, I got a bit further:

srun --container-image=centos grep NAME /etc/os-release
srun: job 323 queued and waiting for resources
srun: job 323 has been allocated resources
slurmstepd: pyxis: running "enroot import" ...
slurmstepd: error: pyxis: child 33628 failed with error code: 1
slurmstepd: error: pyxis: enroot import failed
slurmstepd: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     mktemp: failed to create directory via template ‘/scratch/enroot/tmp/10005/enroot.XXXXXXXXXX’: No such file or directory
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init_privileged() failed with rc=-1
slurmstepd: error: spank_task_init_privileged failed

This is because I have ENROOT_TEMP_PATH set to /scratch/enroot/tmp/${UID}. After changing it to /scratch/enroot, I got this:

srun --container-image=centos grep NAME /etc/os-release
srun: job XXXX queued and waiting for resources
srun: job XXXX has been allocated resources
slurmstepd: pyxis: running "enroot import" ...
slurmstepd: pyxis: running "enroot create" ...
slurmstepd: pyxis: running "enroot start" ...
slurmstepd: error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
NAME="CentOS Linux"
PRETTY_NAME="CentOS Linux 7 (Core)"
CPE_NAME="cpe:/o:centos:centos:7"

So it seems to be working, but it is only a workaround at the moment. I still wonder what this error means:

slurmstepd: error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).

flx42 commented 4 years ago

Hello @biocyberman, thanks for the bug report!

The following function should be called in the slurmd context (as root), and it creates /run/pyxis: https://github.com/NVIDIA/pyxis/blob/master/pyxis_slurmd.c#L17-L42 Could you check whether this directory exists before the issue arises? If it does, what are the permissions on it? If it doesn't exist, you might need to restart the slurmd service for this function to be called. Looks like I need to add this to the documentation.
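
A quick way to answer both questions on the compute node is sketched below. The helper name describe_dir is made up for illustration, the stat format string assumes GNU coreutils, and the restart command in the trailing comment assumes slurmd runs as a systemd unit named slurmd, which may not match your site.

```shell
#!/bin/sh
# describe_dir is a hypothetical helper, not part of pyxis or Slurm.
# Prints "missing" if the path is not a directory, otherwise
# "owner:group mode" (GNU stat syntax).
describe_dir() {
    if [ -d "$1" ]; then
        stat -c '%U:%G %a' "$1"
    else
        echo "missing"
    fi
}

# Path taken from the slurmd log in this issue.
describe_dir /run/pyxis

# If this prints "missing", restarting slurmd should recreate the directory,
# e.g. (assuming a systemd unit named slurmd):
#   sudo systemctl restart slurmd
```

Run it on the node where the job step failed, not on the login node, since /run is local to each machine.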

Afterwards, for each job, the plugin will attempt to create directory /run/pyxis/<uid> here: https://github.com/NVIDIA/pyxis/blob/master/pyxis_slurmstepd.c#L353-L379 As the comment indicates, this is done as root and then chowned to the user. This is the part that is failing for you. If restarting slurmd doesn't solve it, it might be a permission issue.
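
The per-job step described above can be approximated in shell as follows. This is only a sketch of the described behavior, not the plugin's code: make_user_dir is a hypothetical name, the 0700 mode is an assumption, and the real plugin does this in C as root. Like a plain mkdir call, the sketch does not create missing parents, which is how a missing /run/pyxis turns into "No such file or directory".

```shell
#!/bin/sh
# make_user_dir BASE UID: hypothetical helper mirroring "mkdir, then chown".
# Mode 0700 is an assumption made for illustration.
make_user_dir() {
    base=$1
    uid=$2
    # No -p on purpose: this fails if BASE does not exist.
    mkdir -m 0700 "$base/$uid" || return 1
    # Hand the directory to the job's user
    # (needs root unless UID is the caller's own uid).
    chown "$uid" "$base/$uid"
}

# Demo in a scratch location, using the caller's own uid so no root is needed.
scratch=$(mktemp -d)
if make_user_dir "$scratch/pyxis" "$(id -u)" 2>/dev/null; then
    echo "unexpected: parent was missing"
else
    echo "failed: $scratch/pyxis does not exist yet"
fi
mkdir "$scratch/pyxis"   # stands in for what slurmd is expected to do at startup
make_user_dir "$scratch/pyxis" "$(id -u)" && echo "created $scratch/pyxis/$(id -u)"
rm -rf "$scratch"
```

The first call fails because the parent directory is absent; once the parent exists, the same call succeeds.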

But you are right that we should probably make it configurable, since any directory with the sticky bit set would work.
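
For anyone evaluating an alternative location in the meantime, a sketch like the following checks whether a candidate directory has the sticky bit set. has_sticky_bit is a hypothetical helper, and it assumes a shell whose test builtin supports -k (bash and dash do); the directories in the loop are only examples.

```shell
#!/bin/sh
# has_sticky_bit DIR: hypothetical helper; succeeds if DIR is a directory
# with the sticky bit set (as /tmp usually is, with mode 1777).
has_sticky_bit() {
    [ -d "$1" ] && [ -k "$1" ]
}

# Example candidates; adjust to the paths you are considering.
for dir in /tmp /run; do
    if has_sticky_bit "$dir"; then
        echo "$dir: sticky bit set"
    else
        echo "$dir: no sticky bit"
    fi
done
```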

flx42 commented 4 years ago

Also, the "slurmstepd: error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures)." message is probably unrelated to pyxis; I've never seen it on my side.

flx42 commented 4 years ago

@biocyberman I've documented the missing step (restarting slurmd), thanks for the bug report!