NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Cannot join running container executing sshd #45

Closed flx42 closed 3 years ago

flx42 commented 3 years ago

Description

@3XX0 reported the following issue:

$ srun --container-image=ubuntu --container-name=ubuntu sh -c 'apt-get update && apt-get install -y openssh-server'
$ srun --container-name=ubuntu --no-container-remap-root sshd -d -p 2222

From another terminal, we can't exec into this existing container:

$ srun --overlap --jobid=${JOBID} --container-name=ubuntu --pty bash
slurmstepd: error: pyxis: failed to set  2222 [listener] 0 of 10-100 startups: Bad argument
slurmstepd: error: pyxis: couldn't read container environment

Root cause

pyxis notices that the container named ubuntu is already running, so it reuses the PID of the running container process for both the namespaces and the environment. However, sshd rewrites its own process title. On BSD it can call setproctitle(3), but on Linux it has to overwrite the memory that holds argv and environ in place, so the procfs file /proc/<PID>/environ no longer contains valid NAME=value entries:

$ cat /proc/480286/environ
 2222 [listener] 0 of 10-100 startupsp

pyxis therefore fails to import the environment from this file, which produces the two errors shown in the description.
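To make the failure mode concrete, here is a minimal Python sketch of the mechanism described above. It is a simplified model, not OpenSSH or pyxis code: `overwrite_title` and `parse_environ` are hypothetical helpers, and the argv and environment values are assumptions chosen for illustration. The model treats argv and environ as one contiguous byte area, writes the new title over it, and then tries to read the environ part back as NUL-separated NAME=value entries.

```python
def overwrite_title(argv, environ, title):
    """Model an in-place process title rewrite over the argv + environ area.

    Simplified model (assumption): argv and environ strings are laid out
    back to back, each NUL-terminated, and the title is written from the
    start of argv with the rest of the area padded with NUL bytes.
    """
    argv_area = b"".join(arg + b"\0" for arg in argv)
    env_area = b"".join(var + b"\0" for var in environ)
    region = bytearray(argv_area + env_area)
    # Truncate so that at least one trailing NUL remains, then pad with NULs.
    new_title = title[: len(region) - 1]
    region[:] = new_title + b"\0" * (len(region) - len(new_title))
    # Return what the argv part and the environ part now contain.
    return bytes(region[: len(argv_area)]), bytes(region[len(argv_area):])


def parse_environ(data):
    """Parse NUL-separated NAME=value entries, rejecting malformed ones."""
    env = {}
    for entry in data.split(b"\0"):
        if not entry:
            continue  # skip empty fields left by padding or the final NUL
        name, sep, value = entry.partition(b"=")
        if not sep or not name:
            raise ValueError("malformed environ entry: %r" % entry)
        env[name.decode()] = value.decode()
    return env


if __name__ == "__main__":
    # Hypothetical argv and environment for the sshd process.
    argv = [b"/usr/sbin/sshd", b"-d", b"-p", b"2222"]
    environ = [b"PATH=/usr/sbin:/usr/bin:/sbin:/bin", b"HOME=/root"]
    title = b"sshd: /usr/sbin/sshd -d -p 2222 [listener] 0 of 10-100 startups"

    argv_part, env_part = overwrite_title(argv, environ, title)
    print(repr(env_part))
    try:
        parse_environ(env_part)
    except ValueError as exc:
        print("import failed:", exc)
```

With these assumed values the title is longer than the original argv area, so its tail spills into the environ area and the first entry there no longer contains an `=`, which is the kind of entry a strict importer has to reject.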

Workaround

Launch sshd as a child of an sh process, so that the container's main process is sh and keeps a valid environ:

$ srun --container-name=ubuntu --no-container-remap-root sh -c '/usr/sbin/sshd -d -p 2222'

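This does not work with bash, because bash implicitly execs a single command passed to -c, so sshd replaces the shell process instead of running below it.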

Fix

Use the existing PID only for the namespaces, and always run a new enroot start to get the environment variables.
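The reason the PID can still be used for the namespaces is that the entries under /proc/<PID>/ns are kernel-managed handles rather than views of the process's own memory, so a rewritten process title does not affect them. A small Python sketch of that half of the fix follows. `namespace_paths` is a hypothetical helper, and the default set of namespaces (user and mnt) is an assumption, not taken from the pyxis source.

```python
def namespace_paths(pid, kinds=("user", "mnt")):
    """Return the procfs namespace handles for an existing process.

    These paths can be opened and passed to setns(2). Unlike
    /proc/<PID>/environ, they do not depend on the memory that the
    process may have overwritten with a new title.
    """
    return ["/proc/%d/ns/%s" % (pid, kind) for kind in kinds]


if __name__ == "__main__":
    # PID taken from the example in this issue.
    for path in namespace_paths(480286):
        print(path)
```

The environment is then obtained separately from a fresh enroot start, as described above, instead of being read from the running process.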