NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Clarify intended usage of --container-name #30

Closed sfeltman closed 3 years ago

sfeltman commented 3 years ago

We had been attempting to use --container-name to share enroot containers across Slurm job arrays. This caused a lot of issues due to container PID sharing between array jobs running on the same machine (we didn't know it did this until reading the Pyxis code). While this could be fixed with some sort of option to disable PID sharing, commit https://github.com/NVIDIA/pyxis/commit/a35027cf2ffa45cf702b117d215b1240aa6de22e added a "pyxis_$JOBID" prefix to the container name, which breaks the idea entirely.

Please clarify the intended usage of --container-name. We had been hoping to use it to speed up array jobs that use big containers on the same machine, manually managing the enroot container import directory before/after the job array. Thanks

flx42 commented 3 years ago

Hello @sfeltman, the intent was to save container state across job steps, for example within an sbatch script or a salloc. On our cluster we had a Slurm epilog to manually clean up the named containers at the end of the job, and the commit above was part of a change to move this cleanup logic into Pyxis directly. We didn't want to allow named containers to be shared across different jobs, since it's usually challenging to make sure you land on the same nodes across jobs.

I need to look more into what happens when job arrays are involved; I haven't tested this use case yet. Perhaps there is an unexpected interaction with the SPANK API.

By the way, I don't quite understand what you mean by "PID sharing", could you explain?
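For illustration, the job-step pattern described above might look like this inside an sbatch script (the image and container names here are hypothetical):

```shell
#!/bin/bash
#SBATCH --nodes=1

# The first step imports the image and creates a named container.
srun --container-image=nvcr.io#nvidia/pytorch:21.02-py3 \
     --container-name=mycontainer true

# Later steps within the same job reuse the saved container state by
# name; the image is not imported again.
srun --container-name=mycontainer python train.py
```

With job-scoped names, Pyxis can clean the container up when the job ends; that is the scenario the cleanup logic was written for.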

flx42 commented 3 years ago

@sfeltman I see that the Slurm epilog is called for each job of the job array. So how were you planning to clean up the named containers for this use case? I don't see any way to know when the job array is entirely finished on one node.

sfeltman commented 3 years ago

Hi @flx42,

Thanks for the explanation. I think some of my confusion stems from the command-line help, which makes the feature seem more general-purpose than it is.

With regards to "PID sharing", I meant container PID re-use from a running container with the same name.

In terms of job arrays and cleanup: the idea is to use a new job, dependent on the array job's completion, to do cleanup on any node the array might have run on.
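A sketch of that cleanup pattern, using Slurm's afterany dependency (the script names train_array.sh and cleanup_containers.sh are hypothetical):

```shell
# Submit the array job and capture its job ID.
array_id=$(sbatch --parsable --array=0-99 train_array.sh)

# Submit a follow-up job that runs only after every array task has
# finished (regardless of exit status). It would have to target the
# nodes the array ran on in order to remove the stale containers.
sbatch --dependency=afterany:${array_id} cleanup_containers.sh
```

The hard part, as noted below, is guaranteeing the cleanup job actually lands on the same nodes as the array tasks.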

flx42 commented 3 years ago

> In terms of job arrays and cleanup: the idea is to use a new job, dependent on the array job's completion, to do cleanup on any node the array might have run on.

That seems tricky, making sure the follow-up job runs on exactly the same nodes.

But at the same time, that use case seems similar to https://github.com/NVIDIA/pyxis/issues/28. So I'll consider changing the epilog pyxis config flag to trigger both the pyxis_$jobid container name prefix and the epilog cleanup. In that case you will be able to get the previous behavior by disabling this option.

flx42 commented 3 years ago

@sfeltman I just pushed https://github.com/NVIDIA/pyxis/commit/5a7d9007bb4d540236dea3b4bcfdc1cb6e43c3ec

You should be able to get the previous behavior with a config flag like the following:

$ cat /etc/slurm-llnl/plugstack.conf.d/pyxis.conf
required /usr/local/lib/slurm/spank_pyxis.so container_scope=global

There will still be a pyxis_ prefix, but it will no longer include the job ID.

flx42 commented 3 years ago

Could you also describe the kind of problems you've seen with containers reusing existing PIDs? It just means it will share the container namespaces, is that an issue? I'm wondering if there is a bug lurking here.

sfeltman commented 3 years ago

Hi Felix,

Thanks for the update. Below I've pasted some of the errors we were running into. This was with Pyxis 0.8.1 and enroot 3.1.0. I played with adding a --no-container-pid-reuse option, which fixed the issue. However, this was on top of the master branch, so it may also have been conflated with other changes since 0.8.1...

pyxis: reusing existing container PID
No devices found.
pyxis: reusing existing container PID
slurmstepd: error: pyxis: couldn't join cgroup namespace: Operation not permitted
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
slurmstepd: error: pyxis: child 57362 failed with error code: 1
slurmstepd: error: pyxis: couldn't get list of existing container filesystems
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     NAME  PID  STATE  STARTED  TIME  MNTNS  USERNS  COMMAND
slurmstepd: error: pyxis: couldn't get list of containers
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
pyxis: reusing existing container PID
slurmstepd: error: pyxis: unable to open mount namespace file: No such file or directory
slurmstepd: error: pyxis: couldn't get container attributes
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
pyxis: reusing existing container PID
slurmstepd: error: pyxis: unable to open user namespace file: No such file or directory
slurmstepd: error: pyxis: couldn't get container attributes
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
flx42 commented 3 years ago

Ok, it's probably a race condition between the different jobs here, for instance if the job being joined terminates while the new one is starting up.

sfeltman commented 3 years ago

With regards to sharing the container namespace, does this mean the cgroup resources are actually shared, or are the limit values just copied? With array jobs, each job in the array is an independent job using the same limit values, but with its own CPU/GPU/memory allocation.

flx42 commented 3 years ago

The cgroups should still be per-job, but it will get a bit weird for the jobs reusing the initial container, since they will join the cgroup namespace while being under a cgroup outside of this namespace. I don't think this has a functional impact.

sfeltman commented 3 years ago

I just confirmed that the current HEAD, https://github.com/NVIDIA/pyxis/commit/5a7d9007bb4d540236dea3b4bcfdc1cb6e43c3ec, without any of my changes, still exhibits the problems I mentioned when sharing the container name between array jobs (using the container_scope=global option).

flx42 commented 3 years ago

Yes, this aspect is trickier, and for now I'm not too keen on adding another command-line argument for it, since the main intended use case is named containers with a job-level scope. So you should probably keep carrying your patch for disabling PID sharing, for now :)

avolkov1 commented 3 years ago

Is it possible to just specify a path to the sqsh files?

$ ls ~/enroot_images/
nvcr.io+nvidia+cuda+11.0.3-base-ubuntu18.04.sqsh

I just want to run srun with an option to pyxis/enroot to use that sqsh file.

3XX0 commented 3 years ago

@avolkov1 yes, see https://github.com/NVIDIA/pyxis/wiki/Usage#--container-image
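Concretely, using the local sqsh file from the listing above, the invocation would look something like this (the trailing command is just an illustrative payload):

```shell
srun --container-image=$HOME/enroot_images/nvcr.io+nvidia+cuda+11.0.3-base-ubuntu18.04.sqsh \
     nvidia-smi
```

Passing a filesystem path to --container-image skips the registry import step and uses the pre-built squashfs image directly.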

avolkov1 commented 3 years ago

Oops, sorry. That's simple; I overlooked that part in the docs. Thank you.

flx42 commented 3 years ago

I think this is solved now, closing.

flx42 commented 3 years ago

I mean that we're probably not going to add a knob for disabling PID sharing when a container exists, at least not right now.