NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Want to use MPS with Pyxis #27

Closed: cponder closed this issue 3 years ago

cponder commented 3 years ago

I find Pyxis to be a very easy way to manage multi-node MPI runs. But I need to be able to use MPS to share GPUs between processes. Is there a way to do this under Pyxis? I've tried some hacks, but either they didn't work or my app ran slower than without them. I suspect the solution is to run MPS outside the container (at the srun level) and somehow pass virtual GPUs to the container processes. But I wouldn't know how to do that.

3XX0 commented 3 years ago

It depends on whether MPS is managed by SLURM or not; if not, it's just a matter of starting the daemon either in a prolog or inside your container entrypoint. There isn't much more to it. If the app runs slower, it's a problem with the app rather than MPS per se. You can always try to tune things and see if it gets better; IIRC, things like CUDA_DEVICE_MAX_CONNECTIONS and CPU pinning are important.
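For illustration, a minimal per-node prolog along those lines could look like the sketch below (untested; the CUDA_MPS_* paths are just the usual MPS defaults, nothing Pyxis-specific):

    #!/bin/bash
    # Untested per-node prolog sketch: start the MPS control daemon before the
    # job's tasks launch. Adjust the directories for your site.
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
    # If a control daemon is already running on the node, the second start
    # attempt fails; the || true keeps the prolog from aborting in that case.
    nvidia-cuda-mps-control -d || true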

cponder commented 3 years ago

What I'd like to do is put a flag on the srun command line, so that it divides the number of processes by the number of GPUs and arranges the sharing accordingly. There is already the --gpus-per-node flag for varying the number of GPUs in use, e.g. for scaling studies.

3XX0 commented 3 years ago

You can use the MPS implementation of SLURM (https://slurm.schedmd.com/gres.html) if you want to control the allocation of MPS shares to the jobs of a given user. But you can easily do the same thing with a prolog/job_submit/epilog and --comment for the flag.

If you want to control each process and what they see, you can easily manipulate them using a combination of (SLURM_STEP_*, SLURM_LOCALID, etc) and (CUDA_VISIBLE_DEVICES, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, etc).
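As a rough, untested sketch (the wrapper name, the app name, and the share calculation are only placeholders), a per-task wrapper could look like this:

    #!/bin/bash
    # Untested per-task wrapper sketch: pin each local task to one GPU and give
    # it an equal MPS share. SLURM_LOCALID and SLURM_NTASKS_PER_NODE come from
    # Slurm (the latter when --ntasks-per-node is set); the GPU count is read
    # from nvidia-smi.
    NGPUS=$(nvidia-smi -L | wc -l)
    export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % NGPUS ))
    export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=$(( 100 * NGPUS / SLURM_NTASKS_PER_NODE ))
    exec "$@"

launched with something like srun --ntasks-per-node=8 ./mps_wrap.sh ./my_app.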

This isn't really related to Pyxis though. Pyxis merely containerizes the process being launched and doesn't do anything w.r.t resource allocation/selection. Having said that, you could definitely implement the above as a SPANK plugin instead.

cponder commented 3 years ago

To comment #2

         if the app runs slower, it's a problem with the app rather than MPS per se.

it looks like MPS is not really engaging; my app is just falling through to using

          Compute Mode                          : Default

so, yes, the processes are sharing the GPU, but in a serialized way. This is a problem with the setup rather than with how the application operates. And to comment #4, I don't see any reference to MPS here:

    fgrep -i mps -r /etc/slurm

so I'm guessing that the MPS GRES hasn't been installed, right? Pyxis is an issue because it isolates node resources from the containerized environment; the MPS daemon on the node is not visible inside the container, for example. I expect it's mostly a matter of making the pipe files accessible.
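If it does come down to the pipe files, I imagine something like bind-mounting the default pipe directory into the container would be a workaround (untested; /tmp/nvidia-mps is just the MPS default, and --container-mounts is the pyxis flag I'd use):

    # Untested: expose the host's MPS pipe directory inside the container.
    # Check CUDA_MPS_PIPE_DIRECTORY on the host if a non-default path is used.
    srun --container-image=<image> \
         --container-mounts=/tmp/nvidia-mps:/tmp/nvidia-mps \
         ./my_app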

3XX0 commented 3 years ago

so I'm guessing that the MPS GRES hasn't been installed, right?

That's right

Pyxis is an issue because of its isolation of node resources from the containerized environment. The MPS daemon on the node is not visible inside the container, for example. I expect it's mostly an issue of mapping the pipe-files to be accessible.

It should be accessible; if not, this is a problem with enroot/libnvidia-container. Are you sure it is properly running on the GPU assigned to your job?

You can also start it from within the container or the entrypoint; it shouldn't matter. Just make sure your container has the compute capability in NVIDIA_DRIVER_CAPABILITIES.
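For example, an entrypoint along these lines might work (untested sketch; the capability check is only illustrative):

    #!/bin/bash
    # Untested entrypoint sketch: warn if the compute capability is missing,
    # start the MPS control daemon, then hand off to the actual command.
    case "${NVIDIA_DRIVER_CAPABILITIES:-}" in
        all|*compute*) : ;;
        *) echo "warning: NVIDIA_DRIVER_CAPABILITIES does not include compute" >&2 ;;
    esac
    nvidia-cuda-mps-control -d || true
    exec "$@"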

cponder commented 3 years ago

It looks like I'm able to just start the MPS daemon from each process that I'm running, and its internal synchronization ensures that only one copy of the daemon is created per node. This way, for each process, the daemon is guaranteed to have been started before any MPI or GPU activity begins. The MPS developers indicate that this is (currently) the expected behavior. So I'm going to close this for now.
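In sketch form (untested as written here, and the app name is a placeholder), that amounts to something like:

    #!/bin/bash
    # Every rank tries to start the control daemon; the daemon's own
    # synchronization leaves exactly one instance per node, and the losing
    # attempts fail harmlessly.
    nvidia-cuda-mps-control -d 2>/dev/null || true
    exec ./my_app "$@"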

cponder commented 3 years ago

One thing to add -- I've toyed with scripts that start the daemon in the "prolog" phase, but I end up with a varying number of daemons per node. I'm not sure why that happens, but the app seems to run OK in spite of it. In my mind it would make sense to have a node-wise prolog, but per-task and per-job are the only ones available. I could complain about these things to the SLURM developers, but they'd likely just direct me to use the MPS GRES instead.

cponder commented 3 years ago

And also, for the record, the GPUs on all the nodes can be put into exclusive mode from inside the sbatch script:

 srun -N $SLURM_JOB_NUM_NODES --ntasks-per-node 1 sudo nvidia-smi -c EXCLUSIVE_PROCESS

Note that (1) this hardware setting is persistent; a subsequent command

 srun -N $SLURM_JOB_NUM_NODES --ntasks-per-node 1  nvidia-smi -q -d COMPUTE

will show the setting. Note also (2) that the setting cannot be made from inside a container; trying that showed a permission error:

  2: pyxis: importing docker image ...
  2: Unable to set the compute mode for GPU 00000000:07:00.0: Insufficient Permissions
  2: Terminating early due to previous errors.
                   ...

That may just need some permission settings inside the container, but that's a moot point since it's easier to just run without it.