BenWibking opened this issue 9 months ago
So it doesn't look like there's great handling of GPUs in the Slurm adapter at the moment, despite there being a hook for adding the `gpus=..` bit to the header, which I think passes through from the step's `gpus:` key.
And just to better understand the final use case: are you also looking to have, say, 4 different tasks (or srun's) inside this step, one per GPU, or would you prefer to keep each one separate and pack the allocation with many jobs using an embedded Flux instance? Either way, it looks like we'll need to wire up some extra hooks/pass-throughs for these GPU-related args in the Slurm adapter. I think we could also add 'c' and 'g' flags to the new launcher syntax if you want more independent control of multiple $(LAUNCHER) tokens in a step (see this new style launcher).
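For reference, a rough sketch of the kind of step I'm picturing; the step name, executable, and resource counts below are placeholders, and I'm assuming the step-level `gpus:` key is what would feed the `gpus=..` bit in the generated header:

```yaml
study:
  - name: run-sim                       # placeholder step name
    description: Single MPI job using all GPUs on each node
    run:
      cmd: |
        $(LAUNCHER) ./my_app in.params  # placeholder executable and input
      nodes: 2                          # example resource counts
      procs: 8
      gpus: 8                           # the key the Slurm adapter would need to pass through
```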
The use case for this job step is just a single MPI job across 1+ nodes. (Other workflow steps are CPU-only, so they need a different binding/ntasks-per-node, but for now, that's a separate issue.)
The somewhat nonstandard options are just to get the right mapping of NUMA domains to GPUs due to the weird topology on this system, plus a workaround to avoid cgroup resource isolation being applied to the GPUs (since that prevents CUDA IPC from working between MPI ranks).
Using

```
#SBATCH --gpu-bind=none
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
```

might accomplish the same binding, but I haven't tested that yet. Is there a built-in way to specify this alternative set of SLURM options?
No, it doesn't look like there's a better way built in to set extra/unknown sbatch options than what you're currently doing by putting them at the top of your step cmd.
Will have to look into exposing more of these options/bindings across the script adapters. Other than the options in your initial snippet, are there any others among the numerous sbatch/srun options you'd be interested in?
Not that I can think of. The above examples should cover it.
Is there a recommended way to specify extra SLURM options for GPU bindings?
I tried using the `args:` batch block key (https://maestrowf.readthedocs.io/en/latest/Maestro/scheduling.html), but the options did not get propagated to the *.sh job script. Following https://github.com/LLNL/maestrowf/issues/340, the workaround I've used so far is to specify these options as part of the run command so that they get copied into the job script:
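Roughly, that looks like the sketch below; the step name, executable, and resource counts are placeholders, and the #SBATCH lines are the example GPU flags from this thread standing in for the real set:

```yaml
study:
  - name: run-sim                       # placeholder step name
    run:
      cmd: |
        #SBATCH --gpu-bind=none
        #SBATCH --ntasks-per-node=4
        #SBATCH --gpus-per-node=4
        $(LAUNCHER) ./my_app in.params  # placeholder executable and input
      nodes: 2                          # example resource counts
      procs: 8
```

Since these lines sit at the very top of cmd, they end up directly below the #SBATCH header that Maestro generates, so (as far as I can tell) sbatch still picks them up.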