LLNL / maestrowf

A tool to easily orchestrate general computational workflows both locally and on supercomputers
https://maestrowf.readthedocs.io
MIT License

SLURM: specifying extra arguments for GPU binding #436

Open BenWibking opened 9 months ago

BenWibking commented 9 months ago

Is there a recommended way to specify extra SLURM options for GPU bindings?

I tried using the args: key in the batch block (https://maestrowf.readthedocs.io/en/latest/Maestro/scheduling.html), but the options did not get propagated to the generated *.sh job script.
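For concreteness, this is roughly what I tried. Everything except the args: key is a placeholder for this system, and the layout under args: is just my reading of the scheduling docs, so treat it as a sketch:

    batch:
        type: slurm
        host: mysystem       # placeholder host name
        bank: myaccount      # placeholder bank/account
        queue: batch         # placeholder queue
        args:                # extra sbatch options per the docs; these never showed up in the generated *.sh
            gpu-bind: none
            gpus-per-task: 1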

Following https://github.com/LLNL/maestrowf/issues/340, the workaround I've used so far is to specify these options as part of the run command so that they get copied into the job script:

    - name: run-sim
      description: Run the simulation.
      run:
          cmd: |
              #SBATCH --mem=0
              #SBATCH --constraint="scratch"
              #SBATCH --ntasks-per-node=4
              #SBATCH --cpus-per-task=16
              #SBATCH --gpus-per-task=1
              #SBATCH --gpu-bind=none

              srun bash -c "
                  export CUDA_VISIBLE_DEVICES=\$((3-SLURM_LOCALID));
                  $(BINARY_PATH) -i $(generate-infile.workspace)/params.in" > logfile.txt
          depends: [generate-infile]
          nodes: 1
          exclusive: True
          walltime: "00:10:00"
jwhite242 commented 9 months ago

So it doesn't look like there's great handling of GPUs in the SLURM adapter at the moment, despite there being a hook for adding the gpus=... bit to the header, which I think passes through the step's 'gpus:' key alongside nodes/procs/etc. It looks like the only other option explicitly supported is 'cores per task'. Also note that these are decoupled a bit in the script adapters: the header applies to the entire batch job (along with the batch block keys), while many of the keys attached to the step get applied independently to each srun when using the $(LAUNCHER) syntax, which has some limited support for specifying procs/nodes per launcher invocation.

And just to better understand the final use case: are you looking to run, say, 4 different tasks (or sruns) inside this step, one per GPU, or would you prefer to keep each one separate and pack the allocation with many jobs using an embedded Flux instance? Either way, it looks like we'll need to wire up some extra hooks/pass-throughs for these GPU-related args in the SLURM adapter. I think we could also add 'c' and 'g' flags to the new launcher syntax if you want more independent control of multiple $(LAUNCHER) tokens in a step (see this new-style launcher).
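For reference, the step-level keys I'm talking about look roughly like this; whether 'gpus:' actually makes it into the header with the SLURM adapter today is exactly what's in question, so treat this as a sketch rather than something verified:

    - name: run-sim
      description: Run the simulation via the LAUNCHER token.
      run:
          cmd: |
              # $(LAUNCHER) expands to an srun call built from the step keys below
              $(LAUNCHER) $(BINARY_PATH) -i $(generate-infile.workspace)/params.in > logfile.txt
          depends: [generate-infile]
          nodes: 1
          procs: 4
          gpus: 4               # hook exists, but pass-through is limited right now
          cores per task: 16
          exclusive: True
          walltime: "00:10:00"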

BenWibking commented 9 months ago

The use case for this job step is just a single MPI job across 1+ nodes. (Other workflow steps are CPU-only, so they need a different binding/ntasks-per-node, but for now, that's a separate issue.)

The somewhat nonstandard options are just to get the right mapping of NUMA domains to GPUs due to the weird topology on this system, plus a workaround to avoid cgroup resource isolation being applied to the GPUs (since that prevents CUDA IPC from working between MPI ranks).

Using

#SBATCH --gpu-bind=none
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4

might accomplish the same binding, but I haven't tested that yet. Is there a built-in way to specify this alternative set of SLURM options?

jwhite242 commented 9 months ago

No, it doesn't look like there's a better built-in way to set extra/unknown sbatch options than what you're currently doing by putting them at the top of your step cmd.

We'll have to look into exposing more of these options/bindings across the script adapters. Other than the options in your initial snippet, are there any other sbatch/srun options you would be interested in?
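In other words, your alternative option set would go in the same place as your current workaround, i.e. at the top of the step cmd. An untested sketch based on your snippet above:

    - name: run-sim
      description: Run the simulation.
      run:
          cmd: |
              #SBATCH --gpu-bind=none
              #SBATCH --ntasks-per-node=4
              #SBATCH --gpus-per-node=4

              srun $(BINARY_PATH) -i $(generate-infile.workspace)/params.in > logfile.txt
          depends: [generate-infile]
          nodes: 1
          exclusive: True
          walltime: "00:10:00"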

BenWibking commented 9 months ago

> Other than the options in your initial snippet, are there any other sbatch/srun options you would be interested in?

Not that I can think of. The above examples should cover it.