NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Using a rankfile with srun from a container #102

Closed. OguzPastirmaci closed this issue 1 year ago.

OguzPastirmaci commented 1 year ago

When using mpirun directly, I can pass the --rankfile argument to use an ordered rankfile. How can I do that with srun and Pyxis/Enroot? I tried creating the hook below for Enroot, but it didn't work:

#! /bin/bash

echo "OMPI_MCA_orte_launch_agent=enroot start ${ENROOT_ROOTFS##*/} orted --rankfile <path to rankfile>" >> "${ENROOT_ENVIRON}"

This is the srun block in my submission script:

srun --mpi=pmi2 --gpus-per-node=8 \
     --ntasks-per-node=8 \
     --container-image="/nfs/scratch/pytorch:22.12-py3.sqsh" \
     --container-mounts="/home/opc:/nccl,$LOCAL_MPI:$LOCAL_MPI" \
     bash -c "
     source /opt/openmpi-4.1.4/bin/mpivars.sh &&
     /nccl/nccl-tests/build/all_reduce_perf -b1G -e10G -i$((1024*1024*1024*9)) -n 100
     "
flx42 commented 1 year ago

As far as I know, you can't do that with PMI2 / PMIx, but that's not specific to enroot or pyxis; it's also the case on bare metal.

You have to make sure that the topology defined in slurm.conf gives you the right ordering of ranks. You can set the node Weight (https://slurm.schedmd.com/slurm.conf.html#OPT_Weight) to influence the ordering of nodes made by Slurm.
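
For example, setting weights looks like this in slurm.conf (node names and weight values are placeholders); Slurm prefers lower-weight nodes when selecting nodes for a job:

# slurm.conf: nodes on the first leaf switch get a lower weight
NodeName=node-1 Weight=10
NodeName=node-2 Weight=10
NodeName=node-3 Weight=20
NodeName=node-4 Weight=20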

arnaudfroidmont commented 1 year ago

@flx42, the topology file is defined in our Slurm setup, but the nodes are still listed in alphabetical order in the hostfile generated by srun, rather than grouping nodes on the same switch close to each other.

flx42 commented 1 year ago

If you mean topology.conf, I don't think it influences the node ordering. I believe what matters is the order in which you declare nodes in slurm.conf with NodeName lines.

flx42 commented 1 year ago

Ok, sorry, I was wrong above; my memory was fuzzy on this topic. The ordering of switches (or nodes, or the node weights) only impacts node selection; it will not reorder the selected nodes. I actually described this situation a while ago here: https://bugs.schedmd.com/show_bug.cgi?id=9521

flx42 commented 1 year ago

Note that my patch in https://bugs.schedmd.com/show_bug.cgi?id=9521 still works after a trivial fix (changing . to -> for node records); see the new patch for Slurm 22.05: 0001-topology-tree-assign-a-unique-rank-to-nodes-on-the-s.patch

With the following in topology.conf:

SwitchName=leaf01  Nodes=node-1,node-4
SwitchName=leaf02  Nodes=node-3,node-2
SwitchName=spine   Switches=leaf01,leaf02

By default, without my patch, the numbering (using SLURM_NODEID here) will not follow the topology:

$ srun -N4 bash -c 'printf "%d - %s\n" $SLURM_NODEID $SLURMD_NODENAME' | sort
0 - node-1
1 - node-2
2 - node-3
3 - node-4

With my patch:

$ srun -N4 bash -c 'printf "%d - %s\n" $SLURM_NODEID $SLURMD_NODENAME' | sort
0 - node-1
1 - node-4
2 - node-2
3 - node-3

From the slurmctld log:

slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-1 rank=1
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-4 rank=1
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-2 rank=2
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-3 rank=2
flx42 commented 1 year ago

Something else to try would be to modify NodeName and NodeHostname. I can't test this on a single-node setup with multiple slurmd instances, but it might work with something like the following:

NodeName=node-0001 NodeHostname=zzz
NodeName=node-0002 NodeHostname=aaa
NodeName=node-0003 NodeHostname=ccc
NodeName=node-0004 NodeHostname=bbb
arnaudfroidmont commented 1 year ago

I like the patch, that's exactly what I'd want to see from Slurm. Did the patch make it into any release? The solution we have found is to use an sbatch script to generate the hosts, reorder the hostfile, and export the SLURM_HOSTFILE env variable. If you call srun from within the sbatch script, you can use --distribution=arbitrary and it will keep the reordered hostfile.
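
Roughly, the sbatch script looks like the sketch below. The switch-sorting step is site-specific, so sort_by_switch.sh is just a placeholder for our reordering logic, and if I remember the arbitrary-distribution semantics correctly, the hostfile needs one line per task:

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8

# Expand the allocation into one hostname per line.
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.unsorted

# Reorder so that nodes on the same leaf switch are adjacent
# (placeholder for site-specific logic).
./sort_by_switch.sh hosts.unsorted > hosts.sorted

# srun --distribution=arbitrary follows SLURM_HOSTFILE in order,
# so repeat each host once per task slot on that node.
awk '{for (i = 0; i < 8; i++) print}' hosts.sorted > hosts.by_task
export SLURM_HOSTFILE=$PWD/hosts.by_task

# 32 tasks = 4 nodes x 8 tasks per node
srun --ntasks=32 --distribution=arbitrary ./my_app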

flx42 commented 1 year ago

No, it didn't, unfortunately. I will check if I can get this conversation restarted.

arnaudfroidmont commented 1 year ago

Awesome, thanks!

flx42 commented 1 year ago

Closing this issue then.

Thanks for the SLURM_HOSTFILE trick, I was not aware of this approach. That's neat.

flx42 commented 1 year ago

Happy to report that the patch was merged in Slurm 23.02, behind a slurm.conf flag: https://github.com/SchedMD/slurm/blob/slurm-23-02-0-1/NEWS#L27-L29
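
If I remember the option name from the NEWS entry correctly, it's enabled with TopologyParam=SwitchAsNodeRank in slurm.conf (please double-check against the linked NEWS file), something like:

# slurm.conf (Slurm >= 23.02); option name from memory, verify in the NEWS entry
TopologyPlugin=topology/tree
TopologyParam=SwitchAsNodeRank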

arnaudfroidmont commented 1 year ago

That is awesome! Thanks @flx42