Closed OguzPastirmaci closed 1 year ago
As far as I know, you can't do that with PMI2 / PMIx, but that's not specific to enroot or pyxis, that is also the case on bare-metal.
You have to make sure that the topology defined in slurm.conf gives you the right ordering of ranks. You can set Weight (https://slurm.schedmd.com/slurm.conf.html#OPT_Weight) to influence the ordering of nodes made by Slurm.
@flx42, the topology file is defined in our Slurm, but the nodes are still listed in alphabetical order in the hostfile generated by srun, rather than nodes on the same switch being grouped together.
If you mean topology.conf, I don't think it influences the node ordering. I believe what matters is the order in which you declare nodes in slurm.conf with NodeName lines.
Ok, sorry but I was wrong above, my memory was fuzzy on this topic. The ordering of switches (or nodes, or the weights of nodes) will only impact node selection, it will not reorder the selected nodes. I actually described this situation a while ago here: https://bugs.schedmd.com/show_bug.cgi?id=9521
Note that my patch in https://bugs.schedmd.com/show_bug.cgi?id=9521 still works after the trivial fix of changing "." to "->" for node records; see the new patch for Slurm 22.05: 0001-topology-tree-assign-a-unique-rank-to-nodes-on-the-s.patch
With the following in topology.conf:
SwitchName=leaf01 Nodes=node-1,node-4
SwitchName=leaf02 Nodes=node-3,node-2
SwitchName=spine Switches=leaf01,leaf02
By default, without my patch, the numbering (using SLURM_NODEID here) will not follow the topology:
$ srun -N4 bash -c 'printf "%d - %s\n" $SLURM_NODEID $SLURMD_NODENAME' | sort
0 - node-1
1 - node-2
2 - node-3
3 - node-4
With my patch:
$ srun -N4 bash -c 'printf "%d - %s\n" $SLURM_NODEID $SLURMD_NODENAME' | sort
0 - node-1
1 - node-4
2 - node-2
3 - node-3
From the slurmctld log:
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-1 rank=1
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-4 rank=1
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-2 rank=2
slurmctld: topology/tree: topo_generate_node_ranking: topo_generate_node_ranking: node=node-3 rank=2
Something else to try would be to modify NodeName and NodeHostname. I can't test this on a single-node setup with multiple slurmd, but it might work with something like the following:
NodeName=node-0001 NodeHostname=zzz
NodeName=node-0002 NodeHostname=aaa
NodeName=node-0003 NodeHostname=ccc
NodeName=node-0004 NodeHostname=bbb
I like the patch, that's exactly what I'd want to see from Slurm. Did the patch make it into any release? The solution we have found is to use an sbatch file to generate the hostfile, reorder it, and export the SLURM_HOSTFILE env variable. If you call srun from within the sbatch, you can use --distribution=arbitrary and it will keep the reordered hostfile.
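For anyone landing here later, a minimal sketch of that workaround, not the exact script used above. The helper name order_by_switch, the hosts.txt filename, the /etc/slurm/topology.conf path and ./my_app are all made up for illustration. It assumes topology.conf lists nodes explicitly (comma-separated, no bracketed ranges like node-[1-4]) and that the topology file is not empty:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): group allocated hosts by leaf switch.
#   $1    : topology.conf-style file with explicit comma-separated Nodes= lists
#   stdin : allocated hostnames, one per line
# Hosts come out grouped by switch, in the order switches appear in the file.
# Hosts not found under any switch are appended at the end.
order_by_switch() {
  awk '
    NR == FNR {
      # First input: the topology file. Remember the leaf switch of each node.
      if ($0 !~ /Nodes=/) next
      sw = ""; nodes = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^SwitchName=/) sw = substr($i, 12)
        if ($i ~ /^Nodes=/)      nodes = substr($i, 7)
      }
      if (!(sw in seen)) { seen[sw] = 1; order[++nsw] = sw }
      n = split(nodes, arr, ",")
      for (i = 1; i <= n; i++) leaf[arr[i]] = sw
      next
    }
    {
      # Second input (stdin): bucket each allocated host under its switch.
      if ($1 in leaf) bucket[leaf[$1]] = bucket[leaf[$1]] $1 "\n"
      else            unknown[++nu] = $1
    }
    END {
      for (s = 1; s <= nsw; s++) printf "%s", bucket[order[s]]
      for (u = 1; u <= nu; u++)  print unknown[u]
    }
  ' "$1" -
}

# Usage inside an sbatch script (paths and app name are assumptions):
#   scontrol show hostnames "$SLURM_JOB_NODELIST" \
#     | order_by_switch /etc/slurm/topology.conf > hosts.txt
#   export SLURM_HOSTFILE="$PWD/hosts.txt"
#   srun --distribution=arbitrary ./my_app
```

With the topology.conf from earlier in this thread and the hosts node-1 through node-4, this should produce the same grouping as the patched numbering. As far as I understand the arbitrary distribution, the hostfile needs one line per task, so a host has to be repeated if you run more than one task per node.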
No, it didn't, unfortunately. I will check if I can get this conversation restarted.
Awesome, thanks!
Closing this issue then.
Thanks for the SLURM_HOSTFILE trick, I was not aware of this approach, that's neat.
Happy to report that the patch was merged in Slurm 23.02, behind a slurm.conf flag: https://github.com/SchedMD/slurm/blob/slurm-23-02-0-1/NEWS#L27-L29
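A config sketch of what enabling it might look like. The flag name is not given in this thread; I believe it is the SwitchAsNodeRank option of TopologyParam, but treat that as an assumption and check the linked NEWS entry and the slurm.conf man page for your version:

```
# slurm.conf, Slurm 23.02 or later
# Flag name below is an assumption, verify it against the NEWS entry linked above
TopologyPlugin=topology/tree
TopologyParam=SwitchAsNodeRank
```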
That is awesome! Thanks @flx42
When using mpirun directly, I can pass the --rankfile argument to use an ordered rankfile. How can I do that with srun with Pyxis/Enroot? I tried creating the below hook for Enroot but it didn't work:
This is the srun block in my submission script: