Open anhoward opened 3 years ago
We are seeing this with Abaqus. It's worth noting that we are confined to running in UK South, where only the H Series is available. The H Series does not support SR-IOV, which limits us to Intel MPI. When the HC Series (with SR-IOV support) lands later this year, we expect to be able to use the MPI that ships with Abaqus, and we will see whether that allows multi-node jobs to run when Slurm node names do not match hostnames.
It looks like this is now supported with v2.5.0 / v2.5.1
Can this be closed?
Depending on the version of MPI or the ISV code being used, some applications occasionally rely on the Slurm node names, which are not resolvable hostnames. This causes the jobs to fail.
It would be good if the actual hostnames on the nodes and in Azure DNS matched the node names used in Slurm.
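For anyone who wants to confirm they are hitting this, here is a minimal sketch of a check that reports which Slurm node names fail to resolve. The function name `unresolvable_nodes` and the `resolve` parameter are made up for this illustration and are not part of any tool mentioned in this issue. It assumes it runs inside a job allocation where `SLURM_JOB_NODELIST` is set and `scontrol` is on the PATH.

```python
import os
import socket
import subprocess
from typing import Callable, Iterable, List


def unresolvable_nodes(
    node_names: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the node names that cannot be resolved to an address.

    `resolve` is injectable (hypothetical parameter) so the check can be
    exercised without real DNS.
    """
    failed = []
    for name in node_names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


if __name__ == "__main__":
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if nodelist:
        # `scontrol show hostnames` expands a compact hostlist expression
        # into one node name per line.
        expanded = subprocess.run(
            ["scontrol", "show", "hostnames", nodelist],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.split()
        print("Unresolvable Slurm node names:", unresolvable_nodes(expanded))
    else:
        print("SLURM_JOB_NODELIST is not set; run this inside a Slurm job.")
```

If the printed list is non-empty, any MPI or ISV launcher that connects to nodes by their Slurm node names is likely to fail in the way described above.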