access-ci-org / Jetstream_Cluster

Scripts and Ansible Playbooks for building an HPC-style resource in Jetstream

Problem running multi-node MPI jobs #13

Open cwsmith opened 1 year ago

cwsmith commented 1 year ago

Hello,

I'm hitting problems running MPI jobs that require more than one node using the system install of OpenMPI 4.1.1 on Rocky 8.6. Specifically, the following script runs successfully on a single 2-core m3.small node when submitted with sbatch -n 2 -t 5 ./run.sh:

#!/bin/bash -ex
# write the hostnames of the allocated nodes to a per-job hostfile
hosts=hostfile.${SLURM_JOBID}
srun hostname > $hosts
# launch one MPI rank per allocated task using that hostfile
mpirun -n ${SLURM_NPROCS} -hostfile $hosts ./helloWorld
echo "done"

but it fails when requesting four cores (sbatch -n 4 -t 5 ./run.sh), which spans two nodes, with the following message:

$ cat slurm-55.out 
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/opt/ohpc/admin/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ hosts=hostfile.55
+ srun hostname
+ mpirun -n 4 -hostfile hostfile.55 ./helloWorld
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: gkeyll-vc-test00-compute-2
  PID:        15413
--------------------------------------------------------------------------
[gkeyll-vc-test00-compute-1.novalocal:15657] 3 more processes have sent help message help-mpi-btl-tcp.txt / server accept cannot find guid
[gkeyll-vc-test00-compute-1.novalocal:15657] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The source for the helloWorld binary is:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int worldSize, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // each rank contributes 1, so the all-reduced sum should equal the number of ranks
  int local = 1;
  int global;
  MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  fprintf(stderr, "%d\n", rank);
  MPI_Finalize();
  // exit 0 only if the all-reduce produced the expected sum
  return (global != worldSize);
}

and was compiled with mpicxx helloWorld.cc -o helloWorld.

I also tried running with srun <binary> <args>, but it appears that OpenMPI was not built with Slurm/PMI support.
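For reference, one way to check what each side supports (a rough sketch; the exact component names and output depend on how the OpenMPI and Slurm packages were built):

# OpenMPI components related to Slurm or PMI/PMIx
ompi_info | grep -Ei 'slurm|pmi'
# PMI plugin types this Slurm install can offer to srun
srun --mpi=list

Direct launch with srun generally only works when both sides share a common PMI interface (e.g. pmi2 or pmix).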

A quick google search on the error message led me to this discussion: https://github.com/open-mpi/ompi/issues/8257#issue-751169260

cwsmith commented 1 year ago

Given the error server accept cannot find guid and the comment about IP addresses needing to be unique here: https://github.com/open-mpi/ompi/issues/8257#issuecomment-734356200, I took a look at the IP addresses of the compute nodes and saw that the virbr0 interface has the same address (192.168.122.1) on every node.

[exouser@gkeyll-vc-test00-compute-0 ~]$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:3a:3e:2d brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 10.0.207.42/24 brd 10.0.207.255 scope global dynamic noprefixroute eth0
       valid_lft 86172sec preferred_lft 86172sec
    inet6 fe80::f816:3eff:fe3a:3e2d/64 scope link 
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:79:d2:b8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
[exouser@gkeyll-vc-test00-compute-1 ~]$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:4b:b2:67 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 10.0.207.231/24 brd 10.0.207.255 scope global dynamic noprefixroute eth0
       valid_lft 86150sec preferred_lft 86150sec
    inet6 fe80::f816:3eff:fe4b:b267/64 scope link 
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:79:d2:b8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever

To tell OpenMPI to use the eth0 interface I passed the --mca btl_tcp_if_include eth0 flag to mpirun, and the job ran successfully. At this point I'm not terribly concerned about performance, but if a faster network is available I'd like to use it (assuming the current setup is not using IB/libfabric/UCX).
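For anyone hitting the same thing, the adjusted launch line in run.sh looks like this (a sketch based on the flag above; excluding the offending interface with --mca btl_tcp_if_exclude virbr0,lo may be an alternative):

# restrict OpenMPI's TCP BTL to the interface whose address is unique per node
mpirun -n ${SLURM_NPROCS} -hostfile $hosts \
       --mca btl_tcp_if_include eth0 ./helloWorld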

cwsmith commented 1 year ago

@DImuthuUpe