cwsmith opened 2 years ago
Given the error `server accept cannot find guid` and the comment about IP addresses needing to be unique here: https://github.com/open-mpi/ompi/issues/8257#issuecomment-734356200, I took a look at the IP addresses of the compute nodes and saw that the virbr0 interface has the same address on each of them:
```
[exouser@gkeyll-vc-test00-compute-0 ~]$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:3a:3e:2d brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 10.0.207.42/24 brd 10.0.207.255 scope global dynamic noprefixroute eth0
       valid_lft 86172sec preferred_lft 86172sec
    inet6 fe80::f816:3eff:fe3a:3e2d/64 scope link
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:79:d2:b8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
```
```
[exouser@gkeyll-vc-test00-compute-1 ~]$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8900 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:4b:b2:67 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 10.0.207.231/24 brd 10.0.207.255 scope global dynamic noprefixroute eth0
       valid_lft 86150sec preferred_lft 86150sec
    inet6 fe80::f816:3eff:fe4b:b267/64 scope link
       valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:79:d2:b8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
```
To tell OpenMPI to use the eth0 interface, I passed the `--mca btl_tcp_if_include eth0` flag to mpirun, and the job ran successfully. At this point I'm not terribly concerned about performance, but if there is a way to use a faster network I'd like to use it (assuming that this setup is not using IB/libfabric/UCX).
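For reference, a minimal sketch of that kind of invocation; only the MCA flag comes from the run described above, while the binary name and the absence of other arguments are assumptions:

```bash
# Restrict Open MPI's TCP BTL to eth0 so it ignores virbr0, which has the
# same 192.168.122.1 address on every compute node.
mpirun --mca btl_tcp_if_include eth0 ./helloWorld
```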
@DImuthuUpe
Hello,
I'm hitting problems running MPI jobs that require more than one node using the system install of OpenMPI 4.1.1 in Rocky 8.6. Specifically, the job script (run.sh, sketched below) runs on a single 2-core m3.small node with `sbatch -n 2 -t 5 ./run.sh`, but fails when using four cores, `sbatch -n 4 -t 5 ./run.sh`, with a `server accept cannot find guid` error.
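As a rough sketch, run.sh is presumably a thin wrapper around mpirun along these lines (its contents are an assumption, not the original script; only the helloWorld binary name comes from the compile command mentioned below):

```bash
#!/bin/bash
# Hypothetical run.sh sketch: launch the MPI binary across the ranks
# requested via sbatch (-n 2 or -n 4 in the commands above).
mpirun ./helloWorld
```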
The helloWorld binary was compiled with `mpicxx helloWorld.cc -o helloWorld`. I also tried running with `srun <binary> <args>`, but it appears that openmpi was not built with slurm/pmi support.
A quick google search on the error message led me to this discussion: https://github.com/open-mpi/ompi/issues/8257#issue-751169260