Closed · severinson closed this issue 3 years ago
I've found a workaround. Adding the argument `-x UCX_NET_DEVICES=eth0` to `mpirun` solves the "Connection refused" issue, and adding `--mca coll ^hcoll` removes the InfiniBand warnings. The updated jobfile (`mpi_isend.job`) is:
```sh
#!/bin/sh -l
#SBATCH --job-name=pool
#SBATCH --output=pool.out
#SBATCH --nodes=2
#SBATCH --time=600:00
#SBATCH --tasks-per-node=1
#SBATCH --partition=hpc
mpirun --mca coll ^hcoll -x UCX_NET_DEVICES=eth0 ./mpi_isend
```
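An equivalent way to apply the same workaround, as a sketch, is to set the values through environment variables in the jobfile rather than as `mpirun` flags, using Open MPI's `OMPI_MCA_*` convention. Note that whether a plain `export` reaches remote ranks depends on the launcher's environment forwarding, so `-x UCX_NET_DEVICES=eth0` remains the safer form:

```shell
#!/bin/sh -l
#SBATCH --job-name=pool
#SBATCH --output=pool.out
#SBATCH --nodes=2
#SBATCH --time=600:00
#SBATCH --tasks-per-node=1
#SBATCH --partition=hpc

# Restrict UCX to the Ethernet device and disable the hcoll collective
# component via environment variables instead of mpirun flags.
export UCX_NET_DEVICES=eth0
export OMPI_MCA_coll="^hcoll"

mpirun ./mpi_isend
```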
Going to mark this as closed since this is really related to MPI and the platform, not a CycleCloud or Slurm issue.
Hello,
OpenMPI was recently upgraded from version 4.0.5 to 4.1.0 on CycleCloud. Since the upgrade I'm having issues using non-blocking communication with Slurm on CycleCloud.
First, I have to use `--mca coll ^hcoll` to avoid warnings regarding InfiniBand, which F2s_v2 nodes are not equipped with. I had this issue with version 4.0.5 as well.

Second, since the recent upgrade to OpenMPI v4.1.0, non-blocking communication has stopped working for me. The same code worked with OpenMPI v4.0.5.
This is the error I'm getting. I've confirmed that the problem occurs when I call `MPI_Isend`. I'm attaching a small example to reproduce the problem below.

Program code (`mpi_isend.c`):

Jobfile (`mpi_isend.job`):

Steps to reproduce: