NCAR / spack-gust

Spack production user software stack on the Gust test system

openmpi/4.1.4 under ncarenv/22.10 breaks with MPI_Ssend #32

Open · benkirk opened this issue 1 year ago

benkirk commented 1 year ago

It seems our openmpi/4.1.4 under ncarenv/22.10 fails with synchronous send modes (MPI_Ssend).
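
For context, the failing pattern looks roughly like the following minimal sketch (hypothetical variable names and tags; the actual test is the gist fetched in the steps below): rank 0 issues a standard MPI_Send, a nonblocking MPI_Isend, and then a synchronous MPI_Ssend, while rank 1 matches each with MPI_Recv. In the output below, the job aborts inside MPI_Recv once the first two sends have completed.

// sketch_ssend_recv.C -- minimal synchronous-send sketch (not the gist's exact code)
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (size < 2)
    {
      if (rank == 0) std::fprintf(stderr, "need at least 2 ranks\n");
      MPI_Abort(MPI_COMM_WORLD, 1);
    }

  int payload = 42;

  if (rank == 0)
    {
      // standard send
      MPI_Send(&payload, 1, MPI_INT, 1, /*tag=*/100, MPI_COMM_WORLD);

      // nonblocking send, completed immediately
      MPI_Request req;
      MPI_Isend(&payload, 1, MPI_INT, 1, /*tag=*/101, MPI_COMM_WORLD, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      // synchronous send: only completes once the receiver has matched it;
      // this is the mode that triggers the failure reported in this issue
      MPI_Ssend(&payload, 1, MPI_INT, 1, /*tag=*/102, MPI_COMM_WORLD);
    }
  else if (rank == 1)
    {
      int recvbuf = 0;
      for (int tag = 100; tag <= 102; ++tag)
        MPI_Recv(&recvbuf, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

  std::printf("rank %d of %d done\n", rank, size);

  MPI_Finalize();
  return 0;
}
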

To reproduce:

module reset && module load gcc openmpi
[ -f hello_world_mpi_ssend_recv.C ] || wget https://gist.githubusercontent.com/benkirk/15aea836fa7feb9636bc7e799e714c15/raw/df0db410535ff14d60c4ca37b336b6d1adc28c4d/hello_world_mpi_ssend_recv.C
mpicxx -o hello_world_mpi_ssend_recv hello_world_mpi_ssend_recv.C
qcmd -q main -l select=1:ncpus=2:mpiprocs=2 -l walltime=00:30:00 -A SCSG0001 -- mpiexec -n 2 --mca opal_warn_on_missing_libcuda 0 ./hello_world_mpi_ssend_recv

Output:

Waiting on job launch; 6122.gusched01 with qsub arguments:
    qsub  -l select=1:ncpus=2:mpiprocs=2 -A SCSG0001 -q main@gusched01 -l walltime=00:30:00

--------------------------------------------------------------------------
The library attempted to open the following supporting CUDA libraries,
but each of them failed.  CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
libcuda.dylib: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with
--mca opal_warn_on_missing_libcuda 0 to suppress this message.  If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
--------------------------------------------------------------------------
Hello from 0 / gu0013, running ./hello_world_mpi_ssend_recv on 2 ranks
Hello from 1 / gu0013, running ./hello_world_mpi_ssend_recv on 2 ranks
calling MPI_Send...done
calling MPI_Isend...done
[gu0013:207035] *** An error occurred in MPI_Recv
[gu0013:207035] *** reported by process [354680833,1]
[gu0013:207035] *** on communicator MPI_COMM_WORLD
[gu0013:207035] *** MPI_ERR_OTHER: known error not in list
[gu0013:207035] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gu0013:207035] ***    and potentially your MPI job)
[gu0013:207030] 1 more process has sent help message help-mpi-common-cuda.txt / dlopen failed
[gu0013:207030] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
benkirk commented 1 year ago

Upstream OpenMPI has been aware of this for a year now: https://github.com/open-mpi/ompi/issues/10210