Open sjsprecious opened 2 years ago
This might be tied to an issue with MPI & GPUs and the OS version on the compute nodes. @jbaksta is looking at upgrading the OS to the supported version soon. If nothing else, that will allow us to report segfaults and get support.
Currently cray-mpich is not working with MPS. We are working with HPE to resolve this. An OpenMPI build has been introduced; I recommend you give that a try.
So this does seem to be resolved on Derecho with newer network drivers. I am fairly confident MPS will be working properly for you when we launch.
I built a simple MPI+OpenACC example on Gust. It compiled successfully but crashed at runtime when I turned on MPS.
The error message looks like:
gu0017.hsn.gu.hpc.ucar.edu: rank 0 died from signal 11 and dumped core
If I turned off MPS and re-ran the program, it worked just fine.
The run command is
mpiexec --cpu-bind depth -n 1 -ppn 1 -d 1 ./mpi_mps.exe
and I only requested a single GPU. Running 2 MPI ranks per node gave the same error.

My environment module list is:

1) ncarenv/22.08 (S)
2) craype/2.7.17 (S)
3) nvhpc/22.7
4) ncarcompilers/0.6.2
5) cray-mpich/8.1.18
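When debugging a crash that only appears with MPS enabled, it can help to first confirm whether the MPS control daemon is actually up on the compute node before launching. A minimal check, assuming NVIDIA's standard `nvidia-cuda-mps-control` daemon (the daemon name is NVIDIA's; nothing here is Gust-specific):

```shell
#!/bin/sh
# Check whether the CUDA MPS control daemon is running on this node.
# Started with:  nvidia-cuda-mps-control -d
# Stopped with:  echo quit | nvidia-cuda-mps-control
if pgrep -f nvidia-cuda-mps-control >/dev/null 2>&1; then
    echo "MPS daemon running"
else
    echo "MPS daemon not running"
fi
```

If the daemon is not running, ranks fall back to direct (non-MPS) access to the GPU, which would mask rather than reproduce the failure described above.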