NCAR / spack-gust

Spack production user software stack on the Gust test system

Issue with MPS #12

Open sjsprecious opened 2 years ago

sjsprecious commented 2 years ago

I built a simple MPI+OpenACC example on Gust. It compiled successfully but ran into a runtime error when I turned on MPS.

The error message looks like: gu0017.hsn.gu.hpc.ucar.edu: rank 0 died from signal 11 and dumped core

If I turned off MPS and re-ran the program, it worked just fine.

The run command is mpiexec --cpu-bind depth -n 1 -ppn 1 -d 1 ./mpi_mps.exe, and I request only a single GPU. I also tried 2 MPI ranks per node and got the same error.
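For reference, a minimal MPI+OpenACC program along these lines might look like the sketch below. This is hypothetical: the source of mpi_mps.exe is not posted in this thread, and the compile line assumes the NVHPC compiler wrappers.

```c
/*
 * Minimal MPI + OpenACC sketch (hypothetical reproducer; not the
 * actual mpi_mps.exe from this issue). Each rank offloads a simple
 * scaling loop to a GPU and verifies the result on the host.
 *
 * Assumed compile line with the NVHPC wrappers, e.g.:
 *   mpicc -acc -gpu=cc80 mpi_mps.c -o mpi_mps.exe
 */
#include <mpi.h>
#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* With MPS, multiple ranks can share one GPU; round-robin the
       ranks over whatever devices are visible. */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    if (ndev > 0)
        acc_set_device_num(rank % ndev, acc_device_nvidia);

    const int n = 1 << 20;
    double *x = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        x[i] = (double)i;

    /* Offload a simple scaling loop to the GPU. */
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] *= 2.0;

    /* Cheap correctness check on the host. */
    int ok = (x[1] == 2.0 && x[n - 1] == 2.0 * (n - 1));
    printf("rank %d of %d on device %d: %s\n",
           rank, nranks, ndev > 0 ? rank % ndev : -1,
           ok ? "OK" : "FAIL");

    free(x);
    MPI_Finalize();
    return 0;
}
```

Under MPS, the control daemon is typically started with nvidia-cuda-mps-control -d before launching mpiexec; without MPS, each rank would usually get exclusive use of its assigned device.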

My environment module list is:

1) ncarenv/22.08 (S)
2) craype/2.7.17 (S)
3) nvhpc/22.7
4) ncarcompilers/0.6.2
5) cray-mpich/8.1.18

vanderwb commented 1 year ago

This might be tied to a known issue involving MPI, GPUs, and the OS version on the compute nodes. @jbaksta is looking at upgrading the OS to the supported version soon. If nothing else, that will allow us to report segfaults like this one and get vendor support.

vanderwb commented 1 year ago

Currently, cray-mpich does not work with MPS; we are working with HPE to resolve this. In the meantime, an OpenMPI build has been introduced, and I recommend you give that a try.

vanderwb commented 1 year ago

This does seem to be resolved on Derecho with the newer network drivers. I am fairly confident MPS will work properly for you when we launch.