MCSclimate / MCT

Model Coupling Toolkit

use of MPI_Alltoallv in rearrange_ causes hangs #54

Closed: worleyph closed this issue 4 years ago

worleyph commented 5 years ago

In 2010, Sheri Mickelson identified and diagnosed that using MPI_Alltoallv in rearrange_ (in m_Rearranger.F90) often causes hangs in CCSM. This is also true in E3SM and CESM, as I recently rediscovered. The issue is that rearrange_ uses the communicator ThisMCTWorld%MCT_comm, which seems to always be set to MPI_COMM_WORLD (or to something containing the same processes), while rearrange_ itself is sometimes not called by all of the processes in MPI_COMM_WORLD. The collective MPI_Alltoallv requires participation by all processes in the relevant communicator, and so it hangs in this situation.
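To make the failure mode concrete, here is a minimal standalone sketch (not MCT code; all names are illustrative) in which only some ranks call MPI_Alltoallv on MPI_COMM_WORLD. With an MPI library that enforces full participation, the ranks that do call it block forever:

```fortran
program alltoallv_hang
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i
  integer, allocatable :: sendbuf(:), recvbuf(:), counts(:), displs(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(sendbuf(nprocs), recvbuf(nprocs), counts(nprocs), displs(nprocs))
  sendbuf = rank                     ! one integer destined for every rank
  counts  = 1
  displs  = [(i-1, i = 1, nprocs)]

  ! Mimic rearrange_ being entered by only a subset of MPI_COMM_WORLD:
  ! only the even ranks reach the collective, so it can never complete.
  if (mod(rank, 2) == 0) then
     call MPI_Alltoallv(sendbuf, counts, displs, MPI_INTEGER, &
                        recvbuf, counts, displs, MPI_INTEGER, &
                        MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program alltoallv_hang
```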

(Aside: MPI_Alltoallv was added as a performance optimization in 2006 for the Cray vector systems, and the Cray MPI library at the time did not require all processes to participate in the call to MPI_Alltoallv. Sheri discovered the problem when trying to run on the IBM Intrepid system at Argonne. I have not looked into what the MPI standard says, but the MPI libraries we are using at NERSC and OLCF both require participation by all processes in the communicator.)

For E3SM, the hang will occur if either the CPL or the ROF communicator does not contain the same processes as MPI_COMM_WORLD.

To avoid future rediscoveries of this issue, I believe that the MPI_Alltoallv option should be removed from rearrange_. The swapm alternative implementation is what we currently use when usealltoall=.true., so as far as I know nothing will be impacted by this change.
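For reference, the pairwise pattern that swapm-style code relies on looks roughly like the sketch below. This is illustrative only, not the actual swapm or MCT source, and it assumes for brevity that the count exchanged between each pair of ranks is symmetric. Because it uses only point-to-point operations, ranks with nothing to exchange never have to enter it:

```fortran
! Illustrative pairwise exchange over communicator comm; assumes the
! number of integers sent to rank p equals the number received from p.
subroutine pairwise_exchange(comm, sendbuf, recvbuf, counts, displs)
  use mpi
  implicit none
  integer, intent(in)  :: comm
  integer, intent(in)  :: sendbuf(*), counts(*), displs(*)
  integer, intent(out) :: recvbuf(*)
  integer :: ierr, nprocs, p, nreq
  integer, allocatable :: reqs(:)

  call MPI_Comm_size(comm, nprocs, ierr)
  allocate(reqs(2*nprocs))
  nreq = 0
  do p = 0, nprocs-1
     if (counts(p+1) > 0) then
        nreq = nreq + 1
        call MPI_Irecv(recvbuf(displs(p+1)+1), counts(p+1), MPI_INTEGER, &
                       p, 100, comm, reqs(nreq), ierr)
        nreq = nreq + 1
        call MPI_Isend(sendbuf(displs(p+1)+1), counts(p+1), MPI_INTEGER, &
                       p, 100, comm, reqs(nreq), ierr)
     end if
  end do
  ! Only ranks that posted operations wait here; there is no global
  ! collective, so nonparticipating ranks are never needed.
  call MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE, ierr)
end subroutine pairwise_exchange
```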

The other option is to "fix" rearrange_ so that it uses a subcommunicator that reflects which processes actually call the routine. While this is the most elegant solution in my opinion, I don't know what it would entail, or whether it is even doable within the current MCT design. If we did take this approach, MPI_Alltoallv would then be a valid option again.
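As a rough illustration of that second option (a hypothetical sketch, not a proposed patch; in MCT the parent communicator would be ThisMCTWorld%MCT_comm rather than MPI_COMM_WORLD), MPI_Comm_split can carve out a subcommunicator containing only the participating ranks. The catch is that MPI_Comm_split is itself collective over the parent communicator, so every rank would have to call it, presumably once at setup time rather than inside rearrange_:

```fortran
program split_then_alltoallv
  use mpi
  implicit none
  integer :: ierr, rank, color, subcomm, subsize, i
  integer, allocatable :: sendbuf(:), recvbuf(:), counts(:), displs(:)
  logical :: participates

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Stand-in for "this rank actually calls rearrange_": here, even ranks.
  participates = (mod(rank, 2) == 0)

  ! Collective over the parent communicator, so every rank calls it.
  ! Non-participants pass MPI_UNDEFINED and get MPI_COMM_NULL back.
  color = MPI_UNDEFINED
  if (participates) color = 1
  call MPI_Comm_split(MPI_COMM_WORLD, color, 0, subcomm, ierr)

  if (participates) then
     call MPI_Comm_size(subcomm, subsize, ierr)
     allocate(sendbuf(subsize), recvbuf(subsize), counts(subsize), displs(subsize))
     sendbuf = rank
     counts  = 1
     displs  = [(i-1, i = 1, subsize)]
     ! The collective now involves only the participating ranks, so it
     ! completes even though the other ranks never enter this branch.
     call MPI_Alltoallv(sendbuf, counts, displs, MPI_INTEGER, &
                        recvbuf, counts, displs, MPI_INTEGER, &
                        subcomm, ierr)
     call MPI_Comm_free(subcomm, ierr)
  end if

  call MPI_Finalize(ierr)
end program split_then_alltoallv
```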

amametjanov commented 5 years ago

I am optimistic that the second option is doable: will submit a PR after some cleanup.

worleyph commented 5 years ago

Great!

rljacob commented 5 years ago

You're right that rearrange_, like Router, uses a copy of MPI_COMM_WORLD. That was doable because all of those routines were originally written with paired Send/Recvs.

I'm really surprised this wasn't noticed sooner, given all the NERSC and NCAR machines CESM was run on with MCT. usealltoall is false by default. Did we just not use it very often? And when we did, was it on a Cray or with a stacked layout (where all models share all processors)?

worleyph commented 5 years ago

As I mentioned above, the MPI_Alltoallv option was added specifically for the Cray vector system (X1? X1E?), and the Cray MPI library at the time did not require that all processes in the communicator call this collective. The problem was identified in 2010 by Sheri when she tried to use the usealltoall option on Intrepid, and I added a swapm option to usealltoall to work around it. At the time "we" knew that this was broken on some systems, but I forgot in the meantime. I've been using swapm in my work on DOE systems. I don't know how CESM has been run on other systems.

rljacob commented 5 years ago

I remember that Cray was the motivation but didn't know (or forgot) that was also the only system it was supposed to be used on.

worleyph commented 5 years ago

> I remember that Cray was the motivation but didn't know (or forgot) that was also the only system it was supposed to be used on.

I don't think it was clear at the time that it was only for the Cray system. On the Cray vector system, Cray's collectives were much better than point-to-point, but collectives were supposedly "good things" to use on all systems.