Closed crjones-amath closed 5 years ago
Paging @mrnorman @mt5555 for help/guidance on this.
I wonder why the job hung if it was launched with our mpirun.summit script. In that script, jsrun uses "-X 1", which is supposed to exit the job if any process/thread fails.
Closing this because it isn't reproducible.
Problem: An FC5AV1C-L SP1 simulation on Summit failed at nstep 81 with an MPI_Allreduce error. The job then failed to exit until its walltime expired.
A previous 5-day timing simulation ran successfully with this exact case (same case, same executable; the only changes were to STOP_N, REST_N, and JOB_WALLCLOCK_TIME). Up to the point of the crash, the e3sm log file is identical to the one from that earlier run, except for the ordering of lines returned by different processes.
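For anyone who wants to repeat the log comparison, here is a minimal sketch of an order-insensitive check. It assumes both logs have already been read into lists of lines and that the successful run's log has been truncated by hand to the crash point; the function names are hypothetical and not part of E3SM or CIME.

```python
from collections import Counter


def _line_counts(lines):
    # Count each distinct line, ignoring trailing newlines, so that
    # lines interleaved differently by different MPI ranks still match.
    return Counter(line.rstrip("\n") for line in lines)


def logs_match_ignoring_order(lines_a, lines_b):
    """True if both logs hold the same lines the same number of times."""
    return _line_counts(lines_a) == _line_counts(lines_b)


def diff_ignoring_order(lines_a, lines_b):
    """Return (lines only in a, lines only in b), each sorted."""
    a, b = _line_counts(lines_a), _line_counts(lines_b)
    return sorted((a - b).elements()), sorted((b - a).elements())
```

If `logs_match_ignoring_order` returns False, `diff_ignoring_order` lists the lines that differ, which is usually easier to read than a plain `diff` when ranks write in a different order.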
This is not the first time we have seen a crash with an MPI_Allreduce error, and it is not the first time we've had an MPI-related crash that failed to exit. The failure to exit is possibly related to https://github.com/E3SM-Project/E3SM/issues/2847, but note that we are already using module spectrum-mpi/10.2.0.11-20190201 on Summit.