E3SM-Project / ACME-ECP

E3SM MMF for DoE ECP project

ACME-ECP crashes with MPI_AllReduce error; fails to exit (Summit) #92

Closed: crjones-amath closed this issue 5 years ago

crjones-amath commented 5 years ago

Problem: An FC5AV1C-L SP1 simulation on Summit failed (at nstep 81) with an MPI_Allreduce error and then hung, failing to exit until the job walltime expired.

> tail e3sm.log.336544.190411-202639
[g32n01:32314] *** An error occurred in MPI_Allreduce
[g32n01:32314] *** reported by process [2154496201,608]
[g32n01:32314] *** on communicator MPI_COMM_WORLD
[g32n01:32314] *** MPI_ERR_COMM: invalid communicator
[g32n01:32314] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[g32n01:32314] ***    and potentially your MPI job)
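For reference, MPI_ERR_COMM generally means the communicator handle passed to the collective was null, freed, or otherwise corrupted before the call. A minimal sketch of one way this class of error can arise (plain MPI in C, not E3SM code, purely illustrative):

```c
/* Illustration only (not E3SM code): pass a freed communicator handle to
 * MPI_Allreduce and trigger MPI_ERR_COMM under MPI_ERRORS_ARE_FATAL.    */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out = 0;
    MPI_Comm dup;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Duplicate and immediately free a communicator; after the free the
     * handle is set to MPI_COMM_NULL.                                   */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    MPI_Comm_free(&dup);

    /* Passing the freed/null handle to a collective raises MPI_ERR_COMM;
     * under the default MPI_ERRORS_ARE_FATAL handler the ranks abort,
     * producing the same class of message seen in the log above.        */
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, dup);

    printf("rank %d: out = %d\n", rank, out);  /* never reached */
    MPI_Finalize();
    return 0;
}
```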

A previous 5-day timing simulation ran successfully with this exact case and executable; the only changes were to STOP_N, REST_N, and JOB_WALLCLOCK_TIME. Up to the point of the crash, the e3sm log files are identical, apart from the ordering of lines written by different processes.

This is not the first time we have seen a crash with an MPI_Allreduce error, nor the first time an MPI-related crash has failed to exit. The failure to exit is possibly related to https://github.com/E3SM-Project/E3SM/issues/2847, but note that we are already using the spectrum-mpi/10.2.0.11-20190201 module on Summit.

crjones-amath commented 5 years ago

Paging @mrnorman @mt5555 for help/guidance on this.

sarats commented 5 years ago

I wonder why the job hung if this was launched with our mpirun.summit script. In our script, jsrun uses "-X 1", which is supposed to make the job exit if any process/thread fails (sketched below).
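For what it's worth, a hedged sketch of such a launch line; the resource-set numbers here are placeholders, not the actual mpirun.summit settings:

```
# Illustrative only: -X 1 (--exit_on_error) asks jsrun to terminate the
# whole job step as soon as any task exits abnormally.
jsrun -X 1 -n 512 -a 6 -c 6 -g 0 ./e3sm.exe
```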

crjones-amath commented 5 years ago

Closing this because it isn't reproducible.