geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

Default GCHP run crashes almost immediately in MAPL_CapGridComp.F90 #8

Closed LiamBindle closed 4 years ago

LiamBindle commented 4 years ago

Hi everyone,

I'm just submitting this for the archive of issues on GitHub.

Relevant Information

What happened

Yesterday I tried running the default 6-core, 1-node, 1-hour GCHP simulation and it crashed almost immediately. This happened with GCHP_CTM 13.0.0-alpha.1, but it could happen with any version that uses MAPL 2.0+. Below is the full output. The important parts to pick out are:

  1. It failed almost immediately (very little output).
  2. The "Abort(XXXXXX) on node Y" lines report that GCHP is running on different nodes, despite this being a 6-core, single-node simulation.
  3. GCHP crashed after the assertion on line 250 of MAPL_CapGridComp.F90 failed (permalink here).

Failed run output:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6
 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
pe=00001 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00001 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 1
pe=00002 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00002 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00029    GEOSChem.F90                             <status=1>
pe=00003 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00003 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00003 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00029    GEOSChem.F90                             <status=1>
pe=00000 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00000 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 2
Abort(262146) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 3
pe=00004 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00004 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00004 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 4
pe=00005 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00005 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 5
Abort(262146) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 0

The Problem

The issue was that ESMF was built with ESMF_COMM=mpiuni. This appears to have happened because the spack install spec wasn't quite right, but I didn't build ESMF myself so I can't be sure.

How do I check which ESMF_COMM my ESMF was built with?

The build-time value of ESMF_COMM is written to esmf.mk beside your ESMF libraries. You can see it with either of the following commands:

grep 'ESMF_COMM' $(spack location -i esmf)/lib/esmf.mk

or

grep 'ESMF_COMM' /path/to/ESMF/libraries/esmf.mk
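If you want a quick pass/fail check, here is a minimal sketch (assuming the spack layout from the first command above; substitute your own esmf.mk path if you built ESMF manually):

# Hedged sketch: warn if ESMF was built against the MPI-bypass library (mpiuni).
# The esmf.mk path assumes a spack install of ESMF; adjust it otherwise.
ESMF_MK="$(spack location -i esmf)/lib/esmf.mk"
if grep -q 'ESMF_COMM.*mpiuni' "$ESMF_MK"; then
    echo "ESMF was built with ESMF_COMM=mpiuni (no real MPI) -- rebuild it"
else
    grep 'ESMF_COMM' "$ESMF_MK"
fi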

Solution

Rebuild ESMF and make sure ESMF_COMM is set to the appropriate MPI flavor.
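If you end up building ESMF manually, the relevant knob is the ESMF_COMM environment variable at build time. Below is a minimal sketch, with placeholder paths and an Intel toolchain assumed; adjust the values (and consult the ESMF build documentation) for your system:

# Hedged sketch of a manual ESMF rebuild against a real MPI (Intel MPI here).
# ESMF_DIR, ESMF_COMPILER, ESMF_COMM, and ESMF_INSTALL_PREFIX are standard ESMF
# build-time environment variables; the values below are placeholders.
export ESMF_DIR=/path/to/esmf-source
export ESMF_COMPILER=intel            # or gfortran, etc.
export ESMF_COMM=intelmpi             # or openmpi, mvapich2, ... to match your MPI
export ESMF_INSTALL_PREFIX=/path/to/esmf-install
cd $ESMF_DIR
make -j4 && make install
# Verify the new build before pointing GCHP at it:
find $ESMF_INSTALL_PREFIX -name esmf.mk -exec grep 'ESMF_COMM' {} +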

lizziel commented 4 years ago

Thanks @LiamBindle. Do you know if using mpiuni sets MPI to automatically use one core per node? Does this implementation exist for some specific purpose?

LiamBindle commented 4 years ago

@lizziel According to this, mpiuni is "a single-processor MPI-bypass library". I don't really understand what that means, but it seems to be in line with what I was seeing (each process thinking it was root).

Actually, last night I noticed that ESMF_COMM wasn't "intelmpi" when I thought it should be, so I rebuilt ESMF and that fixed my problem. I tried to generalize this issue for the purpose of the issue archive: the issue is really "ESMF was built with the wrong ESMF_COMM", not an ESMF + Intel MPI compatibility problem.

Here is the part of spack that sets ESMF_COMM. It looks like ESMF_COMM=intelmpi if and only if +mpi and ^intel-parallel-studio+mpi are in the spack install spec. Our compute1 sysadmin was having trouble getting spack to concretize this spec, though, and ultimately opted to build ESMF manually rather than with spack. Someone familiar with spack could probably advise on how to do this properly, but rebuilding ESMF manually was easy enough.
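For the archive, here is a rough sketch of a spack spec that, per the package logic referenced above, should produce ESMF_COMM=intelmpi (untested; concretization details will vary by spack version and site configuration):

# Hedged sketch: spack spec expected to select intelmpi, per the package logic above.
spack install esmf+mpi ^intel-parallel-studio+mpi
# Afterwards, confirm the build (same check as earlier in this issue):
grep 'ESMF_COMM' $(spack location -i esmf)/lib/esmf.mk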

Here is the part of spack that sets ESMF_COMM. It looks like ESMF_COMM=intelmpi iff +mpi and ^intel-parallel-studio+mpi is in the spack install spec. Our compute1 sysadmin was having trouble getting spack to concretize this spec though, and ultimately opted to build ESMF manually rather than with spack. I suspect someone familiar with spack could advise on how to do this properly, but rebuilding ESMF manually was easy enough.