geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] GCHP dies when all diagnostic collections are turned off #9

Closed. yantosca closed this issue 5 years ago.

yantosca commented 5 years ago

I have been looking at this issue.

I ran on the Amazon cloud on an r5.2xlarge instance with AMI GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66).

In HISTORY.rc I turned on only these collections:

1. SpeciesConc_avg: only archived SpeciesConc_NO
2. SpeciesConc_inst: only archived SpeciesConc_NO
3. StateMet_avg: only archived Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
4. StateMet_inst: only archived Met_AD
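For context, "only archived SpeciesConc_NO" means the collection's .fields list in HISTORY.rc was cut down to that one entry, and collections are enabled or disabled by commenting them in or out of the COLLECTIONS list at the top of the file. A rough sketch of what the trimmed SpeciesConc_inst block looks like (the frequency/duration values are illustrative, not necessarily the ones used in this run):

    SpeciesConc_inst.template:   '%y4%m2%d2_%h2%n2z.nc4',
    SpeciesConc_inst.format:     'CFIO',
    SpeciesConc_inst.frequency:  010000
    SpeciesConc_inst.duration:   010000
    SpeciesConc_inst.mode:       'instantaneous'
    SpeciesConc_inst.fields:     'SpeciesConc_NO  ', 'GIGCchem',
    ::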

This 1-hour run on 6 cores finished and printed all timing information:

  GIGCenv  total:   0.346
  GIGCchem total: 123.970
  Dynamics total:  18.741
  GCHP     total: 140.931
  HIST     total:   0.264
  EXTDATA  total: 133.351

So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there does not seem to be much rhyme or reason as to why requesting more diagnostics fails. Could it be the memory limits of the instance? I don't know.
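One way to test the memory hypothesis would be to watch memory on the instance while the run is in progress and compare it against the per-timestep Memuse(MB) lines that MAPL prints. A minimal sketch, assuming the GCHP output is being teed to a file named gchp.log:

    # Watch free memory on the instance every 10 s while GCHP runs
    watch -n 10 free -m

    # Afterwards, pull MAPL's per-timestep memory report out of the log
    grep "Memuse(MB)" gchp.log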

This AMI was built with the mpich2 MPI package. Maybe it is worth trying OpenMPI on the cloud?
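Before swapping MPI libraries, it may also be worth confirming which implementation the executable is actually linked against and launched with. A quick sketch, assuming the GCHP executable is named geos:

    # Which MPI shared libraries does the executable link against?
    ldd ./geos | grep -i mpi

    # Which launcher is on PATH? OpenMPI reports "mpirun (Open MPI) x.y.z",
    # while MPICH's Hydra launcher reports "HYDRA build details".
    mpirun --version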

Also note: this run finished without dropping a core file (as currently happens on Odyssey), so the core-file behavior appears to be an Odyssey-specific environment problem.

But if I run with no diagnostic collections turned on, then the run dies at the 10-minute mark:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     - RDAER: Using online SO4 NH4 NIT!
     - RDAER: Using online BCPI OCPI BCPO OCPO!
     - RDAER: Using online SALA SALC
     - DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!

 Setting history variable pointers to GC and Export States:
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.638E+03  4.409E+03      2.223E+03  2.601E+03  3.258E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.823E+04  0.000E+00
MAPL_ExtDataInterpField                       3300
EXTDATA::Run_                                 1471
MAPL_Cap                                       777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0

From the traceback, it looks as if it is hanging while interpolating a field in ExtData.

yantosca commented 5 years ago

Jiawei Zhuang wrote:

The run also crashes at 00:10 if I save only one collection, SpeciesConc_inst, with only two species (SpeciesConc_NO and SpeciesConc_O3) in it.

  --- Chemistry done!
  --- Do wetdep now
  --- Wetdep done!

 Setting history variable pointers to GC and Export States:
 SpeciesConc_NO
 SpeciesConc_O3
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.723E+03  4.494E+03  2.306E+03  2.684E+03  3.260E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.852E+04  0.000E+00
 offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3                     54
DYNAMICSRun                                    703
GCHP::Run                                      407
MAPL_Cap                                       792

But with two collections, SpeciesConc_avg and SpeciesConc_inst, each with only the two species SpeciesConc_NO and SpeciesConc_O3 in it, the run is able to finish and print full timing information:


 Writing:    144 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
 Writing:    144 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4

  Times for GIGCenv
TOTAL                   :       2.252
INITIALIZE              :       0.000
RUN                     :       2.250
GenInitTot              :       0.004
--GenInitMine           :       0.003
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.000
--GenFinalMine          :       0.000
GenRecordTot            :       0.001
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000

HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
   Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc

  Times for GIGCchem
TOTAL                   :     505.760
INITIALIZE              :       3.617
RUN                     :     498.376
FINALIZE                :       0.000
DO_CHEM                 :     488.864
CP_BFRE                 :       0.121
CP_AFTR                 :       4.080
GC_CONV                 :      36.070
GC_EMIS                 :       0.000
GC_DRYDEP               :       0.119
GC_FLUXES               :       0.000
GC_TURB                 :      17.966
GC_CHEM                 :     403.528
GC_WETDEP               :      19.443
GC_DIAGN                :       0.000
GenInitTot              :       2.719
--GenInitMine           :       2.719
GenRunTot               :       0.000
--GenRunMine            :       0.000
GenFinalTot             :       0.963
--GenFinalMine          :       0.963
GenRecordTot            :       0.000
--GenRecordMine         :       0.000
GenRefreshTot           :       0.000
--GenRefreshMine        :       0.000

   -----------------------------------------------------
      Block          User time  System Time   Total Time
   -----------------------------------------------------
   TOTAL                      815.4433       0.0000     815.4433
   COMM_TOTAL                   3.3098       0.0000       3.3098
   COMM_TRAC                    3.3097       0.0000       3.3097
   FV_TP_2D                    90.1448       0.0000      90.1448

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3126 RUNNING AT ip-172-31-0-74
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

lizziel commented 5 years ago

This issue is not reproducible on the Harvard Odyssey cluster. If you repeat the same tests multiple times, do you always get the same result? Do you bypass the issue if transport is turned off (turn it off in runConfig.sh, not in input.geos)?
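For reference, transport is toggled in runConfig.sh, which is sourced by the run script and rewrites the corresponding settings in input.geos and GCHP.rc. A sketch, where Turn_on_Transport is a placeholder for whatever the switch is called in your GCHP version:

    # In runConfig.sh (placeholder variable name; check your runConfig.sh for the
    # actual transport switch). Setting it to F disables advection for the run.
    Turn_on_Transport=F

    # Do not edit input.geos directly for this test: runConfig.sh rewrites those
    # entries each time the run script is executed, so the edit would be lost.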

yantosca commented 5 years ago

On the AWS cloud, I can reliably reproduce the run dying at 00:10 when all collections are turned off in HISTORY.rc.

With all collections turned off AND with transport turned off, the run still fails at 00:10.

JiaweiZhuang commented 5 years ago

Using OpenMPI 2.1 instead of MPICH 3.3 fixes this problem (#10), but then the run hits the problem of not being able to save diagnostics.

lizziel commented 5 years ago

Upgrading to OpenMPI 3 may fix the remaining issue. We ran into the same thing on the Odyssey cluster, and switching to the newer OpenMPI fixed it.
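A quick way to confirm which OpenMPI is active in the run environment before relaunching; a sketch that assumes a 6-core run and an executable named geos:

    # ompi_info is OpenMPI-specific; this prints the active version
    ompi_info | grep "Open MPI:"

    # Relaunch the 6-core run with OpenMPI's launcher
    mpirun -np 6 ./geos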

lizziel commented 5 years ago

I am closing this issue since it is fixed by switching from MPICH 3.3 to OpenMPI 2.1.