Closed: yantosca closed this issue 5 years ago
Jiawei Zhuang wrote:
The run also crashes at 00:10 if I only save one collection, SpeciesConc_inst, with only two species (SpeciesConc_NO and SpeciesConc_O3) in it.
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
SpeciesConc_NO
SpeciesConc_O3
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.723E+03 4.494E+03 2.306E+03 2.684E+03 3.260E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.852E+04 0.000E+00
offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3 54
DYNAMICSRun 703
GCHP::Run 407
MAPL_Cap 792
But with two collections, SpeciesConc_avg and SpeciesConc_inst, each with only the two species SpeciesConc_NO and SpeciesConc_O3 in it, the run is able to finish and print full timing information:
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4
Times for GIGCenv
TOTAL : 2.252
INITIALIZE : 0.000
RUN : 2.250
GenInitTot : 0.004
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.000
--GenFinalMine : 0.000
GenRecordTot : 0.001
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 505.760
INITIALIZE : 3.617
RUN : 498.376
FINALIZE : 0.000
DO_CHEM : 488.864
CP_BFRE : 0.121
CP_AFTR : 4.080
GC_CONV : 36.070
GC_EMIS : 0.000
GC_DRYDEP : 0.119
GC_FLUXES : 0.000
GC_TURB : 17.966
GC_CHEM : 403.528
GC_WETDEP : 19.443
GC_DIAGN : 0.000
GenInitTot : 2.719
--GenInitMine : 2.719
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.963
--GenFinalMine : 0.963
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 815.4433 0.0000 815.4433
COMM_TOTAL 3.3098 0.0000 3.3098
COMM_TRAC 3.3097 0.0000 3.3097
FV_TP_2D 90.1448 0.0000 90.1448
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3126 RUNNING AT ip-172-31-0-74
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
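For anyone reproducing this, the two-collection setup above maps onto HISTORY.rc entries roughly like the sketch below. Only the collection and field names are taken from this thread; the template, mode, frequency, and duration values are illustrative placeholders, not the actual run-directory settings:

```
COLLECTIONS: 'SpeciesConc_avg',
             'SpeciesConc_inst',
::
  SpeciesConc_avg.template:   '%y4%m2%d2_%h2%n2z.nc4',
  SpeciesConc_avg.mode:       'time-averaged',
  SpeciesConc_avg.frequency:  010000,
  SpeciesConc_avg.duration:   010000,
  SpeciesConc_avg.fields:     'SpeciesConc_NO ', 'GIGCchem',
                              'SpeciesConc_O3 ', 'GIGCchem',
::
  SpeciesConc_inst.template:  '%y4%m2%d2_%h2%n2z.nc4',
  SpeciesConc_inst.mode:      'instantaneous',
  SpeciesConc_inst.frequency: 010000,
  SpeciesConc_inst.duration:  010000,
  SpeciesConc_inst.fields:    'SpeciesConc_NO ', 'GIGCchem',
                              'SpeciesConc_O3 ', 'GIGCchem',
::
```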
This issue is not reproducible on the Harvard Odyssey cluster. If you repeat the same tests multiple times, do you always get the same result? Can you bypass the issue by turning transport off (turn it off in runConfig.sh, not input.geos)?
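For reference, turning transport off via runConfig.sh looks roughly like the line below; the exact variable name is from memory of GCHP 12.x and may differ by version, so check your copy of the script:

```
# In runConfig.sh (variable name assumed; sourcing the script
# propagates this setting to input.geos):
Turn_on_Transport=F
```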
On the AWS cloud, I can faithfully reproduce the run dying at 00:10 when all collections are turned off in HISTORY.rc.
With all collections turned off AND with transport turned off, the run still fails at 00:10.
Using OpenMPI 2.1 instead of MPICH 3.3 fixes this problem (#10), but then it runs into the problem of not being able to save diagnostics.
Upgrading to OpenMPI 3 may fix the remaining issue. We ran into this on the Odyssey cluster and switching to the new OpenMPI fixed it.
I am closing this issue since it is fixed by switching to OpenMPI 2.1 from MPICH 3.3.
Have been looking at this issue.
Ran on the Amazon cloud on an r5.2xlarge instance with AMI ID: GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66)
In HISTORY.rc I turned on only these collections:
(1) SpeciesConc_avg: only archived SpeciesConc_NO
(2) SpeciesConc_inst: only archived SpeciesConc_NO
(3) StateMet_avg: only archived Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: only archived Met_AD
This run (1 hour) on 6 cores finished with all timing information:
GIGCenv total: 0.346
GIGCchem total: 123.970
Dynamics total: 18.741
GCHP total: 140.931
HIST total: 0.264
EXTDATA total: 133.351
So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Maybe we are hitting the memory limits of the instance? I don't know.
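To check the memory hypothesis, the "Memuse(MB) at MAPL_Cap:TimeLoop=" lines that MAPL prints each time step (as in the log above) can be scraped from the run log and compared across steps. A minimal sketch, assuming only that the lines keep the format shown in this thread:

```python
import re

# Matches lines like:
#   Memuse(MB) at MAPL_Cap:TimeLoop= 4.723E+03 4.494E+03 2.306E+03 ...
MEMUSE_RE = re.compile(r"Memuse\(MB\) at MAPL_Cap:TimeLoop=\s*(.*)")

def memuse_per_step(log_text):
    """Return one list of memory readings (MB) per time step found in the log."""
    readings = []
    for line in log_text.splitlines():
        m = MEMUSE_RE.search(line)
        if m:
            readings.append([float(v) for v in m.group(1).split()])
    return readings

if __name__ == "__main__":
    sample = (
        "AGCM Date: 2016/07/01 Time: 00:10:00\n"
        "Memuse(MB) at MAPL_Cap:TimeLoop= 4.723E+03 4.494E+03 "
        "2.306E+03 2.684E+03 3.260E+03\n"
    )
    for step, vals in enumerate(memuse_per_step(sample)):
        print(step, max(vals))  # peak reading per step
```

If the per-step peak grows steadily toward the instance's RAM before the crash, that would support the memory-limit explanation.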
This AMI was built with MPICH2. Maybe it is worth trying OpenMPI on the cloud?
Also note: this run finished without dropping a core file (as currently happens on Odyssey). So that appears to be an Odyssey-specific environment problem.
But if I run with no diagnostics turned on, then the run dies at 10 minutes.
From the traceback it looks as if it's hanging while interpolating a field in ExtData.