Closed JiaweiZhuang closed 5 years ago
We have run into this problem in the past. It is an issue with cleanup of state_met_mod (State_Met Finalize) and only occurs when the model was built with gfortran 6+. What version of gfortran are you using? I will test with it on Odyssey in our dev branch.
I have tested with gfortran 7.2.1, 7.3.0, and 7.3.1; all have the same issue.
I successfully compiled and ran to completion with gfortran 7.1.0 on Odyssey. Are you sure you are using 12.1.0 and not 12.0.0? There was an issue in 12.0.0 that caused this symptom and it was fixed in 12.0.1 with https://github.com/geoschem/geos-chem/commit/8084db057df1894e0fc3aa8c43883911b46921e6.
Yes I am using 12.1.0. I will build a public GCHP AMI today and you can use it for debugging. Does this sound good?
Thanks, one of us will take a look. For what it's worth I also tried 8.2.0 on Odyssey with success.
OK, I've made a public AMI: ami-0a5973f14aad7413a in region US East (N. Virginia). Use at least an r5.2xlarge instance to provide enough memory. After login, simply run:
$ cd ~/tutorial/gchp_standard
$ mpirun -np 6 ./geos | tee run_gchp.log
I am still writing formal documentation but this should be enough for testing.
Interestingly, after I fixed the restart file issue per https://github.com/geoschem/gchp/issues/8#issuecomment-446026513, not even Times for GIGCenv is printed. Now the run terminates even earlier:
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
Writing: 510 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
Writing: 510 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4
MAPL_CFIOWriteBundlePost 1908
HistoryRun 2947
MAPL_Cap 833
Full log files (which only differ in the last few lines). Use ami-0f44e999c80ef6e66 to reproduce this issue.
Update: this is not just about incomplete timing info. The run simply crashes at hour 1. See https://github.com/geoschem/gchp/issues/8#issuecomment-446044798
This new issue (crash after a 1-hour run) will be addressed in https://github.com/geoschem/gchp/issues/8. To restrict the problem to the cleanup_state_met issue, you can turn off all diagnostic collections in HISTORY.rc.
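For example, collections are declared in the COLLECTIONS block near the top of HISTORY.rc, and commenting an entry out with # disables it. A sketch using the collection names from this run (your file may list others):

```
COLLECTIONS: #'SpeciesConc_avg',
             #'SpeciesConc_inst',
             #'StateMet_avg',
             #'StateMet_inst',
::
```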
Have been looking at this issue. I ran on the Amazon cloud on an r5.2xlarge instance with AMI ID GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66). In HISTORY.rc I turned on only these collections:
(1) SpeciesConc_avg: archived only SpeciesConc_NO
(2) SpeciesConc_inst: archived only SpeciesConc_NO
(3) StateMet_avg: archived only Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: archived only Met_AD
This run (1 hour) on 6 cores finished with all timing information:
GIGCenv total: 0.346, GIGCchem total: 123.970, Dynamics total: 18.741, GCHP total: 140.931, HIST total: 0.264, EXTDATA total: 133.351
So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Perhaps it is hitting the memory limits of the instance? I don't know.
This AMI was built with MPICH2. It may be worth trying OpenMPI on the cloud.
Also note: this run finished without dropping a core file (as currently happens on Odyssey), so that part appears to be an Odyssey-specific environment problem.
Also, if I run with no diagnostics turned on, then the run dies at 10 minutes:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- RDAER: Using online SO4 NH4 NIT!
- RDAER: Using online BCPI OCPI BCPO OCPO!
- RDAER: Using online SALA SALC
- DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
- LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.638E+03 4.409E+03 2.223E+03 2.601E+03 3.258E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.823E+04 0.000E+00
MAPL_ExtDataInterpField 3300
EXTDATA::Run_ 1471
MAPL_Cap 777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0
From the traceback it looks as if it's hanging in interpolating a field in ExtData.
The run also crashes at 00:10 if I save only one collection, SpeciesConc_inst, containing only the two species SpeciesConc_NO and SpeciesConc_O3:
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
SpeciesConc_NO
SpeciesConc_O3
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.723E+03 4.494E+03 2.306E+03 2.684E+03 3.260E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.852E+04 0.000E+00
offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3 54
DYNAMICSRun 703
GCHP::Run 407
MAPL_Cap 792
But with two collections, SpeciesConc_avg and SpeciesConc_inst, each containing only the two species SpeciesConc_NO and SpeciesConc_O3, the run is able to finish and print full timing information:
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4
Times for GIGCenv
TOTAL : 2.252
INITIALIZE : 0.000
RUN : 2.250
GenInitTot : 0.004
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.000
--GenFinalMine : 0.000
GenRecordTot : 0.001
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 505.760
INITIALIZE : 3.617
RUN : 498.376
FINALIZE : 0.000
DO_CHEM : 488.864
CP_BFRE : 0.121
CP_AFTR : 4.080
GC_CONV : 36.070
GC_EMIS : 0.000
GC_DRYDEP : 0.119
GC_FLUXES : 0.000
GC_TURB : 17.966
GC_CHEM : 403.528
GC_WETDEP : 19.443
GC_DIAGN : 0.000
GenInitTot : 2.719
--GenInitMine : 2.719
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.963
--GenFinalMine : 0.963
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 815.4433 0.0000 815.4433
COMM_TOTAL 3.3098 0.0000 3.3098
COMM_TRAC 3.3097 0.0000 3.3097
FV_TP_2D 90.1448 0.0000 90.1448
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3126 RUNNING AT ip-172-31-0-74
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Based on these reports, I think the issue of the incomplete timing report is not related to cleaning up the State_Met array. I ran into similar symptoms sometimes, but not always, on the Harvard Odyssey cluster when there was a problem writing out diagnostic files. I found that if I reduced the total memory across all diagnostics, the run went to completion. Sometimes when the problem occurred, the log also showed a problem writing the timing information.
This issue started on the Odyssey cluster after the operating system upgrade from CentOS6 to CentOS7. We were using OpenMPI 1.10.3 when the switch happened. Upgrading to OpenMPI 2 did not fix the issue, but upgrading to OpenMPI 3 (either 3.0 or 3.1) corrected the problem for unknown reasons. I therefore think this issue is related to OS and MPI version compatibility, although we never determined the cause, so this is just a theory.
Tested with OpenMPI 2.2.1; the full timing is still not printed (#10).
This is choking in Cleanup_State_Met.
I suggest seeing if this issue goes away after upgrading to OpenMPI 3.
If it does not, adding print statements throughout cleanup_state_met to see where it dies should give insight into whether cleanup of a specific met field is the problem. Many State_Met fields point to MAPL imports, so proper deallocation/nullification can be an issue due to the ordering of State_Met and import cleanup.
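A minimal sketch of that debugging approach, assuming the per-field cleanup pattern in state_met_mod.F90 (the field names here are illustrative, not the actual field list):

```
! Inside Cleanup_State_Met: bracket each field's cleanup with a print,
! so the last line in the log identifies the deallocation that dies.
WRITE( 6, '(a)' ) '### Cleanup_State_Met: AD'
IF ( ASSOCIATED( State_Met%AD ) ) DEALLOCATE( State_Met%AD )
State_Met%AD => NULL()

WRITE( 6, '(a)' ) '### Cleanup_State_Met: ALBD'
! For fields that point to MAPL imports, nullify only; deallocating
! memory that MAPL owns (and will free itself) can cause exactly this
! kind of finalization crash.
State_Met%ALBD => NULL()
```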
Tested with gcc 7.3.0 + OpenMPI 3.1.3; the timing info is still incomplete. Again it stops at Times for GIGCenv.
run_openmpi3_default_diags.log
This can be reproduced in ami-01074a30392daa0f9. I cannot use the tricks that worked for MPICH3 (saving only SpeciesConc_inst and SpeciesConc_avg, and turning off emissions, #10) to get full timing with OpenMPI3. It always terminates at Times for GIGCenv.
I discovered that we were not deallocating State_Diag at the end of a GCHP run. So I now pass State_Diag from Chem_GridCompMod to routine GIGC_CHUNK_FINAL; within GIGC_CHUNK_FINAL, we call Cleanup_State_Diag. This now prints out all timing info, but the run still drops core at the end, as in #11.
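Sketched roughly, that fix looks like the following (argument lists abbreviated; treat the exact interfaces as assumptions based on the description above, not the committed code):

```
! In Chem_GridCompMod: pass State_Diag down at finalization
CALL GIGC_Chunk_Final( Input_Opt, State_Chm, State_Diag, State_Met, RC )

! In GIGC_CHUNK_FINAL: free the diagnostics state with the others
CALL Cleanup_State_Diag( State_Diag, RC )
CALL Cleanup_State_Chm ( Input_Opt, State_Chm, RC )
CALL Cleanup_State_Met ( State_Met, RC )
```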
Here is a snippet of the log file at the last timestep of the run:
AGCM Date: 2016/07/01 Time: 03:00:00
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0230z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0300z.nc4
Times for GIGCenv
TOTAL : 1.013
INITIALIZE : 0.000
RUN : 1.009
GenInitTot : 0.003
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.001
--GenFinalMine : 0.001
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 254.818
INITIALIZE : 6.647
RUN : 244.785
FINALIZE : 0.000
DO_CHEM : 240.695
CP_BFRE : 0.048
CP_AFTR : 1.741
GC_CONV : 15.002
GC_EMIS : 12.660
GC_DRYDEP : 0.042
GC_FLUXES : 0.000
GC_TURB : 8.096
GC_CHEM : 193.611
GC_WETDEP : 10.119
GC_DIAGN : 0.000
GenInitTot : 2.440
--GenInitMine : 2.440
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.744
--GenFinalMine : 0.744
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 506.1147 0.0000 506.1147
COMM_TOTAL 1.1536 0.0000 1.1536
COMM_TRAC 1.1536 0.0000 1.1536
FV_TP_2D 40.6695 0.0000 40.6695
Times for DYNAMICS
TOTAL : 56.285
INITIALIZE : 2.143
RUN : 56.273
FINALIZE : 0.000
GenInitTot : 0.011
--GenInitMine : 0.011
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.001
--GenFinalMine : 0.001
GenRecordTot : 0.001
--GenRecordMine : 0.001
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for GCHP
TOTAL : 310.187
RUN : 301.969
GIGCenv : 1.015
GIGCchem : 252.394
DYNAMICS : 58.406
GenInitTot : 7.447
--GenInitMine : 0.429
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.775
--GenFinalMine : 0.001
GenRecordTot : 0.002
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for HIST
TOTAL : 61.884
Initialize : 0.268
Finalize : 0.002
Run : 61.613
--Couplers : 0.611
--I/O : 61.092
----IO Create : 1.741
----IO Write : 61.092
-----IO Post : 0.722
-----IO Wait : 8.456
-----IO Write : 52.216
-ParserRun : 0.000
GenInitTot : 0.000
--GenInitMine : 0.000
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.002
--GenFinalMine : 0.002
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for EXTDATA
TOTAL : 139.762
Initialize : 7.516
Run : 132.240
-Read_Loop : 131.591
--CheckUpd : 0.103
--Read : 121.072
--Swap : 0.005
--Bracket : 15.497
-Interpolate : 0.649
GenInitTot : 0.005
--GenInitMine : 0.005
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.004
--GenFinalMine : 0.004
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
NOT using buffer I/O for file: cap_restart
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f74b41fd2da in ???
#1 0x7f74b41fc503 in ???
#2 0x7f74b3639f1f in ???
#3 0x7f74b20ff6b4 in ???
#4 0x7f74b20c2151 in ???
#5 0x7f74b2053016 in ???
#6 0x7f74b1fa76e1 in ???
#7 0x7f74b20512af in ???
#8 0x7f74b20ff982 in ???
#9 0x7f74b2051c10 in ???
#10 0x7f74b1fa3ff4 in ???
#11 0x7f74b1f3cacc in ???
#12 0x7f74b363e040 in ???
#13 0x7f74b363e139 in ???
#14 0x7f74b438e083 in ???
#15 0x561eb68d3cef in ???
#16 0x561eb66991be in ???
#17 0x7f74b361cb96 in ???
#18 0x561eb66991e9 in ???
#19 0xffffffffffffffff in ???
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 0 on
node ip-172-31-88-121 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the mpirun command line.
It seems that the code is exiting the Finalize_ routine of Chem_GridCompMod.F90 normally, since the end of PET00000.GEOSCHEMchem.log is written:
### GIGCchem::Finalize_ | Cleanup on PET # 00000
I also put some debug output in GIGCGridCompMod::Finalize, and it appears to be calling GIGC_CHUNK_FINAL and destroying the state objects properly. So wherever the core dump occurs, it is maybe a level or two higher.
In the meantime, I'll push my edits for deallocating State_Diag to a bugfix branch of GCHP so that we can use it going forward.
Thanks for the fix. So the cloud should now have exactly the same behavior as Odyssey?
Yes, I think the cloud will behave the same.
One clarification: I updated the GCC and GCHP folders to 12.1.1, then rebuilt with "make clean_gc; make compile_standard".
Pushed the fix to deallocate State_Diag to the bugfix/GCHP_issues branch. This will be merged into the next GCHP version.
I've successfully run GCHP 12.1.0 on AWS, with both Ubuntu (gcc 7.3.0) and CentOS (gcc 7.3.1). CentOS was already working before; I am glad that Ubuntu now also works (again with lots of dirty fixes). I strongly prefer Ubuntu as it has a lot more pre-packaged libraries and the environment is a lot faster to build (no need to compile libraries from source).
Full logs for record:
However, for both OSes, the log files only show the timing for GIGCenv and no other components.
Does the same issue happen on Odyssey?