Closed JiaweiZhuang closed 5 years ago
We have run into this problem in the past. It is an issue with cleanup of state_met_mod (State_Met Finalize) and only occurs when the model was built with gfortran 6+. What version of gfortran are you using? I will test with it on Odyssey in our dev branch.
I have tested with gfortran 7.2.1, 7.3.0, and 7.3.1; all have the same issue.
I successfully compiled and ran to completion with gfortran 7.1.0 on Odyssey. Are you sure you are using 12.1.0 and not 12.0.0? There was an issue in 12.0.0 that caused this symptom and it was fixed in 12.0.1 with https://github.com/geoschem/geos-chem/commit/8084db057df1894e0fc3aa8c43883911b46921e6.
Yes I am using 12.1.0. I will build a public GCHP AMI today and you can use it for debugging. Does this sound good?
Thanks, one of us will take a look. For what it's worth I also tried 8.2.0 on Odyssey with success.
OK, I've made a public AMI: ami-0a5973f14aad7413a in region US East (N. Virginia). Use at least an r5.2xlarge instance to provide enough memory. After login, simply run:
$ cd ~/tutorial/gchp_standard
$ mpirun -np 6 ./geos | tee run_gchp.log
I am still writing formal documentation but this should be enough for testing.
Interestingly, after I fixed the restart file issue per https://github.com/geoschem/gchp/issues/8#issuecomment-446026513, not even Times for GIGCenv is printed. Now the run terminates even earlier:
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
Writing: 510 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
Writing: 510 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4
MAPL_CFIOWriteBundlePost 1908
HistoryRun 2947
MAPL_Cap 833
Full log files (which only differ in the last few lines). Use ami-0f44e999c80ef6e66 to reproduce this issue.
Update: this is not just about incomplete timing info. The run simply crashes at hour 1. See https://github.com/geoschem/gchp/issues/8#issuecomment-446044798
This new issue (crash after a 1-hour run) will be addressed in https://github.com/geoschem/gchp/issues/8. To restrict the problem to the cleanup_state_met issue, you can turn off all diagnostic collections in HISTORY.rc.
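For example, collections are declared in the COLLECTIONS block near the top of HISTORY.rc, and commenting an entry out with # disables it. A sketch using the collection names from this run (your file may list others):

```
COLLECTIONS: #'SpeciesConc_avg',
             #'SpeciesConc_inst',
             #'StateMet_avg',
             #'StateMet_inst',
::
```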
Have been looking at this issue. I ran on the Amazon cloud on an r5.2xlarge instance with AMI ID GCHP12.1.0_tutorial_20181210 (ami-0f44e999c80ef6e66). In HISTORY.rc I turned on only these collections:
(1) SpeciesConc_avg: archived only SpeciesConc_NO
(2) SpeciesConc_inst: archived only SpeciesConc_NO
(3) StateMet_avg: archived only Met_AD, Met_OPTD, Met_PSC2DRY, Met_PSC2WET, Met_SPHU, Met_TropHt, Met_TropLev, Met_TropP
(4) StateMet_inst: archived only Met_AD
This run (1 hour) on 6 cores finished with all timing information:
GIGCenv total: 0.346, GIGCchem total: 123.970, Dynamics total: 18.741, GCHP total: 140.931, HIST total: 0.264, EXTDATA total: 133.351
So I am wondering if this is a memory issue. If we select fewer than a certain number of diagnostics, the run seems to finish fine. Maybe this is OK for the GCHP tutorial, but there doesn't seem to be much rhyme or reason as to why requesting more diagnostics fails. Perhaps it is hitting the memory limits of the instance? I don't know.
This AMI was built with MPICH2. It may be worth trying OpenMPI on the cloud.
Also note: this run finished without dropping a core file (as currently happens on Odyssey), so that part appears to be an Odyssey-specific environment problem.
Also, if I run with no diagnostics turned on, then the run dies at 10 minutes:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% USING O3 COLUMNS FROM THE MET FIELDS! %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- RDAER: Using online SO4 NH4 NIT!
- RDAER: Using online BCPI OCPI BCPO OCPO!
- RDAER: Using online SALA SALC
- DO_STRAT_CHEM: Linearized strat chemistry at 2016/07/01 00:00
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
- LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.638E+03 4.409E+03 2.223E+03 2.601E+03 3.258E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.823E+04 0.000E+00
MAPL_ExtDataInterpField 3300
EXTDATA::Run_ 1471
MAPL_Cap 777
application called MPI_Abort(MPI_COMM_WORLD, 21856) - process 0
From the traceback it looks as if it's hanging in interpolating a field in ExtData.
The run also crashes at 00:10 if I save only one collection, SpeciesConc_inst, containing only the two species SpeciesConc_NO and SpeciesConc_O3:
--- Chemistry done!
--- Do wetdep now
--- Wetdep done!
Setting history variable pointers to GC and Export States:
SpeciesConc_NO
SpeciesConc_O3
AGCM Date: 2016/07/01 Time: 00:10:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.723E+03 4.494E+03 2.306E+03 2.684E+03 3.260E+03
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.852E+04 0.000E+00
offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3 54
DYNAMICSRun 703
GCHP::Run 407
MAPL_Cap 792
But with two collections, SpeciesConc_avg and SpeciesConc_inst, each containing only the two species SpeciesConc_NO and SpeciesConc_O3, the run is able to finish and print full timing information:
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0530z.nc4
Writing: 144 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0600z.nc4
Times for GIGCenv
TOTAL : 2.252
INITIALIZE : 0.000
RUN : 2.250
GenInitTot : 0.004
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.000
--GenFinalMine : 0.000
GenRecordTot : 0.001
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::Input_Opt Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 505.760
INITIALIZE : 3.617
RUN : 498.376
FINALIZE : 0.000
DO_CHEM : 488.864
CP_BFRE : 0.121
CP_AFTR : 4.080
GC_CONV : 36.070
GC_EMIS : 0.000
GC_DRYDEP : 0.119
GC_FLUXES : 0.000
GC_TURB : 17.966
GC_CHEM : 403.528
GC_WETDEP : 19.443
GC_DIAGN : 0.000
GenInitTot : 2.719
--GenInitMine : 2.719
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.963
--GenFinalMine : 0.963
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 815.4433 0.0000 815.4433
COMM_TOTAL 3.3098 0.0000 3.3098
COMM_TRAC 3.3097 0.0000 3.3097
FV_TP_2D 90.1448 0.0000 90.1448
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3126 RUNNING AT ip-172-31-0-74
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Based on these reports, I think the issue of the incomplete timing report is not related to cleaning up the State_Met array. I ran into similar symptoms sometimes, but not always, on the Harvard Odyssey cluster when there was a problem writing out diagnostic files. I found that if I reduced the total memory across all diagnostics, the run went to completion. Sometimes when the problem occurred, the log also showed a problem writing the timing information.
This issue started on the Odyssey cluster after the operating system upgrade from CentOS6 to CentOS7. We were using OpenMPI 1.10.3 when the switch happened. Upgrading to OpenMPI 2 did not fix the issue, but upgrading to OpenMPI 3 (either 3.0 or 3.1) corrected the problem for unknown reasons. I therefore think this issue is related to OS and MPI version compatibility, although we never determined the cause, so this is just a theory.
Tested with OpenMPI 2.2.1; the full timing is still not printed (#10).
This is choking in Cleanup_State_Met.
I suggest seeing if this issue goes away after upgrading to OpenMPI 3.
If it does not, adding print statements throughout cleanup_state_met to see where it dies should give insight into whether cleanup of a specific met field is the problem. Many State_Met fields point to MAPL imports, so proper deallocation/nullification can be an issue due to the ordering of State_Met and import cleanup.
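A minimal sketch of that debugging approach, assuming the per-field cleanup pattern in state_met_mod.F90 (the field names here are illustrative, not the actual field list):

```
! Inside Cleanup_State_Met: bracket each field's cleanup with a print,
! so the last line in the log identifies the deallocation that dies.
WRITE( 6, '(a)' ) '### Cleanup_State_Met: AD'
IF ( ASSOCIATED( State_Met%AD ) ) DEALLOCATE( State_Met%AD )
State_Met%AD => NULL()

WRITE( 6, '(a)' ) '### Cleanup_State_Met: ALBD'
! For fields that point to MAPL imports, nullify only; deallocating
! memory that MAPL owns (and will free itself) can cause exactly this
! kind of finalization crash.
State_Met%ALBD => NULL()
```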
Tested with gcc 7.3.0 + OpenMPI 3.1.3; the timing info is still incomplete. Again it stops at Times for GIGCenv.
run_openmpi3_default_diags.log
This can be reproduced in ami-01074a30392daa0f9. I cannot use the tricks that worked for MPICH3 (saving only SpeciesConc_inst and SpeciesConc_avg, and turning off emissions, #10) to get full timing with OpenMPI3. It always terminates at Times for GIGCenv.
I discovered that we were not deallocating State_Diag at the end of a GCHP run. So I now pass State_Diag from Chem_GridCompMod to routine GIGC_CHUNK_FINAL; within GIGC_CHUNK_FINAL, we call Cleanup_State_Diag. This now prints out all timing info, but the run still drops core at the end, as in #11.
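Sketched roughly, that fix looks like the following (argument lists abbreviated; treat the exact interfaces as assumptions based on the description above, not the committed code):

```
! In Chem_GridCompMod: pass State_Diag down at finalization
CALL GIGC_Chunk_Final( Input_Opt, State_Chm, State_Diag, State_Met, RC )

! In GIGC_CHUNK_FINAL: free the diagnostics state with the others
CALL Cleanup_State_Diag( State_Diag, RC )
CALL Cleanup_State_Chm ( Input_Opt, State_Chm, RC )
CALL Cleanup_State_Met ( State_Met, RC )
```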
Here is a snippet of the log file at the last timestep of the run:
AGCM Date: 2016/07/01 Time: 03:00:00
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0230z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0300z.nc4
Times for GIGCenv
TOTAL : 1.013
INITIALIZE : 0.000
RUN : 1.009
GenInitTot : 0.003
--GenInitMine : 0.003
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.001
--GenFinalMine : 0.001
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
Character Resource Parameter GIGCchem_INTERNAL_CHECKPOINT_TYPE: pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint_c24.nc
Times for GIGCchem
TOTAL : 254.818
INITIALIZE : 6.647
RUN : 244.785
FINALIZE : 0.000
DO_CHEM : 240.695
CP_BFRE : 0.048
CP_AFTR : 1.741
GC_CONV : 15.002
GC_EMIS : 12.660
GC_DRYDEP : 0.042
GC_FLUXES : 0.000
GC_TURB : 8.096
GC_CHEM : 193.611
GC_WETDEP : 10.119
GC_DIAGN : 0.000
GenInitTot : 2.440
--GenInitMine : 2.440
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.744
--GenFinalMine : 0.744
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
-----------------------------------------------------
Block User time System Time Total Time
-----------------------------------------------------
TOTAL 506.1147 0.0000 506.1147
COMM_TOTAL 1.1536 0.0000 1.1536
COMM_TRAC 1.1536 0.0000 1.1536
FV_TP_2D 40.6695 0.0000 40.6695
Times for DYNAMICS
TOTAL : 56.285
INITIALIZE : 2.143
RUN : 56.273
FINALIZE : 0.000
GenInitTot : 0.011
--GenInitMine : 0.011
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.001
--GenFinalMine : 0.001
GenRecordTot : 0.001
--GenRecordMine : 0.001
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for GCHP
TOTAL : 310.187
RUN : 301.969
GIGCenv : 1.015
GIGCchem : 252.394
DYNAMICS : 58.406
GenInitTot : 7.447
--GenInitMine : 0.429
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.775
--GenFinalMine : 0.001
GenRecordTot : 0.002
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for HIST
TOTAL : 61.884
Initialize : 0.268
Finalize : 0.002
Run : 61.613
--Couplers : 0.611
--I/O : 61.092
----IO Create : 1.741
----IO Write : 61.092
-----IO Post : 0.722
-----IO Wait : 8.456
-----IO Write : 52.216
-ParserRun : 0.000
GenInitTot : 0.000
--GenInitMine : 0.000
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.002
--GenFinalMine : 0.002
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
Times for EXTDATA
TOTAL : 139.762
Initialize : 7.516
Run : 132.240
-Read_Loop : 131.591
--CheckUpd : 0.103
--Read : 121.072
--Swap : 0.005
--Bracket : 15.497
-Interpolate : 0.649
GenInitTot : 0.005
--GenInitMine : 0.005
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.004
--GenFinalMine : 0.004
GenRecordTot : 0.000
--GenRecordMine : 0.000
GenRefreshTot : 0.000
--GenRefreshMine : 0.000
NOT using buffer I/O for file: cap_restart
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f74b41fd2da in ???
#1 0x7f74b41fc503 in ???
#2 0x7f74b3639f1f in ???
#3 0x7f74b20ff6b4 in ???
#4 0x7f74b20c2151 in ???
#5 0x7f74b2053016 in ???
#6 0x7f74b1fa76e1 in ???
#7 0x7f74b20512af in ???
#8 0x7f74b20ff982 in ???
#9 0x7f74b2051c10 in ???
#10 0x7f74b1fa3ff4 in ???
#11 0x7f74b1f3cacc in ???
#12 0x7f74b363e040 in ???
#13 0x7f74b363e139 in ???
#14 0x7f74b438e083 in ???
#15 0x561eb68d3cef in ???
#16 0x561eb66991be in ???
#17 0x7f74b361cb96 in ???
#18 0x561eb66991e9 in ???
#19 0xffffffffffffffff in ???
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 0 on
node ip-172-31-88-121 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the mpirun command line.
It seems that the code is exiting the Finalize_ routine of Chem_GridCompMod.F90 normally, since the end of PET00000.GEOSCHEMchem.log is written:
### GIGCchem::Finalize_ | Cleanup on PET # 00000
I also put some debug output in GIGCGridCompMod::Finalize, and it appears to be calling GIGC_CHUNK_FINAL and destroying the state objects properly. So wherever the core dump occurs, it is maybe a level or two higher.
In the meantime, I'll push my edits for deallocating State_Diag to a bugfix branch of GCHP so that we can use it going forward.
Thanks for the fix. So the cloud should now have exactly the same behavior as Odyssey?
Yes, I think the cloud will behave the same.
One clarification: I updated the GCC and GCHP folders to 12.1.1, then rebuilt with "make clean_gc; make compile_standard".
Pushed the fix to deallocate State_Diag to the bugfix/GCHP_issues branch. This will be merged into the next GCHP version.
I've successfully run GCHP 12.1.0 on AWS, with both Ubuntu (gcc 7.3.0) and CentOS (gcc 7.3.1). CentOS was already working before; I am glad that Ubuntu now also works (again with lots of dirty fixes). I strongly prefer Ubuntu as it has a lot more pre-packaged libraries and the environment is a lot faster to build (no need to compile libraries from source).
Full logs for record:
However, for both OSes, the log files only show the timing for GIGCenv and no other components.
Does the same issue happen on Odyssey?