geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] Early termination at different points depending on diagnostics configuration #10

Closed JiaweiZhuang closed 5 years ago

JiaweiZhuang commented 5 years ago

Trying to summarize different behaviors in #6 #8 #9 as things are getting messy...

1. Crashes at the first time step (at 00:10)

Typical error message is

 Setting history variable pointers to GC and Export States:
 SpeciesConc_NO
 SpeciesConc_O3
 AGCM Date: 2016/07/01  Time: 00:10:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.723E+03  4.494E+03  2.306E+03  2.684E+03  3.260E+03
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.852E+04  0.000E+00
 offline_tracer_advection
ESMFL_StateGetPtrToDataR4_3                     54
DYNAMICSRun                                    703
GCHP::Run                                      407
MAPL_Cap                                       792

2. Crashes when writing the first diagnostics file (at 01:00)

Typical error message is

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
 Writing:    510 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
 Writing:    510 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4
                MAPL_CFIOWriteBundlePost      1908
HistoryRun                                    2947
MAPL_Cap                                       833
application called MPI_Abort(MPI_COMM_WORLD, 21944) - process 0

This doesn't seem to be a memory problem; it still happens on an r5.4xlarge with 128 GB of RAM.
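
(For reference, a quick way to rule out memory pressure is to watch usage from a second shell while the model runs; these are standard Linux tools, nothing GCHP-specific:)

watch -n 2 free -m            # overall memory and swap use every 2 seconds
ps -o pid,rss,comm -C geos    # resident set size of each geos rank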

3. Crashes right before printing timing information

Times for GIGCenv
...
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3126 RUNNING AT ip-172-31-0-74
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

By tweaking the amount of output, I can get more timing info printed, but the run still ends with BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES.

4. Runs to the end with the full timing information printed

This means no error message appears: the model prints all the way down to Times for EXTDATA and creates the new cap_restart files.

I can only make this happen with very tricky configurations:

Log file for the only successful run so far: run_two_collections_emission_off.log
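
(For context, diagnostic collections are switched on and off in the run directory's HISTORY.rc. A rough sketch of the MAPL-style syntax follows, written from memory with illustrative collection names; attribute names may differ in detail from this GCHP version:)

COLLECTIONS: 'SpeciesConc_avg',
             'SpeciesConc_inst',
#            'StateMet_avg',          # commenting a line out disables that collection
::

  SpeciesConc_avg.template:   '%y4%m2%d2_%h2%n2z.nc4',
  SpeciesConc_avg.frequency:  010000,
  SpeciesConc_avg.mode:       'time-averaged',
  SpeciesConc_avg.fields:     'SpeciesConc_NO', 'GIGCchem',
                              'SpeciesConc_O3', 'GIGCchem',
::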

Environment

All tests are performed with MPICH 3.3 and gcc 7.3.0 on Ubuntu 18.04 (ami-0a5973f14aad7413a).

I also have OpenMPI 2.1 working (scripts).

I consider OpenMPI 2.1 even worse, because basically no diagnostics can be archived with it; with MPICH the model at least functions in some cases.
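
(A quick sanity check that the intended MPI and compiler are the ones actually picked up at run time; these are standard commands, not GCHP-specific:)

mpirun --version    # reports MPICH 3.3 or Open MPI 2.1 / 3.1.3, depending on the environment
mpif90 -show        # shows the compiler the MPI wrapper invokes (gcc/gfortran 7.3.0 here)
gcc --version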

lizziel commented 5 years ago

To summarize the current status after switching from MPICH 3.3 to OpenMPI 2.1:

  1. All diagnostic collections off: timing info incomplete due to early termination (see https://github.com/geoschem/gchp/issues/6)
  2. Any diagnostic collections on: diagnostic write hang at end of run (this issue)

I suggest seeing if this issue goes away after upgrading to OpenMPI 3, since switching from OpenMPI 2 to OpenMPI 3 on the Odyssey cluster fixed this same issue after a change to a new operating system (CentOS 6 to CentOS 7).
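
(A typical from-source build of OpenMPI 3 looks roughly like the following; the version, download URL, and install prefix are illustrative rather than a record of what was done on Odyssey:)

wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.gz
tar xzf openmpi-3.1.3.tar.gz && cd openmpi-3.1.3
./configure --prefix=$HOME/opt/openmpi-3.1.3 CC=gcc CXX=g++ FC=gfortran
make -j4 && make install
export PATH=$HOME/opt/openmpi-3.1.3/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi-3.1.3/lib:$LD_LIBRARY_PATH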

JiaweiZhuang commented 5 years ago

With OpenMPI 3.1.3:

JiaweiZhuang commented 5 years ago

Because OpenMPI 3 seems to be the most robust configuration right now, I suggest focusing on fixing its timing issue and putting aside MPICH 3 and OpenMPI 2 for now.

I made another AMI (ami-01074a30392daa0f9) with OpenMPI 3.1.3:

cd ~/tutorial/gchp_standard
mpirun -np 6 -oversubscribe ./geos

-oversubscribe is needed with the OpenMPI 3 runtime when the number of physical cores is less than the number of MPI processes.
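
(A quick way to check whether -oversubscribe will be needed is to compare the physical core count to the number of ranks being requested; standard Linux commands, nothing GCHP-specific:)

lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)|^CPU\(s\)'
# e.g. with only 4 physical cores, -np 6 exceeds the core count, so:
mpirun -np 6 -oversubscribe ./geos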

lizziel commented 5 years ago

Thanks Jiawei, I am closing this issue since switching to OpenMPI 3 fixed the file write hang issue. The issue for incomplete timing info at the end of the run is still open and will be tracked separately in https://github.com/geoschem/gchp/issues/6.

JiaweiZhuang commented 5 years ago

I think we should put a warning on the wiki regarding issues with MPICH 3 and OpenMPI 2. On a lot of shared systems, users cannot install whatever MPI they want, unlike on the cloud.

lizziel commented 5 years ago

Yes, I completely agree. This will be done this week.
