geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] GCHP c48 run with gfortran 8.2 on Odyssey hangs before end of run #13

Closed yantosca closed 5 years ago

yantosca commented 5 years ago

I tried running a GCHP C48 run on Odyssey but the job hung right after printing out the GIGCenv timer results.

AGCM Date: 2016/07/01  Time: 01:00:00

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:      OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
 Writing:     72 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
 Writing:     72 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4

  Times for GIGCenv
TOTAL                   :       1.069
INITIALIZE              :       0.000
RUN                     :       0.418
etc.

HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc

The script I used to submit the job is: gchp.run.txt

And here is the full log: gchp.log.txt

A similar run (done by @lizziel) with ifort 17.0.4 instead of gfortran 8.2 finished OK. I am wondering whether gfortran is not fully compatible with MAPL (or at least it seems to expose issues that we don't see when using ifort).

lizziel commented 5 years ago

Was the restart file successfully written? Also, is this only gfortran 8.2, or earlier versions as well?

yantosca commented 5 years ago

This is gfortran 8.2; I did not test earlier versions.

I think the restart files were written OK.

    256980 2018-12-19 15:41 gcchem_internal_checkpoint_c48.nc
2059258836 2018-12-19 15:38 gcchem_internal_checkpoint_c48.nc.20160701_0000z.bin

lizziel commented 5 years ago

Hmm, it is odd that the end-of-run restart file is so much smaller than the initial checkpoint file. How does it compare to the initial restart file? If you view the output restart file does it look okay? I’d say the file size looks suspicious.

yantosca commented 5 years ago

The restart file doesn't have any coordinate values; every entry is a fill value (_):

data:

 lon = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;

 lat = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;

 lev = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;

 time = _ ;

So it looks like something is going wrong when the restart file is written. An out-of-bounds error might be what's causing it.

lizziel commented 5 years ago

Have you tried compiling with debug flags? I think the setting to change is BOPT in GCHP/Shared/Config/ESMA_base.mk, which should be set to 'g'.

lizziel commented 5 years ago

I think you can also add extra flags to the Fortran flags section of GIGC.mk in the run directory. Some ideas for what to use are at https://stackoverflow.com/questions/3676322/what-flags-do-you-set-for-your-gfortran-debugger-compiler-to-catch-faulty-code. This may help with https://github.com/geoschem/gchp/issues/11 and https://github.com/geoschem/gchp/issues/14 as well.
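
As a rough illustration of what runtime checking flags catch, here is a toy program (not GCHP code; the names bounds_check_demo, frac, and bump are made up for this sketch) that writes one element past the end of an array. The gfortran options named in the comments (-g, -O0, -fcheck=bounds, -fbacktrace) are standard gfortran flags; how they are passed through GIGC.mk or ESMA_base.mk depends on the build setup and is not shown here.

```fortran
! Toy example only: demonstrates an out-of-bounds write that gfortran
! can trap at run time when bounds checking is enabled.
!
! Without checking (the error may go unnoticed or corrupt memory):
!   gfortran bounds_check_demo.f90 -o demo
! With checking (the run should stop at the bad access):
!   gfortran -g -O0 -fcheck=bounds -fbacktrace bounds_check_demo.f90 -o demo
program bounds_check_demo
  implicit none

  integer :: frac(3)   ! hypothetical 3-element array

  frac = (/ 10, 20, 30 /)

  ! Index 4 is outside the declared bounds 1:3
  call bump( frac, 4 )

  print *, 'frac = ', frac

contains

  ! Increment one element of arr; the index is passed in at run time,
  ! so the compiler cannot flag the error at compile time.
  subroutine bump( arr, idx )
    integer, intent(inout) :: arr(:)
    integer, intent(in)    :: idx

    arr(idx) = arr(idx) + 1
  end subroutine bump

end program bounds_check_demo
```

With bounds checking on, the program should abort with a runtime error that names the array and the offending index, which is usually enough to locate the faulty line.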

yantosca commented 5 years ago

This issue (and also #14) appears to have been caused by an out-of-bounds error in the Olson landmap module. The variable maxFracInd was zero but should not have been. I added a quick fix in the GEOS-Chem "Classic" repo in GeosCore/olson_landmap_mod.F90:

       ! Get IUSE type index with maximum coverage [mil]
       ! NOTE: MaxFracInd is a vector of size 1!
       maxFracInd  = MAXLOC(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))

!-------------------------------------------------------------------------------
! Prior to 12/20/18:
! Rewrite IF statement to avoid out-of-bounds error (bmy, 12/20/18)
!       ! Force IUSE to sum to 1000 by updating max value if necessary
!       sumIUSE =  SUM(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))
!       IF ( sumIUSE /= 1000 ) THEN
!          State_Met%IUSE(I,J,maxFracInd) = State_Met%IUSE(I,J,maxFracInd) &
!                                           + ( 1000 - sumIUSE )
!       ENDIF
!-------------------------------------------------------------------------------

       ! Force IUSE to sum to 1000 by updating max value if necessary
       ! Also put an error trap on maxFracInd to avoid out-of-bounds errors
       ! (bmy, 12/20/18)
       sumIUSE =  SUM(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))
       IF ( sumIUSE /= 1000 .and. maxFracInd(1) > 0 ) THEN
          State_Met%IUSE(I,J,maxFracInd(1)) =                             &
          State_Met%IUSE(I,J,maxFracInd(1)) + ( 1000 - sumIUSE )
       ENDIF

With this fix, a C48 simulation finished properly on Odyssey, printing out all timing info.
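
For context, here is a minimal sketch of one way MAXLOC can return zero. It assumes (this is not confirmed in the thread) that IREG is 0 for some grid box, so that the section passed to MAXLOC is empty; the Fortran standard defines the result of MAXLOC on a zero-size array to be zero. The program name, the subroutine force_sum, and all data values are made up for illustration. It uses a nested IF rather than a compound condition, which is one of several ways to write the same guard.

```fortran
! Toy example only: shows MAXLOC on an empty array section and a guard
! against using the resulting zero as an array index.
program maxloc_empty_demo
  implicit none

  integer :: iuse(5)   ! hypothetical land-type fractions [mil] for one grid box

  ! Case 1: three land types whose fractions sum to 990
  iuse = (/ 200, 500, 290, 0, 0 /)
  call force_sum( iuse, 3 )
  print *, 'iuse after case 1: ', iuse

  ! Case 2: no land types at all, so the section iuse(1:0) is empty
  iuse = 0
  call force_sum( iuse, 0 )
  print *, 'iuse after case 2: ', iuse

contains

  ! Force the first ireg entries of iuse to sum to 1000 by adjusting
  ! the largest one, skipping the update when there is no valid index.
  subroutine force_sum( iuse, ireg )
    integer, intent(inout) :: iuse(:)
    integer, intent(in)    :: ireg

    integer :: maxFracInd(1)   ! MAXLOC returns a rank-1 array of size 1
    integer :: sumIUSE

    maxFracInd = maxloc( iuse(1:ireg) )
    sumIUSE    = sum( iuse(1:ireg) )

    print *, 'ireg =', ireg, ' maxFracInd =', maxFracInd(1), ' sum =', sumIUSE

    if ( sumIUSE /= 1000 ) then
       if ( maxFracInd(1) > 0 ) then
          iuse(maxFracInd(1)) = iuse(maxFracInd(1)) + ( 1000 - sumIUSE )
       else
          print *, '  empty section: nothing to adjust'
       endif
    endif
  end subroutine force_sum

end program maxloc_empty_demo
```

In the first case the largest entry (index 2) is raised so that the fractions sum to 1000; in the second case the guard skips the update instead of indexing element 0.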

It appears the Olson land map data is not being read in properly, which is the root cause of this issue. I am investigating this.

yantosca commented 5 years ago

I am closing this thread because the root cause is #15. Fixing #15 will also fix this issue.