Closed yantosca closed 5 years ago
Was the restart file successfully written? Also, is this only gfortran 8.2, or earlier versions as well?
-- Lizzie Lundgren
Scientific Programmer
GEOS-Chem Support Team
geos-chem-support@as.harvard.edu
http://wiki.geos-chem.org/GEOS-Chem_Support_Team

Please direct all GEOS-Chem support issues to the entire GEOS-Chem Support Team at geos-chem-support@as.harvard.edu. This will allow us to serve you better.
Bob Yantosca wrote on Wednesday, December 19, 2018 at 4:21 PM, in [geoschem/gchp] GCHP c48 run with gfortran 8.2 on Odyssey hangs before end of run (#13):
I tried running a GCHP C48 run on Odyssey but the job hung right after printing out the GIGCenv timer results.
AGCM Date: 2016/07/01 Time: 01:00:00

Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_avg.20160701_0030z.nc4
Writing: 11592 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
Writing: 72 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_avg.20160701_0030z.nc4
Writing: 72 Slices ( 1 Nodes, 1 PartitionRoot) to File: OutputDir/GCHP.StateMet_inst.20160701_0100z.nc4

Times for GIGCenv
TOTAL : 1.069
INITIALIZE : 0.000
RUN : 0.418
GenInitTot : 0.650
--GenInitMine : 0.650
GenRunTot : 0.000
--GenRunMine : 0.000
GenFinalTot : 0.000
--GenFinalMine : 0.000
GenRecordTot : 0.001
--GenRecordMine : 0.001
GenRefreshTot : 0.000
--GenRefreshMine : 0.000

HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc
The script I used to submit the job is gchp.run.txt: https://github.com/geoschem/gchp/files/2696506/gchp.run.txt
And here is the full log, gchp.log.txt: https://github.com/geoschem/gchp/files/2696507/gchp.log.txt
A similar run (done by @lizziel) with ifort 17.0.4 instead of gfortran 8.2 finished OK. I am wondering whether gfortran is not fully compatible with MAPL (or at least it seems to produce issues that we don't see when using ifort).
This is gfortran 8.2, did not test earlier versions.
I think the restart files were written OK.
256980 2018-12-19 15:41 gcchem_internal_checkpoint_c48.nc
2059258836 2018-12-19 15:38 gcchem_internal_checkpoint_c48.nc.20160701_0000z.bin
Hmm, it is odd that the end-of-run restart file is so much smaller than the initial checkpoint file. How does it compare to the initial restart file? If you view the output restart file does it look okay? I’d say the file size looks suspicious.
The restart file doesn't have any coordinates:
data:
lon = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;
lat = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;
lev = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;
time = _ ;
So it looks like something is going wrong when the restart file is written. An out-of-bounds error might be what is causing it.
Have you tried compiling with debug flags? I think the setting to change is BOPT in GCHP/Shared/Config/ESMA_base.mk, which should be set to 'g'.
I also think you can add extra flags to the Fortran flags section of GIGC.mk in the run directory. Some ideas for what to use are at https://stackoverflow.com/questions/3676322/what-flags-do-you-set-for-your-gfortran-debugger-compiler-to-catch-faulty-code. Maybe this will help with https://github.com/geoschem/gchp/issues/11 and https://github.com/geoschem/gchp/issues/14 as well.
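To illustrate what such flags buy you, here is a small standalone sketch, not GCHP code. The program name, the array landFrac, and its values are made up for the example. The bug is deliberate: a search finds nothing and leaves an index at 0, which is then used as a subscript. The options in the comments (-g, -O0, -fcheck=all, -fbacktrace) are standard gfortran options. Where exactly they belong in the GCHP build files is not shown here.

```fortran
! Deliberately buggy demo: the index stays 0 when the search finds nothing.
! Suggested comparison (standard gfortran options):
!   gfortran -O2 oob_demo.f90 -o oob_fast && ./oob_fast
!   gfortran -g -O0 -fcheck=all -fbacktrace oob_demo.f90 -o oob_debug && ./oob_debug
program oob_demo
  implicit none
  integer :: landFrac(4)   ! hypothetical data, not a GEOS-Chem array
  integer :: i, idx

  landFrac = [100, 200, 300, 400]

  ! Look for the first entry above 500; none exists, so idx stays 0
  idx = 0
  do i = 1, size(landFrac)
     if (landFrac(i) > 500) then
        idx = i
        exit
     end if
  end do

  ! Out-of-bounds access when idx == 0: undefined behavior in an optimized
  ! build, a reported runtime error when bounds checking is enabled
  landFrac(idx) = landFrac(idx) + 1

  print *, 'idx =', idx, ' landFrac =', landFrac
end program oob_demo
```

The optimized build may appear to run normally or may misbehave later, in unrelated code. The build with -fcheck=all should instead stop at the offending statement with a runtime error, and -fbacktrace adds a traceback pointing at the source line. That makes it much easier to tell a real hang from memory corruption that only shows up at shutdown.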
This issue (and also #14) appears to have been caused by an out-of-bounds error in the Olson landmap module. The variable maxFracInd was zero but should not have been. I added a quick fix in the GEOS-Chem "Classic" repo in GeosCore/olson_landmap_mod.F90:
! Get IUSE type index with maximum coverage [mil]
! NOTE: MaxFracInd is a vector of size 1!
maxFracInd = MAXLOC(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))

!-------------------------------------------------------------------------------
! Prior to 12/20/18:
! Rewrite IF statement to avoid out-of-bounds error (bmy, 12/20/18)
!   ! Force IUSE to sum to 1000 by updating max value if necessary
!   sumIUSE = SUM(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))
!   IF ( sumIUSE /= 1000 ) THEN
!      State_Met%IUSE(I,J,maxFracInd) = State_Met%IUSE(I,J,maxFracInd) &
!                                     + ( 1000 - sumIUSE )
!   ENDIF
!-------------------------------------------------------------------------------

! Force IUSE to sum to 1000 by updating max value if necessary
! Also put an error trap on maxFracInd to avoid out-of-bounds errors
! (bmy, 12/20/18)
sumIUSE = SUM(State_Met%IUSE(I,J,1:State_Met%IREG(I,J)))
IF ( sumIUSE /= 1000 .and. maxFracInd(1) > 0 ) THEN
   State_Met%IUSE(I,J,maxFracInd(1)) = &
      State_Met%IUSE(I,J,maxFracInd(1)) + ( 1000 - sumIUSE )
ENDIF
With this fix, a C48 simulation finished properly on Odyssey, printing out all timing info.
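One way maxFracInd can end up as zero is worth spelling out: in standard Fortran (2003 and later), MAXLOC applied to a zero-sized array returns zeros. If the land type count for a grid box is 0, the slice 1:0 is empty and the returned index is 0. The sketch below reproduces that in isolation with the same guard as the fix above. The arrays frac and n are hypothetical stand-ins for State_Met%IUSE(I,J,:) and State_Met%IREG(I,J). Whether IREG was actually 0 in the failing run is an assumption, not something this thread confirms.

```fortran
program maxloc_guard_demo
  implicit none
  integer :: frac(5)   ! stand-in for State_Met%IUSE(I,J,:)
  integer :: n         ! stand-in for State_Met%IREG(I,J)
  integer :: idx(1)

  ! Case 1: nothing was read in, so the slice 1:n is empty
  frac = 0
  n    = 0
  idx  = maxloc(frac(1:n))
  print *, 'empty slice: maxloc index =', idx(1)
  call force_sum_1000(frac, n)      ! the guard makes this a no-op

  ! Case 2: two land types whose fractions sum to 990
  frac = [600, 390, 0, 0, 0]
  n    = 2
  call force_sum_1000(frac, n)
  print *, 'normalized:', frac(1:n)

contains

  ! Add the shortfall to the largest entry, but only if a valid
  ! (nonzero) index was found
  subroutine force_sum_1000(f, nTypes)
    integer, intent(inout) :: f(:)
    integer, intent(in)    :: nTypes
    integer :: maxInd(1), total

    maxInd = maxloc(f(1:nTypes))
    total  = sum(f(1:nTypes))
    if (total /= 1000 .and. maxInd(1) > 0) then
       f(maxInd(1)) = f(maxInd(1)) + (1000 - total)
    end if
  end subroutine force_sum_1000

end program maxloc_guard_demo
```

In the first case the index is 0 and the update is skipped. In the second case the largest entry is raised from 600 to 610, so the fractions sum to 1000. The guard only hides the symptom. An empty or zero land map still means the input data were not read correctly.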
It appears the Olson land map data is not being read in properly, which is the root cause of this issue. I am investigating this.
I am closing this thread because the root cause is #15. Fixing #15 will also fix this issue.