Closed jimmielin closed 2 years ago
The timing output written by MAPL is indeed confusing. The three columns in the "Times for GCHPchem" from left to right correspond to minimum, mean, and maximum processor time. "GCHPchem" corresponds to the entire gridded component for GEOS-Chem, so includes other components besides chemistry, e.g. convection. To assess timing for a specific part of GEOS-Chem you should use the timers that start with prefix GC_.
Those timers are toggled on and off mostly in geos-chem/Interfaces/GCHP/gchp_chunk_mod.F90, but also in geos-chem/Interfaces/GCHP/Chem_GridCompMod.F90, using the MAPL subroutines MAPL_TimerOn and MAPL_TimerOff. When in doubt about what a timer is measuring, it is best to check the source code to see what calls it is wrapping. For example:
    IF ( DoConv ) THEN
       if(Input_Opt%AmIRoot.and.NCALLS<10) write(*,*) ' --- Do convection now'
       CALL MAPL_TimerOn( STATE, 'GC_CONV' )
       CALL DO_CONVECTION ( Input_Opt, State_Chm, State_Diag, &
                            State_Grid, State_Met, RC )
       _ASSERT(RC==GC_SUCCESS, 'Error calling DO_CONVECTION')
       CALL MAPL_TimerOff( STATE, 'GC_CONV' )
       if(Input_Opt%AmIRoot.and.NCALLS<10) write(*,*) ' --- Convection done!'
    ENDIF
For chemistry you should look at timer GC_CHEM, which includes the calls to compute overhead ozone, set H2O, and Do_Chemistry. Whether to use min, mean, or max is a judgement call.
Regarding the rest of the timing info, I am not exactly sure why the numbers do not match up, although I would expect it is because the various sub-timers do not break up the run in exact ways. @LiamBindle, do you remember if Tom explained this? Do you also remember the precise definitions for inclusive versus exclusive?
Inclusive refers to the time spent in that function, including the time spent in the child functions it calls. Exclusive refers to the time spent purely in that function, excluding time spent in called child functions.
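To make the inclusive/exclusive distinction concrete, here is a minimal sketch. The `Timer` class, the child timer names, and all numbers are made up for illustration; this is not how MAPL stores its timers, only the arithmetic implied by the definitions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Timer:
    """Hypothetical timer node: inclusive time plus any child timers."""
    name: str
    inclusive: float
    children: List["Timer"] = field(default_factory=list)

    def exclusive(self) -> float:
        # Exclusive time = own inclusive time minus the inclusive time
        # of each direct child.
        return self.inclusive - sum(c.inclusive for c in self.children)


# Made-up numbers: a parent "Run" timer wrapping two child timers.
run = Timer("Run", 100.0, [Timer("child_a", 60.0), Timer("child_b", 25.0)])
run_exclusive = run.exclusive()
print(f"{run.name}: inclusive={run.inclusive:.1f}, exclusive={run_exclusive:.1f}")
```

With these numbers the parent spends 15 of its 100 seconds outside its children, which is the part an exclusive column would report.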
The timers are tricky to interpret because the component times don't add up to the total run time (in an obvious way). I looked into this a while back, and I believe this is because of the load imbalance; MAPL handles the process synchronization, so the time a process spends waiting for other processes to finish chemistry is time spent in MAPL (not chemistry). I believe this sync waiting time is the bulk of the "Run" exclusive time.
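A toy model of that load-imbalance effect, assuming a single synchronization point after chemistry and four processes with made-up chemistry times (none of these values come from the run discussed here, and MAPL's actual bookkeeping may differ):

```python
# Hypothetical per-process chemistry times in seconds.
chem_seconds = [70.0, 85.0, 100.0, 90.0]

t_min = min(chem_seconds)
t_max = max(chem_seconds)
t_mean = sum(chem_seconds) / len(chem_seconds)

# Under the single-sync assumption, every process waits until the
# slowest one finishes; that wait is not counted by the chemistry timer.
wait_seconds = [t_max - t for t in chem_seconds]
mean_wait = sum(wait_seconds) / len(wait_seconds)

print(f"chemistry min/mean/max: {t_min:.2f} / {t_mean:.2f} / {t_max:.2f}")
print(f"mean time spent waiting at the sync point: {mean_wait:.2f}")
```

In this simplified picture the mean wait equals max minus mean, so a wide spread between the min and max columns of a timer suggests a correspondingly large amount of time attributed to the parent rather than to chemistry.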
The way I interpret the timers currently is:
If you add these up you get very close to the total run time: 2242.760+926.209+391.262+165.639+1572.164=5298.034. If you include minor parts I left out (Finalize inclusive, Initialize inclusive, SetService inclusive, GCHPctmEnv inclusive) you get even closer (<4 seconds unaccounted for).
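The sum above can be checked, and turned into fractions of the total, with a short script. Only the first label is taken from this thread (2242.760 is the GCHPchem Run time quoted in the question); the other labels are placeholders, because the list that named those components is not reproduced here.

```python
# Values from the sum above. Only the first label comes from the thread;
# the remaining labels are placeholders, not real timer names.
components = {
    "GCHPchem Run": 2242.760,
    "component_2 (placeholder label)": 926.209,
    "component_3 (placeholder label)": 391.262,
    "component_4 (placeholder label)": 165.639,
    "component_5 (placeholder label)": 1572.164,
}

total = sum(components.values())
fractions = {name: seconds / total for name, seconds in components.items()}

print(f"sum of listed components: {total:.3f} s")
for name, frac in fractions.items():
    print(f"{name:35s} {100 * frac:5.1f} %")
```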
Thanks @lizziel and @LiamBindle for explaining, this makes it much clearer. The reason I bring this up is because the run results in my particular case ("chemistry" taking 2242s in a total runtime of 5298s) seem to be inconsistent with the results in Eastham et al., 2018 and Bindle et al., 2021, where chemistry is the dominant component.
Looking at your Table 3 in Bindle et al. 2021 (https://gmd.copernicus.org/articles/14/5977/2021/) @LiamBindle may I confirm if you didn't see as much process synchronization time in your run(s) or it was added into the chemistry time? I suppose in the context of MPI it can be argued that synchronization time following chemistry is still chemistry, because some processors are working on chemistry while others are waiting in idle. I was just curious as to how I can interpret this into a breakdown of "chemistry", "dynamics", "data input", and "other" as you did so I can evaluate the speed-up of GCHP when using an optimized version of KPP.
@jimmielin, could you also report number of cores used, grid resolution, simulation, and duration of run?
Hi @lizziel sorry I forgot to report this. Here's my runConfig.sh: it's a standard c24 run for 14 days using 48 cores on huce_cascade.
TOTAL_CORES=48
NUM_NODES=1
NUM_CORES_PER_NODE=48
CS_RES=24
STRETCH_GRID=OFF
STRETCH_FACTOR=2.0
TARGET_LAT=-45.0
TARGET_LON=170.0
Start_Time="20190701 000000"
End_Time="20190715 000000"
Duration="00000014 000000"
# input.geos:
Turn_on_Chemistry=T
Turn_on_Dry_Deposition=T
Turn_on_Wet_Deposition=T
Turn_on_Transport=T
Turn_on_Cloud_Conv=T
Turn_on_PBL_Mixing=T
Turn_on_Non_Local_Mixing=T
#
# HEMCO_Config.rc:
Turn_on_Base_Emissions=true
#---------------------------------------------------------------------
# Diagnostic frequency, duration, and monthly mean
#---------------------------------------------------------------------
AutoUpdate_Diagnostics=ON
# Monthly diagnostics: '0' for off; '1' for on
timeAvg_monthly="0"
# Frequency and duration (ignored if monthly diagnostics on)
timeAvg_freq="060000"
timeAvg_dur="060000"
@jimmielin Yeah, Table 3 would have lumped Run exclusive into Chemistry. If you aren't specifically interested in the load imbalance, I think it's reasonable to count it under chemistry since that's the source of the imbalance.
@jimmielin, I will close this issue now but feel free to reopen if you have more questions.
Ask a question about GCHP:
Hi all,
I'm running some timing tests on GCHP and I'm trying to understand what the timer outputs from GCHP mean. I have some output from the end of the run (see end of this issue description) which gives times for GCHPchem and a breakdown of individual components.
In the breakdown at the end, there's a line which I understand is the wall clock time for the Run routine in the GCHPchem GridComp, which is 2242.760. However, under "Times for GCHPchem", there's another breakdown.
None of these numbers correspond to 2242.760, nor does any combination of the other numbers (0.171 for SetService -- GCHPchem, 66.574 for Initialize -- GCHPchem, and 3.238 for Finalize -- GCHPchem).
My question is: what do the numbers in "Times for GCHPchem" mean? i.e., 1772.221, 2300.730, 2830.601...
Thank you!
Attached: Excerpt from the end-of-run timer output from GCHP 13.3