geoschem / HEMCO

The Harmonized Emissions Component (HEMCO), developed by the GEOS-Chem Support Team.
https://hemco.readthedocs.io

[BUG/ISSUE] Significant performance degradation on multiple platforms affecting GEOS-Chem Classic and GCHP #57

Open · LiamBindle opened this issue 3 years ago

LiamBindle commented 3 years ago

There appears to be a significant performance hit in GEOS-Chem Classic and GCHP on some platforms, particularly those using gfortran, stemming from somewhere in HEMCO. This issue has been observed by @lizziel, @jimmielin, @WilliamDowns, and myself.

The four of us just finished a call discussing the issue and how to proceed. Some notes from that meeting are below.

Zoom call notes - HEMCO slowdown call [2020-10-16 Fri 10:00]

- Will has seen the slowdown in calc emis (Get current emissions)
- Haipeng had something similar with get emis: an OpenMP loop becomes a serial loop; removing the subroutine call (copy-pasting the code directly into the get emis loop) fixed the issue; his slowdown was in boundary layer mixing (~100x)
- Potentially an OpenMP issue
- GCHP still uses OpenMP; only problematic at >1000 cores
- Troubleshooting idea: check if there is a difference in performance when GCHP::HEMCO is built without OpenMP
- Will saw the slowdown in the Get Current Emissions code; some (but not all) emissions containers hang for a while
- Haipeng has seen some containers hang for a while too
- Look at gfortran flags that could give warnings about performance issues
- Can it be reproduced with HEMCO standalone? Lizzie is going to look into this
- Liam will open an issue on HEMCO
- Liam will check if building HEMCO without OpenMP speeds things up
- Will will continue running timing tests and narrowing in on which emissions collections are problematic
- Haipeng will try GEOS-Chem Classic with gfortran and some different compiler settings

Recordings of the issue

Lizzie's internal benchmarks show that HEMCO's performance in GEOS-Chem Classic with gfortran has been deteriorating. Wall times for HEMCO in GEOS-Chem Classic:

| Version | gfortran | ifort |
|---------|----------|-------|
| alpha.4 | 396      | 232   |
| alpha.5 | 976      | 216   |
| alpha.9 | 1732     | 376   |

In some GCHP timing tests that Lizzie and I ran a few weeks ago, I observed a very significant drop-off in GCHP's scaling (see figure below). Note that the solid lines are Lizzie's tests with ifort and the dashed lines are my tests with gfortran. The drop-off in performance was dominated by a big slowdown in HCOI_GC_RUN(...).

[Figure: GCHP throughput vs. core count; solid lines are ifort tests, dashed lines are gfortran tests]

Further information

Action items

We will continue discussing in this thread.

WilliamDowns commented 3 years ago

Moving the entire body of get_current_emissions into hco_calcemis has not resulted in a speed-up (my slowdown continues to occur somewhere in that code). I'll continue running timing tests and pinning down where in there it's occurring.

LiamBindle commented 3 years ago

As @jimmielin suggested, it definitely appears to be related to OpenMP.

Here is my timing test result for GCHP alpha.10 with OpenMP enabled and disabled. These were 2-week timing tests at C48 with 192 cores and gfortran 8.3.

| OpenMP   | Model throughput |
|----------|------------------|
| Enabled  | 113 days/day     |
| Disabled | 163 days/day     |

Here is the same figure I showed above, now including this new timing test with OpenMP disabled (the blue X). Note that the blue X is the same timing test as the solid red point below it, just with OpenMP disabled.

[Figure: GCHP throughput, updated to include the OpenMP-disabled test (blue X)]

There is a 40-90% increase in speed if OpenMP is disabled. I think this confirms it's related to OpenMP!

lizziel commented 3 years ago

Interesting. I'll do my HEMCO standalone tests for both OpenMP enabled and disabled.

lizziel commented 3 years ago

I am seeing significant differences in HEMCO standalone run time due to OpenMP. The differences across compilers are negligible in comparison. For all compilers, disabling OpenMP reduces the run time.

Wall times for a 1-month run:

| Compiler     | OMP=y (24 threads) | OMP=n    |
|--------------|--------------------|----------|
| gfortran 8.3 | 00:19:37           | 00:15:04 |
| gfortran 9.3 | 00:21:32           | 00:14:59 |
| ifort18      | 00:21:15           | 00:17:10 |
| ifort19      | 00:21:41           | 00:16:10 |

All tests used HEMCO branch dev/GC_13.0.0. For the run directory I used GC-Classic fullchem HEMCO_Config.rc and HEMCO_Diagn.rc from GEOS-Chem branch dev/13.0.0.

WilliamDowns commented 3 years ago

For a 6-hour simulation in GCHPctm with 30 cores on 1 node, using gfortran 9.3 and Spack-built OpenMPI 4.0.5, disabling OpenMP yields a total run time of 14m14s. With OpenMP enabled, the same run has not finished after 1h40m of wall time (it is 3h40m into the 6-hour simulation period). Based on a cursory look at the pace of the run outside of the very slow emissions timesteps, my issue may not be unique to HEMCO. I'll have to rerun for more details, as Cannon is about to shut down.

yantosca commented 3 years ago

Has there been any more work on this issue? I am wondering if a combination of !$OMP COLLAPSE() and !$OMP SCHEDULE( DYNAMIC, n ) might help here. I can try to look into this, as I have been doing other profiling tests.
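For reference, a minimal self-contained sketch (not HEMCO code) of what a collapsed, dynamically scheduled OpenMP loop looks like; the grid dimensions and the per-box "work" here are placeholders:

```fortran
PROGRAM collapse_demo
  ! Sketch of an OpenMP loop with COLLAPSE and dynamic scheduling.
  ! NX, NY, and the loop body are placeholders, not actual HEMCO code.
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 72, NY = 46
  REAL(8)            :: emis(NX,NY)
  INTEGER            :: I, J

  ! Collapse the I/J loops into one iteration space and hand out chunks of
  ! 4 iterations dynamically, so threads that get cheap grid boxes pick up
  ! more work instead of idling.
  !$OMP PARALLEL DO            &
  !$OMP DEFAULT( SHARED      ) &
  !$OMP PRIVATE( I, J        ) &
  !$OMP COLLAPSE( 2          ) &
  !$OMP SCHEDULE( DYNAMIC, 4 )
  DO J = 1, NY
  DO I = 1, NX
     emis(I,J) = DBLE( I + J )   ! stand-in for the per-grid-box emissions work
  ENDDO
  ENDDO
  !$OMP END PARALLEL DO

  PRINT *, 'Sum = ', SUM( emis )
END PROGRAM collapse_demo
```

Whether DYNAMIC actually beats the default schedule would have to be confirmed by profiling, since dynamic scheduling adds its own overhead.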

yantosca commented 3 years ago

I have been looking into this. It seems that testing whether the pointers LevDct1 and LevDct2 are associated within routine GetVertIndx can incur a big performance hit. The pointers are associated in GET_CURRENT_EMISSIONS, so they can be tested there, and logical arguments can be passed to GetVertIndx indicating whether the pointers are null or not.

Also, a similar check on whether the PBLHEIGHT and BXHEIGHT fields are in the HcoState can be moved out of routine GetIdx, which is called for each (I,J). I am not sure if this breaks anything, though.

I will do more testing/profiling. At first glance the COLLAPSE does not bring as big a benefit as I had hoped, but cleaning up these pointer-association tests seems to be the way to get more performance here. A sketch of the hoisting pattern is below.
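To illustrate the idea, here is a minimal self-contained sketch (not the actual HEMCO routines; GetVertIndx_Sketch is a hypothetical stand-in) of testing pointer association once, outside the (I,J) loop, and passing the results down as logicals:

```fortran
PROGRAM hoist_assoc_demo
  ! Sketch of hoisting ASSOCIATED() tests out of the (I,J) loop.
  IMPLICIT NONE
  REAL(8), TARGET  :: lev1(10)
  REAL(8), POINTER :: LevDct1(:) => NULL()
  REAL(8), POINTER :: LevDct2(:) => NULL()
  LOGICAL          :: isLevDct1, isLevDct2
  INTEGER          :: I, J

  LevDct1 => lev1                        ! LevDct2 left unassociated on purpose

  ! Hoisted checks: done once per call instead of once per grid box
  isLevDct1 = ASSOCIATED( LevDct1 )
  isLevDct2 = ASSOCIATED( LevDct2 )

  DO J = 1, 46
  DO I = 1, 72
     CALL GetVertIndx_Sketch( I, J, isLevDct1, isLevDct2 )
  ENDDO
  ENDDO

CONTAINS

  SUBROUTINE GetVertIndx_Sketch( I, J, hasLevDct1, hasLevDct2 )
    ! Hypothetical stand-in for GetVertIndx: it receives logicals rather
    ! than re-testing pointer association for every grid box.
    INTEGER, INTENT(IN) :: I, J
    LOGICAL, INTENT(IN) :: hasLevDct1, hasLevDct2
    IF ( hasLevDct1 .AND. hasLevDct2 ) THEN
       CONTINUE                          ! placeholder for the vertical-index work
    ENDIF
  END SUBROUTINE GetVertIndx_Sketch

END PROGRAM hoist_assoc_demo
```

The saving is simply that ASSOCIATED() runs a handful of times per call instead of once per grid box.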

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

lizziel commented 3 years ago

From comment above: "Also a similar check on whether the PBLHEIGHT and BXHEIGHT fields are in the HcoState can be moved out of routine GetIdx, which is called on each (I,J). I am not sure if this breaks anything though."

It turns out this does break things in the new HEMCO intermediate grid update I am bringing in (GEOS-Chem PR 681). Initialization of State_Met%LandTypeFrac previously used HCO_GetPtr. With the update it instead uses HCO_GC_EvalFld, a new wrapper around HCO_EvalFld. With this method the data is retrieved via Get_Current_Emissions where the new checks are outside of the IJ loop.

The problem with this is that during initialization Emissions_Run has only been called with Phase=0. Phase 0 skips the call to GridEdge_Set, and by extension HCO_SetPBLm, resulting in HcoState%Grid%PBLHEIGHT%Val not yet being associated. I don't think the code that uses PBLHEIGHT would ever be reached for the 2D Olson land map, but the new code still checks that it is associated.

I'm considering two different possible fixes:

1. Wrap the BXHEIGHT%Val and PBLHEIGHT%Val associated checks in Get_Current_Emissions so they are only done if isLevDct1 or isLevDct2 is true. For this solution I'm not sure if the error handling would still cover all cases. A sketch of this guard is below.
2. Change the Olson data retrieval back to getting a pointer (@jimmielin, could you comment on why you made this change?).
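A minimal sketch of what option (1) could look like, using mocked-up types in place of the real HcoState%Grid fields (the actual field names and error handling in HEMCO may differ):

```fortran
PROGRAM guard_checks_demo
  ! Sketch of option (1): only verify that the grid fields are associated
  ! when a level dictionary actually needs the vertical information.
  ! MockGrid is a stand-in for HcoState%Grid.
  IMPLICIT NONE

  TYPE MockField
     REAL(8), POINTER :: Val(:,:) => NULL()
  END TYPE MockField

  TYPE MockGrid
     TYPE(MockField) :: BXHEIGHT
     TYPE(MockField) :: PBLHEIGHT
  END TYPE MockGrid

  TYPE(MockGrid) :: Grid
  LOGICAL        :: isLevDct1, isLevDct2
  INTEGER        :: RC

  ! For 2-D data such as the Olson land map, no level dictionary is used,
  ! so the association checks are skipped entirely (Phase 0 safe).
  isLevDct1 = .FALSE.
  isLevDct2 = .FALSE.
  RC        = 0

  IF ( isLevDct1 .OR. isLevDct2 ) THEN
     IF ( .NOT. ASSOCIATED( Grid%BXHEIGHT%Val  ) .OR. &
          .NOT. ASSOCIATED( Grid%PBLHEIGHT%Val ) ) THEN
        RC = -1                           ! placeholder for HEMCO error handling
     ENDIF
  ENDIF

  PRINT *, 'RC = ', RC
END PROGRAM guard_checks_demo
```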

jimmielin commented 3 years ago

Hi Lizzie - I think we can just use HCO_GC_GetPtr instead. I changed it to EvalFld because we were planning to change the GetPtrs to EvalFlds, but it would be unphysical to scale land type fractions anyway. One thing to note with the new intermediate grid updates is that the data has to be copied out of the pointer.

I think simply replacing

```fortran
CALL HCO_GC_EvalFld( Input_Opt, State_Grid, Name, State_Met%LandTypeFrac(:,:,T), RC, FOUND=FND )
```

with a call to `HCO_GC_GetPtr( ... Ptr2D ... )` and then copying the data, e.g.

```fortran
State_Met%LandTypeFrac(:,:,T) = Ptr2D(:,:)
```

should resolve this issue.

lizziel commented 3 years ago

Okay, I will put it back to GetPtr then. I previously kept it as GetPtr rather than change it to EvalFld because, as you say, it doesn't make sense to scale/mask the land mask.

I'm still wondering if we could move those PBLHEIGHT checks elsewhere. They should only need to be done once during a run; one generic way to do that is sketched below.
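For reference, a minimal sketch (not a concrete proposal for specific HEMCO code) of a once-per-run check using a SAVEd flag, with a mock pointer standing in for HcoState%Grid%PBLHEIGHT%Val:

```fortran
PROGRAM once_per_run_demo
  ! Sketch of a check that runs only on the first call and is skipped on
  ! every later timestep. The pointer is a mock stand-in.
  IMPLICIT NONE
  REAL(8), TARGET  :: pbl(72,46)
  REAL(8), POINTER :: PBLHeight(:,:) => NULL()
  INTEGER          :: step, RC

  PBLHeight => pbl                       ! pretend the field has been set up

  DO step = 1, 3                         ! pretend timestep loop
     CALL Check_PBL_Once( PBLHeight, RC )
     PRINT *, 'step', step, ' RC =', RC
  ENDDO

CONTAINS

  SUBROUTINE Check_PBL_Once( Field, RC )
    REAL(8), POINTER, INTENT(IN)  :: Field(:,:)
    INTEGER,          INTENT(OUT) :: RC
    LOGICAL, SAVE :: verified = .FALSE.

    RC = 0
    IF ( verified ) RETURN               ! already passed once; skip the test

    IF ( ASSOCIATED( Field ) ) THEN
       verified = .TRUE.
    ELSE
       RC = -1                           ! placeholder for real error handling
    ENDIF
  END SUBROUTINE Check_PBL_Once

END PROGRAM once_per_run_demo
```

The Phase 0 complication noted above would still apply: the first successful check can only happen after GridEdge_Set has actually associated the field.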

lizziel commented 3 years ago

Does anyone know if this OpenMP problem in HEMCO is still an issue in GEOS-Chem 13? @jimmielin, @WilliamDowns, @LiamBindle

LiamBindle commented 3 years ago

Sorry, I don't. I'm not aware of any updates on it though so I suspect it still exists.

bbakernoaa commented 2 years ago

Has there been any update on this? I'm noticing when running MEGAN offline that it doesn't scale at all with CPUs.