NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

performance issues and topics #386

Open rgknox opened 6 years ago

rgknox commented 6 years ago

As per discussion in #379, we have been seeing decreased performance in our fates simulations.

Let's use this space to document information that helps elucidate the performance degradation, and to discuss places where the code could be improved or has suffered a regression.

rgknox commented 6 years ago

Running list of things that may be causing slowness:

  1. Simply more cohorts and patches in our simulations (we can change the maximums allowed)
  2. More PFTs. More PFTs make for larger arrays and larger loops in canopy structure and radiation calculations
  3. There is a new procedure in PlantGrowth that seeks to keep DBH and structural biomass on-allometry. Without this, numerical integration truncation errors make the plant slowly drift off allometry, but I'm not sure if this routine is really worth it. See https://github.com/NGEET/fates/blob/master/biogeochem/FatesAllometryMod.F90#L1778 (no discernible difference in simulation with/without structure reset)
  4. During canopy promotion and demotion, we have some iterative while loops. These (for myriad reasons) could be having a hard time converging. I have a new PR that removes the need for these loops and calculates demotion and promotion to a much higher precision. This may speed-up the code and also solve #378. See https://github.com/rgknox/fates/tree/rgknox-area-fixes
  5. Lots of checks. We have so many "ifs" in the code checking for problems that this really slows things down. We can't use pre-processing flags to remove these, but we could do an audit of the checks.
  6. Features. We ask the code to do more things and to decide between more options, and deciding between options slows things down.
  7. Allometry. Now that allometry is more "modular", there are some purposefully redundant calls. For instance, to determine how much storage carbon there is, it will automatically calculate leaf biomass, even if the code already knows the value. The reason I kept them in there is that it ensures things are calculated correctly. (A sketch of one way to pass in an already-known value follows this list.)
  8. The code that calculates our history outputs is getting bulkier and bulkier with more variables and dimensions, and some of these are updated every model time-step. Even though we may not write out all of these history variables, we still have to calculate and allocate them all.
  9. Changes in stomatal conductance causing the temperature iteration to take longer to converge.
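
One way to avoid the redundant call in item 7, while keeping the defensive recomputation as the default, would be to let callers pass in values they already hold. A minimal sketch, with hypothetical names, a stub allometry, and made-up coefficients (not the actual FatesAllometryMod interface):

module allom_reuse_sketch
   ! Sketch only: lets callers that already know leaf biomass pass it in, so the
   ! storage allometry does not recompute it; falls back to the redundant call otherwise.
   use iso_fortran_env, only : r8 => real64
   implicit none
   private
   public :: bstore_allom_opt

contains

   subroutine bstore_allom_opt(dbh, ipft, canopy_trim, bstore, bleaf_known)
      real(r8), intent(in)           :: dbh          ! diameter at breast height [cm]
      integer,  intent(in)           :: ipft         ! plant functional type index
      real(r8), intent(in)           :: canopy_trim  ! crown trimming fraction [0-1]
      real(r8), intent(out)          :: bstore       ! target storage biomass [kgC]
      real(r8), intent(in), optional :: bleaf_known  ! leaf biomass, if the caller has it

      real(r8) :: bleaf

      if (present(bleaf_known)) then
         bleaf = bleaf_known                                   ! reuse the caller's value
      else
         call bleaf_allom_stub(dbh, ipft, canopy_trim, bleaf)  ! redundant recomputation
      end if

      bstore = 0.5_r8 * bleaf   ! placeholder relationship; the real code uses PFT parameters
   end subroutine bstore_allom_opt

   subroutine bleaf_allom_stub(dbh, ipft, canopy_trim, bleaf)
      ! Stand-in for the real leaf-biomass allometry.
      real(r8), intent(in)  :: dbh, canopy_trim
      integer,  intent(in)  :: ipft
      real(r8), intent(out) :: bleaf
      bleaf = 0.05_r8 * dbh**1.5_r8 * canopy_trim
   end subroutine bleaf_allom_stub

end module allom_reuse_sketch
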
rgknox commented 6 years ago

I've also been able to get DDT (a debugger and profiler) working on Cheyenne. I think it would be a good idea to profile the code.

rosiealice commented 6 years ago

What about changes in the stomatal conductance causing the temperature iteration to take longer to converge?

rgknox commented 6 years ago

thanks! added to my list

rosiealice commented 6 years ago

Cool. (re: DDT...)

jenniferholm commented 6 years ago

Hi all, Regarding the slowness issue with the new canopy conductance bug fixes, it looks like at least one main reason is the number of PFTs indexed. When running the same case with 3 PFT indices instead of 14, the run is back to its regular speed (well, maybe only a tiny bit slower: 8 minutes vs. ~5-6 minutes for one simulation year). The 14-PFT-index case was taking ~35 minutes for one year.

ckoven commented 6 years ago

@rgknox if your profiling shows option 8 (history bloat) on your list as being important, maybe we should make a global switch called something like 'history_detail_level', with values of 1-5, that turns the calculations on and off along the lines of the following? (A rough sketch appears after the list.)

  1. calculates only the bulk values
  2. adds a handful of 1-D structured values
  3. adds the full set of 1-D structured values
  4. adds a handful of 2-D or larger structured values
  5. adds everything we've got possible now
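
If profiling does show the history code as a hotspot, a minimal sketch of such a switch (hypothetical names, assuming the value is read once from a namelist at initialization) could look like:

module hist_detail_sketch
   ! Illustrative only: "hist_detail_level" is a proposed switch, not an existing
   ! FATES namelist variable.  A single integer, set once at initialization,
   ! gates progressively more expensive history calculations.
   implicit none
   private
   public :: do_hist_group

   integer, public :: hist_detail_level = 5  ! 1 = bulk only ... 5 = everything

contains

   logical function do_hist_group(required_level)
      ! .true. if the configured detail level is high enough for this group of variables
      integer, intent(in) :: required_level
      do_hist_group = (hist_detail_level >= required_level)
   end function do_hist_group

end module hist_detail_sketch

The update block for, say, the larger multi-dimensional variables would then be wrapped in something like "if (do_hist_group(4)) then ... end if".
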
rgknox commented 6 years ago

Thanks for that info @jenniferholm , that is very helpful.

@ckoven, I think you're onto something. We could add "group" tags to the output variables, and if all of the members of a group are turned off, then we forgo allocations and calculations for that group. This would also minimize the number of logical evaluations, since we would be evaluating groups rather than individual variables.

Let's try to get a sense of how important it is; maybe profiling will show the history is not that burdensome.
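
A minimal sketch of the group idea (hypothetical names, not the current FatesHistoryInterfaceMod API): each history variable carries a group tag, a group is marked active at initialization if any of its members is requested for output, and the expensive loops are skipped when the whole group is inactive.

module hist_group_sketch
   implicit none
   private
   public :: set_group_use, group_is_active

   integer, parameter, public :: group_bulk = 1, group_multiplexed = 2, n_groups = 2

   logical :: group_active(n_groups) = .false.

contains

   subroutine set_group_use(var_group, var_requested)
      ! Called once per history variable during initialization.
      integer, intent(in) :: var_group      ! group tag carried by the variable object
      logical, intent(in) :: var_requested  ! is this variable on an active history stream?
      if (var_requested) group_active(var_group) = .true.
   end subroutine set_group_use

   logical function group_is_active(var_group)
      ! Gate for the calculation/allocation blocks belonging to this group.
      integer, intent(in) :: var_group
      group_is_active = group_active(var_group)
   end function group_is_active

end module hist_group_sketch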

jenniferholm commented 6 years ago

Hi all, I was wondering if we could circle back around to this decreased-performance issue. Could we maybe split up the to-do tasks from Ryan's list above (plus the additions from Charlie and Rosie) among us, to help improve performance? (If you are interested, of course!)

I'm highly interested, because global runs, with multiple PFTs are pretty expensive and slow now. So I have some motivation to help with this. Is anybody else running global simulations, or single site with >3 PFTs since the new stomatal conductance bug fix and having speed issues? @jkshuman @huitang-earth ?

@rgknox - do you have suggestions from your list of which items would lead to the greatest performance increase? Since this didn't happen before the conductance bug fix, I can look into the temperature iteration and whether it takes longer to converge when there are larger arrays with more PFTs. I can also help with option 5, but I'm not sure how critical that one is.

walkeranthonyp commented 6 years ago

Sounds like it could be a little of all of the above. Profiling would certainly be a good idea, especially if you have the capacity to do that.

Generally photosynthesis can be pretty expensive because it is calculated so many times (as you all know). Being wrapped in the canopy numerical solve, it's a numerical solve within a numerical solve. Because of this, any reduction in the runtime of a single photosynthesis solve could have potentially large speed-up benefits. I'm curious, @jenniferholm: did you run the 14-PFT case prior to the rb bugfix? Is the relative slowdown equivalent to the 3-PFT case, or is there some compound issue where many PFTs and the bugfix together cause a greater relative slowdown?

@rgknox could you briefly outline the rb bug, its fix, and the result? I did see some of the discussion after the fact but I didn't catch what the problem was and what effect the fix had. @ckoven mentioned it on the modelling call the other day and if I remember right the fix resulted in increased transpiration. I guess that means the fix resulted in lower rb. Does this also mean that GPP increased?

ckoven commented 6 years ago

@walkeranthonyp the recent bugfix I referred to on the call was #379. The change reduces rather than increases transpiration (because canopy conductance was previously too high by a factor of LAI).

rgknox commented 6 years ago

I'm going to take a look at item 3 (to see whether forcing plants to stay tightly on allometry is worth it)

rgknox commented 5 years ago

re item 3: I ran two simulations with tropical evergreens at BCI for 30 years. One simulation used master 91785 and the other removed the calls that reset dbh to match structure. There was no discernible difference in wall time, and the results were nearly identical (528 vs. 521 seconds per day).

In fact, the master actually ran slightly faster. Here are some comparisons.

reset-test_plots.pdf

rgknox commented 5 years ago

@walkeranthonyp, the fixes to canopy conductance were submitted via pull request #379. In summary, we had originally thought that the host land models expected FATES to pass them canopy resistance per unit leaf area, but we realized that it should just be the mean canopy resistance.

https://github.com/NGEET/fates/pull/379/files

See the diffs in the subroutine ScaleLeafLayerFluxToCohort(); that is where conductance is calculated and weighted by leaf area for each cohort. We then normalize by total canopy leaf area to get a mean.

https://github.com/NGEET/fates/pull/379
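
For readers following along, a rough sketch of the normalization described above (hypothetical names and simplified bookkeeping; the actual code is in the PR linked above):

subroutine mean_canopy_conductance(gs_layer, lai_layer, gs_mean)
   ! Sketch only: leaf-area-weighted mean of the layer conductances, i.e.
   ! sum(g * lai) over layers normalized by total leaf area, mirroring the
   ! weighting-then-normalizing described in the comment above.
   use iso_fortran_env, only : r8 => real64
   implicit none
   real(r8), intent(in)  :: gs_layer(:)   ! conductance of each leaf layer
   real(r8), intent(in)  :: lai_layer(:)  ! leaf area index of each leaf layer
   real(r8), intent(out) :: gs_mean       ! leaf-area-weighted mean conductance

   real(r8) :: lai_total

   lai_total = sum(lai_layer)
   if (lai_total > 0._r8) then
      gs_mean = sum(gs_layer * lai_layer) / lai_total
   else
      gs_mean = 0._r8
   end if
end subroutine mean_canopy_conductance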

walkeranthonyp commented 5 years ago

Thanks @rgknox and @ckoven for pointing me to the previous discussion. I think I'm up-to-date (on that at least).

WRT performance issues, I'm looking into the photosynthesis calculation side of things. As noted by @everpassenger in #379, calculating 40 canopy layers for every canopy photosynthesis calculation is expensive! I've been checking this out with the rudimentary version of canopy MAAT. It's not set up exactly like FATES, but it does have a multi-layer canopy, sun/shade and diffuse/direct partitioning, plus a number of other assumptions. (Figure: canopy_layers_par.) The figure shows the difference in instantaneous canopy assimilation compared with a 40-layer canopy calculation at different PAR levels. Colours are 10-50 layers in steps of 10, tested with Medlyn or Ball gs, and LAI of 3, 5, and 7.

A rough guide is a drop in A of 2-5% for a 20-layer canopy and 1-2% for a 30-layer canopy. It's roughly the same across various CO2 and VPD levels. The question is how much accuracy is sufficient. It's unlikely that in the real world the canopy variables and parameters follow a completely smooth differential anyway.

On another tack: I think I have an idea for an analytical solution that includes a non-zero rb. Actually, one already exists, but it's a cubic and no one really knows which root to use. I think I can come up with a quadratic solution; I just need to find a little time to implement and test it.
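
For context, the coupled system in standard notation (e.g. Collatz et al. 1991; Baldocchi 1994; not FATES's exact formulation) is roughly:

\begin{align}
  A   &= \frac{k_1\,(c_i - \Gamma_*)}{c_i + k_2} - R_d
       && \text{(biochemical demand, hyperbolic in } c_i\text{)} \\
  A   &= \frac{g_s}{1.6}\,(c_s - c_i) \;=\; \frac{g_b}{1.37}\,(c_a - c_s)
       && \text{(diffusive supply, } g_b = 1/r_b\text{)} \\
  g_s &= m\,\frac{A\,h_s}{c_s} + b_0
       && \text{(Ball--Berry closure)}
\end{align}

Eliminating c_i, c_s, and g_s gives a cubic in A when g_b is finite; in the limit r_b -> 0 (so c_s -> c_a and h_s -> h_a), the same elimination reduces the system to a quadratic in A.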

rosiealice commented 5 years ago

We shouldn't be doing the calculations for 40 layers always though, right? Only if LAI is actually 40? As in, nlevleaf is the upper bound for that array size, to cope with weird conditions and/or parameter combinations? Or at least, that's how it is/was supposed to work :)

walkeranthonyp commented 5 years ago

Ah OK, I was going on what was said before in these threads (plus I remember a comment you made, @rosiealice, on a call about Gordon thinking 40 layers was what was needed to properly approximate that integral). That'll teach me not to look at the code in depth.

In any case, we have the tools to investigate how many layers we want to simulate.

Also, I want to get it out there that I think an analytical, or semi-analytical, solution to the coupled photosynthesis equations is achievable, and that it would save a good amount of time.

rgknox commented 3 years ago

A place to improve efficiency: we currently allocate some largish arrays using the maxpft global constant; some examples here:

https://github.com/NGEET/fates/blob/8827a6ea77f0e4f5d8408ddd7eeaa776bb3bba8a/main/EDTypesMod.F90#L415-L443

We already do dynamic allocation of various patch-level arrays; we could easily change these to use the actual number of PFTs instead of the maximum.
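
A minimal sketch of what that change could look like (hypothetical type and component names; numpft stands in for the runtime PFT count):

module patch_alloc_sketch
   ! Sketch only: size patch-level arrays by the number of PFTs actually in use
   ! rather than by the compile-time maxpft constant.
   use iso_fortran_env, only : r8 => real64
   implicit none
   private
   public :: patch_sketch_type

   type :: patch_sketch_type
      ! was, e.g.,  real(r8) :: some_pft_array(maxpft)
      real(r8), allocatable :: some_pft_array(:)
   contains
      procedure :: init => init_patch_sketch
   end type patch_sketch_type

contains

   subroutine init_patch_sketch(this, numpft)
      class(patch_sketch_type), intent(inout) :: this
      integer, intent(in) :: numpft   ! PFTs actually used in this run

      allocate(this%some_pft_array(numpft))
      this%some_pft_array(:) = 0._r8
   end subroutine init_patch_sketch

end module patch_alloc_sketch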

rgknox commented 2 years ago

In the following piece of code, we do some pretty laborious loops for diagnostics at the model timestep.

A typical cohort loop (right now) maxes out at 100-150 entities. This loop, assuming 12 PFTs, 3 canopy layers, and 30 leaf layers, would be 12 × 3 × 30 = 1080.

https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L3683-L3686

The solution, short of wrapping the whole loop structure in some kind of hard-coded logical, would require us to determine whether any of these variables are being sent to history. I've looked into this before, but I don't remember finding where the history field list is saved.

@ekluzek @billsacks

billsacks commented 2 years ago

Interesting. I could see how this could be useful in other places, too.

You could get at this information from histFileMod, but I don't know if there will be an easy way to get the information you want. Some issues I see are:

I thought a bit about how we could handle this in a way that avoids these issues. My initial thought would be to create a module like this:

module HistExpensiveDiagnostics

  ! This module defines an enumeration and a logical array for tracking whether a
  ! particular set of expensive diagnostics need to be calculated.
  !
  ! There are two parts to using this:
  !
  ! (1) In a hist_addfld call for a diagnostic that falls into one of these expensive
  ! diagnostic categories, add the optional argument "expensive_diagnostic", with value
  ! set to one of the expensive_diagnostics_* integer values defined below. For example:
  !
  !    hist_addfld1d(..., expensive_diagnostic = expensive_diagnostics_foo)
  !
  ! (2) At the point where this diagnostic variable is calculated, wrap the calculation
  ! in a conditional like:
  !
  !    if (diagnostics_needed(expensive_diagnostics_foo)) then ...

  implicit none
  private

  integer, parameter, public :: expensive_diagnostics_fates_something = 1
  integer, parameter, public :: expensive_diagnostics_something_else  = 2
  integer, parameter, public :: expensive_diagnostics_max             = 2

  logical, public :: diagnostics_needed(expensive_diagnostics_max) = .false.  ! set to .true. in histFileMod when a flagged field is active

end module HistExpensiveDiagnostics

Then you would need to add some infrastructure in histFileMod that stores the expensive_diagnostic argument for each history variable based on its value (if present) in the addfld call. Then, when a variable is actually added to a history stream, if it has this expensive_diagnostic field set, then the relevant entry in diagnostics_needed (which would have been initialized to .false. everywhere) is set to .true.

This might not be the best design, so happy to hear alternatives. I think it wouldn't be too hard to implement: it would mainly be a matter of making sure you find all the relevant places that need to be changed in histFileMod (as I recall, there are a number of blocks of near-duplicated code in that module, so you probably need to make similar changes in a few places).

rgknox commented 2 years ago

Thanks for your ideas, @billsacks. On the FATES side, all of our history variables are defined as objects. Each object has a unique integer identifier, which is used to find or identify the variable when it comes time to use it.

Definition call for one of our expensive variables: https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L5110-L5114
Alias call: https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L3492
Usage: https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L3711-L3712

I feel we could add your diagnostics_needed as a component of the object. EDIT: Or...it might be better as an array like you say, because attaching it as a scalar logical to the object would prevent us from doing checks on groups of variables. I think I like the array method better...

The string matching would only need to be done once at initialization, and to save time we would only do the matching on variables that are flagged as expensive, so it shouldn't be too bad. I think we have some string-matching utilities lying around here somewhere.
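
A rough sketch of that one-time matching pass (hypothetical names; it assumes the list of fields requested on the history streams can be obtained from the host at initialization):

subroutine init_expensive_diagnostics(requested_fields, expensive_names, &
                                      expensive_ids, diagnostics_needed)
   ! Sketch only: match the names of variables flagged as expensive against the
   ! fields actually requested on the history streams, and set the corresponding
   ! diagnostics_needed flags.  Done once at initialization.
   implicit none
   character(len=*), intent(in)  :: requested_fields(:)  ! fields active on history streams
   character(len=*), intent(in)  :: expensive_names(:)   ! names of the expensive variables
   integer,          intent(in)  :: expensive_ids(:)     ! index of each in diagnostics_needed
   logical,          intent(out) :: diagnostics_needed(:)

   integer :: i

   diagnostics_needed(:) = .false.
   do i = 1, size(expensive_names)
      ! Character comparison blank-pads the shorter string, so fixed-length
      ! name arrays with trailing blanks are handled.
      if (any(requested_fields == expensive_names(i))) then
         diagnostics_needed(expensive_ids(i)) = .true.
      end if
   end do
end subroutine init_expensive_diagnostics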

ckoven commented 2 years ago

see also #859

rosiealice commented 7 months ago

@rgknox reports that the two-stream model is slightly slower than the Norman code. This is due to the cohort-scale resolution of the new radiation calculations.

@ckoven noted that the number of cohorts could be halved by using the deterministic rather than the stochastic sorting mode. @rgknox notes that for gridded runs, the speed of the model is perhaps most related to the speed of the slowest or most complicated grid cell.

If this is true, then the best performance increases should come from reducing the maximum number of patches or cohorts, rather than from changing the mean number.