ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Improve performance by turning off calls to expensive heat stress indices by default #316

Closed. olyson closed this issue 5 years ago.

olyson commented 6 years ago

Sean has recently noted that the HumanIndexMod accounts for a substantial part of CLM5's computational burden. Bill Sacks had noted this back in August of 2016. I've included a record of our discussions on this topic below. To summarize, we had decided to turn off the expensive wet bulb calculation and associated heat stress indices by default through a namelist parameter. I had provided a test case for timing purposes (simply commenting out the subroutine calls) but evidently didn't follow up beyond that. The test case noted below still exists in my work directory.
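The mechanism being discussed is a namelist switch that guards the expensive subroutine calls so they are skipped unless explicitly requested. Below is a minimal, self-contained Fortran sketch of that guard pattern; the program, file, and variable names (heat_stress_switch_demo, heat_stress.nml, calc_expensive_indices) are placeholders rather than the actual CLM namelist variables or HumanIndexMod routines, and the loop is only a stand-in for the iterative wet bulb calculation.

```fortran
program heat_stress_switch_demo
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  logical  :: calc_expensive_indices = .false.   ! expensive indices off by default
  integer  :: nlunit, ios, i
  real(r8) :: work
  namelist /heat_stress_nml/ calc_expensive_indices

  ! Read the switch from a namelist file if one is present;
  ! otherwise keep the default (off).
  open(newunit=nlunit, file='heat_stress.nml', status='old', &
       action='read', iostat=ios)
  if (ios == 0) then
     read(nlunit, nml=heat_stress_nml)
     close(nlunit)
  end if

  if (calc_expensive_indices) then
     ! Stand-in for the iterative wet bulb and the indices that depend on it.
     work = 0.0_r8
     do i = 1, 1000000
        work = work + sin(real(i, r8))
     end do
     print *, 'expensive heat stress indices computed, checksum =', work
  else
     print *, 'expensive heat stress indices skipped (default)'
  end if
end program heat_stress_switch_demo
```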

Email thread:

Hi Erik,

Here are the source mods where I tested the speedup of the heat stress module by commenting out the wet bulb calculation and the associated indices. As we discussed, this could be namelist controlled.

/glade/p/work/oleson/clm4_5_11_r189/cime/scripts/SMS_Lm1.f09_g16_gl5.IG1850CRUCLM50BGC.yellowstone_intel.clm-clm50KitchenSink.160821-130538/SourceMods/src.clm

Thanks,

Keith

On Mon, Aug 22, 2016 at 10:10 AM, David Lawrence dlawren@ucar.edu wrote: I like option 2. I think we want to keep some heat stress indices in the normal output by default, but cutting down on the number would be beneficial from computational and storage perspectives.

Dave

On Sun, Aug 21, 2016 at 7:07 PM, Bill Sacks sacks@ucar.edu wrote: Hi Keith,

Thanks a lot for your detailed investigation of this!

Note that this test is dominated by i/o time, since it does daily output by default. So I wouldn't make too much of the quantitative performance changes (unless you subtracted out the i/o time from the timing files), but the results are probably valid qualitatively. Any of your suggestions seem reasonable to me, so I'm happy to defer to others like you who have a better sense of the pros & cons of the different solutions.

Thanks, Bill

On Aug 21, 2016, at 2:02 PM, Keith Oleson oleson@ucar.edu wrote:

I haven't heard anything more about this, so I ran some tests using the test Bill had been using (SMS_Lm1.f09_g16_gl5.IG1850CRUCLM50BGC.yellowstone_intel.clm-clm50KitchenSink).

Even though, because of a bug, clm4_5_9_r186 doesn't actually have PHS on when standard cases are set up, PHS does seem to be on during this test, so that wouldn't be the source of the slowdown.

I get about a 5% decrease in performance going from clm4_5_9_r186 to clm4_5_11_r189 for that particular test (using yrs/day as the metric for this 1-month test).

Using clm4_5_11_r189:

I get about a 6.5% increase in performance by turning the heat stress indices completely off. This case is then 1.2% faster than clm4_5_9_r186.

I get about a 4.3% increase in performance by setting convergence = 0.01 and max_iter = 100. This case is then only about 0.8% slower than clm4_5_9_r186. The changes in wet bulb and associated indices are on the order of a few hundredths of a degree, with no instances of non-convergence in this admittedly short test.

I get about a 6% increase in performance by eliminating the calls to wetbulb and associated indices. This case is then 0.8% faster than clm4_5_9_r186.

So, I see at least three options:

  1. Turn off heat stress indices completely.

  2. Turn off wet bulb and associated indices. This would still give us a wet bulb calculation ("Stull" version; the wet bulb by itself is used frequently as a heat stress index) and a few other commonly used heat stress indices.

  3. Find other ways of reducing the computational cost of the wet bulb calculation besides relaxing the convergence criterion and reducing the maximum iterations.

I would prefer option 2: it gives us a few commonly used indices by default, including a non-computationally-intensive wet bulb calculation (sketched below), and results in a slight speedup over even clm4_5_9_r186. This could be namelist controlled (sorry, another one). It would also be better for production runs because it actually eliminates 24 history fields, since there are three variable types associated with each variable (rural, urban, and grid-cell average).
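For reference, the "Stull" wet bulb that option 2 would retain comes from a single closed-form expression (Stull 2011, J. Appl. Meteor. Climatol.), which is why it is cheap: no iteration, just a handful of intrinsic calls per evaluation. The sketch below is a stand-alone illustration of that formula, not the HumanIndexMod code; it takes temperature in deg C and relative humidity in percent and is intended for conditions near standard sea-level pressure.

```fortran
program stull_wet_bulb_demo
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  ! Example: roughly 13.7 deg C for T = 20 deg C, RH = 50%.
  print *, 'Tw(20 C, 50%) =', wet_bulb_stull(20.0_r8, 50.0_r8), ' deg C'

contains

  pure function wet_bulb_stull(t_c, rh_pct) result(tw_c)
    ! Stull (2011) closed-form wet bulb: t_c in deg C, rh_pct in percent,
    ! intended for conditions near standard sea-level pressure.
    real(r8), intent(in) :: t_c, rh_pct
    real(r8) :: tw_c
    tw_c = t_c * atan(0.151977_r8 * sqrt(rh_pct + 8.313659_r8))        &
         + atan(t_c + rh_pct) - atan(rh_pct - 1.676331_r8)             &
         + 0.00391838_r8 * rh_pct**1.5_r8 * atan(0.023101_r8 * rh_pct) &
         - 4.686035_r8
  end function wet_bulb_stull

end program stull_wet_bulb_demo
```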

The increase in maximum iterations from 2 to 10000 was associated with a different way of solving the equations, a more brute-force but more robust method that Jonathan implemented. But setting convergence = 0.01 and max_iter = 100 is probably acceptable for most applications.
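To make the convergence/max_iter trade-off concrete, the sketch below solves for the wet bulb with a simple bisection of the psychrometric relation e_s(Tw) - gamma*(T - Tw) = e. It is not the scheme Jonathan implemented in HumanIndexMod; it only illustrates how the tolerance and the iteration cap bound the work per call (for a bisection-style solver, loosening the tolerance from 0.00001 to 0.01 saves about ten bisection steps). The Magnus saturation vapor pressure and the psychrometric coefficient used here are textbook values, not CLM constants.

```fortran
program wet_bulb_iteration_demo
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: convergence = 0.01_r8   ! K; loosened from 0.00001
  integer,  parameter :: max_iter    = 100       ! reduced from 10000
  real(r8) :: tw
  integer  :: niter

  call wet_bulb_iter(30.0_r8, 40.0_r8, 101325.0_r8, convergence, max_iter, tw, niter)
  print *, 'wet bulb (deg C) =', tw, '  iterations =', niter

contains

  pure function esat_hpa(t_c) result(es)
    ! Magnus saturation vapor pressure over water (hPa), t_c in deg C.
    real(r8), intent(in) :: t_c
    real(r8) :: es
    es = 6.112_r8 * exp(17.62_r8 * t_c / (243.12_r8 + t_c))
  end function esat_hpa

  subroutine wet_bulb_iter(t_c, rh_pct, p_pa, tol, itmax, tw, niter)
    ! Bisection on the psychrometric relation e_s(Tw) - gam*(T - Tw) = e.
    ! The loop stops once the bracket is narrower than tol or after itmax
    ! steps, which is exactly the cost knob that convergence/max_iter control.
    real(r8), intent(in)  :: t_c, rh_pct, p_pa, tol
    integer,  intent(in)  :: itmax
    real(r8), intent(out) :: tw
    integer,  intent(out) :: niter
    real(r8) :: e_hpa, gam, lo, hi, f
    integer  :: it

    e_hpa = rh_pct / 100.0_r8 * esat_hpa(t_c)   ! actual vapor pressure (hPa)
    gam   = 6.67e-4_r8 * (p_pa / 100.0_r8)      ! psychrometric constant (hPa/K)
    lo    = t_c - 60.0_r8                       ! lower bracket, well below T
    hi    = t_c                                 ! wet bulb never exceeds T
    niter = itmax
    do it = 1, itmax
       tw = 0.5_r8 * (lo + hi)
       f  = esat_hpa(tw) - gam * (t_c - tw) - e_hpa
       if (f > 0.0_r8) then
          hi = tw                               ! root lies below the midpoint
       else
          lo = tw                               ! root lies above the midpoint
       end if
       if (hi - lo < tol) then
          niter = it
          exit
       end if
    end do
  end subroutine wet_bulb_iter

end program wet_bulb_iteration_demo
```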

Keith

On Fri, Aug 19, 2016 at 10:24 AM, Keith Oleson oleson@ucar.edu wrote: Just another thought. Bill is comparing r189 to r186. At r186, PHS was supposedly on. But it actually wasn't on until r187. Do the timing numbers for r186 actually reflect PHS being on?

Keith

On Fri, Aug 19, 2016 at 10:13 AM, Keith Oleson oleson@ucar.edu wrote: If it's the heat stress indices causing the problem, then the cause could be the new wet bulb iteration. The wet bulb temperature is needed for several of the heat stress indices. You could relax the convergence criterion and see if that makes a difference; it is currently set to convergence = 0.00001_r8, and I think you could try 0.01_r8.

Alternatively, you could turn off the wet bulb calculation by default and turn off the heat indices that require that (discomfort index, temperature humidity index, and the swamp cooling efficiency).

I implemented the streams following the example of other streams that were already in there. Do we see comparable slowdowns when we turn on, e.g., the LAI stream?

Keith

On Fri, Aug 19, 2016 at 9:54 AM, David Lawrence dlawren@ucar.edu wrote: Is it really the human stress indices that are causing the slowdown, or is it something else in the new urban stuff (the new streams file?)? There are a lot of stress indices; that seems like too many to me, but maybe there is a good rationale for using all of them. If we don't need all of them, could we turn on just the 'best' one by default and then turn on the others for specific projects that might need them?

On Fri, Aug 19, 2016 at 9:43 AM, Erik Kluzek erik@ucar.edu wrote: Hi Bill

Thanks for pointing that out. I didn't look into timing issues, really just based on the assumption that the changes weren't large enough to matter.

The human stress indices are on for clm5 and off for clm45. We could turn them off by default for more cases. Should we do that? I also suspect that some of this could be sped up.

So, as an easy change, we could change the default for whether it's on or not. I can file a bug about it being slow and get to that later. Does that sound like what we should do?

Erik Kluzek, (CGD at NCAR)


On Fri, Aug 19, 2016 at 9:01 AM, Bill Sacks sacks@ucar.edu wrote: Hi Erik,

I noticed that this test:

ERP_P15x2_Lm36.f10_f10.ICRUCLM50BGCCROP.yellowstone_intel.clm-cropMonth_interp

takes a bit over 2 hours in my branch off of r189, whereas it took just ~103 minutes in r186.

A similar CLM45 test:

ERP_P15x2_Lm36.f10_f10.ICLM45BGCCROP.yellowstone_intel.clm-irrig_o3_reduceOutput

takes almost the exact same amount of time in r186 and r189.

A test at the production resolution:

SMS_Lm1.f09_g16_gl5.IG1850CRUCLM50BGC.yellowstone_intel.clm-clm50KitchenSink

takes 7% longer in my branch off of r189 than it did in r186... but if you subtract the i/o time (which is the majority of the time in this test), the total CLM computation time is about 12% longer.

I'm wondering if this slowdown may be due in part to the new urban diagnostics that (as I understand it) are now on by default? From a glance at the code, it looks like they involve some expensive operations.

Big culprits I see in the timing files are the following:

New:

    urbantvdyn_strd_adv_total  893400  600   0.130  0.001   0.147 (494)   0.128 ( 66)
    bgflux                     893400  600   2.810  0.327   3.883 (529)   1.921 (254)
    canflux                    893400  600  17.024  2.545  24.263 ( 21)  11.629 (  7)

vs

Old:

    bgflux                     893400  600   0.566  0.059   0.791 (529)   0.420 (254)
    canflux                    893400  600  13.177  2.219  18.920 (179)   8.584 (492)

and

New:

    uflux                      893400  600   3.844  1.266   9.333 (284)   0.682 ( 97)
    bgplake                    893400  600   0.375  0.120   0.759 (492)   0.062 (296)

vs

Old:

    uflux                      893400  600   0.306  0.075   0.575 (284)   0.108 (564)
    bgplake                    893400  600   0.087  0.020   0.143 (140)   0.033 ( 32)

If this is known & expected, then it's okay. But I thought I'd point it out since the ChangeLogs don't mention anything about expected timing increases.

Bill



billsacks commented 5 years ago

@ekluzek @olyson I think this is done, right? Can we close this issue?

ekluzek commented 5 years ago

Yes, we now have the ability to run the heat stress indices at three different levels: all the way off, with only the expensive ones off, or with everything on.
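For anyone looking for the switch, the level is selected through user_nl_clm. A sketch of the setting is below; the variable name and allowed values shown here are an assumption and should be checked against the CTSM namelist documentation for the model version in use.

```fortran
! user_nl_clm sketch; the variable name and the allowed values are an
! assumption and should be verified against the CTSM namelist definition.
calc_human_stress_indices = 'FAST'   ! 'NONE' = no heat stress indices
                                     ! 'FAST' = skip the expensive iterative
                                     !          wet-bulb-based indices
                                     ! 'ALL'  = compute everything
```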