Inefficient coding in P3

quantheory commented 11 months ago

So, @ambrad pointed me to #1722 the other day, and I realized that a lot of the cost of the P3 rain sedimentation is probably in the many get_rain_dsd2 calls in the rain fall velocity calculations (which is a prerequisite for the lookup table stuff). There's a lot of inefficient stuff in there!

First, note these lines: https://github.com/E3SM-Project/scream/blob/f412870ca05da69fa26698aa22bb47b4fc24ef20/components/eamxx/src/physics/p3/impl/p3_dsd2_impl.hpp#L100-L107 In this case, inv_dumdoesn't actually seem to be used in the current version, so I'm hoping that the compiler is able to optimize out its calculation completely. But also, since mu_r is currently treated as a constant, inv_dum is just a constant divided by lamr. So calculating the cube root twice would be unnecessary anyway if cbrt((mu_r + 3) * (mu_r + 2) * (mu_r + 1) / 6.) was calculated once at initialization, and the value saved for all further calculations.

Second, just below that: https://github.com/E3SM-Project/scream/blob/f412870ca05da69fa26698aa22bb47b4fc24ef20/components/eamxx/src/physics/p3/impl/p3_dsd2_impl.hpp#L123-L127 This looks like it is just a translation from the Fortran version of P3, but the original P3 code itself is very weird. This code is calculating the mathematical equivalent of:

nr[s] = lamr[s]*lamr[s]*lamr[s] * qr[s] / ((mu_r+1) * (mu_r+2) * (mu_r+3) * cons1);

That's it; none of the exp, log, or tgamma calls are necessary here. Also, you can see that the denominator is an expression that has already been used above, so if you really wanted to optimize, that could be saved and reused.

Third, right below that, you have: https://github.com/E3SM-Project/scream/blob/f412870ca05da69fa26698aa22bb47b4fc24ef20/components/eamxx/src/physics/p3/impl/p3_dsd2_impl.hpp#L130-L132 If mu_r is a constant, then tgamma(mu_r + 1.) can be calculated at initialization and saved, thus removing tgamma from this performance-critical code. Furthermore, the second line can be reduced to:

logn0r.set(qr_gt_small, log10(cdistr) + (mu_r + 1) * log10(lamr));

This reduces the number of log10 calls by one. Even better, the calculation of these variables could be split out into a separate function from get_rain_dsd2 entirely, since I do not believe cdistr or logn0r are needed in the sedimentation loop at all.

I am not planning to make any of these changes myself, at least right now. However, since the above optimizations should collectively remove all of the transcendental function calls from the performance-critical sedimentation loop (leaving only a single cube root and a few arithmetic operations), it might be worth it for someone to try to implement these changes, and see if there are any similar improvements are to be found in the ice code.

(Of course these optimizations also are, or will be, in the modified P3 code we're developing as part of our SciDAC project.)

bartgol commented 11 months ago

Nice. Looks like the most noticeable impact would come from avoiding trascendental math functions calls. Also, if we allow considering that mu_r=1, you could save another log in

logn0r.set(qr_gt_small, log10(cdistr) + (mu_r + 1) * log10(lamr));

by doing

logn0r.set(qr_gt_small, log10(cdistr*lamr*lamr));

quantheory commented 11 months ago

@bartgol Yep, you're right, although I believe mu_r is always set to 0 currently, so that would be:

logn0r.set(qr_gt_small, log10(cdistr*lamr));

Though I'm not fond of mixing code that uses mu_r as an input with code that assumes that mu_r has a particular value. It makes it unclear whether mu_r is supposed to be adjustable or not.

If we wanted to only support mu_r == 0 going forward, it would probably be best to simply remove it entirely, e.g. also replacing (mu_r + 1) * (mu_r + 2) * (mu_r + 3) with just the number 6, and similar measures taken wherever mu_r appears.

bartgol commented 11 months ago

I see static constexpr Scalar mu_r_const = 1.0; in the physics constants file. But whatever.

I agree that if something is meant to be tunable or somewhat changed, we should not "compile it away" in the code, so perhaps my previous comment should be disregarded.

quantheory commented 11 months ago

@bartgol Ah, thanks for letting me know. Now that I've double-checked, I realize that EAMv3 has retuned this value to 0, but SCREAM is using 1, which is probably the source of my confusion.

E3SM-Project / scream

Inefficient coding in P3 #2565