Open mabruzzo opened 2 years ago
Here's the code for the branchless exponential function:
```julia
# In the file where you define your branchless exponential function,
# you want to make sure that the machine is little-endian. The following
# code does that check for you...
const LITTLE_ENDIAN = (ENDIAN_BOM == 0x04030201)
@assert LITTLE_ENDIAN

# Polynomial coefficients in `correction` were taken from
# https://gitlab.mpcdf.mpg.de/bbramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp
# (That repository has an MIT license that needs to be noted somewhere if
# this is used, unless you derive the coefficients yourself.)
# You can adjust the order of the polynomial to adjust accuracy (see the
# above link for more options).
#
# Some random spot-checks seemed to indicate that the following polynomial
# is accurate to better than 1e-5. But my checks were NOT rigorous.
@inline function correction(x::F) where {F<:AbstractFloat}
    coef_1, coef_2, coef_3, coef_4, coef_5 = (-3.70138142771437266806e-06,
                                               3.07033820309224325662e-01,
                                              -2.41638288055762540107e-01,
                                              -5.16904731562965388814e-02,
                                              -1.36976563343097993558e-02)
    ((((coef_5 * x + coef_4) * x + coef_3) * x + coef_2) * x) + coef_1
end

# This whole approach is based on
# https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures
const COEFF_LOG2E = log2(ℯ)
const COEFF_A = 2^52
const COEFF_B = 2^52 * 1023

@inline my_ifelse(cond, a, b) = cond * a + (1 - cond) * b

@inline function branchless_exp(x::Float64)
    # `inrange` and `my_ifelse` are present purely for error handling.
    # When x^2 > 709.7^2, they make the function return -1.0.
    # Note: this error handling makes the function ~50% slower
    # (but my benchmarks still suggest that it's ~3 times faster
    # than exp).
    inrange = (x^2 < 709.7^2)
    x = my_ifelse(inrange, x, 0) * COEFF_LOG2E
    fractional_part::Float64 = x - floor(x)
    x -= correction(fractional_part)
    casted_integer::Int64 = unsafe_trunc(Int64, COEFF_A * x + COEFF_B)
    my_ifelse(inrange, reinterpret(Float64, casted_integer), -1.0)
end

# Keep in mind: to use this function with ForwardDiff, you'll have to
# define a specialization of `branchless_exp` that takes a Dual number.
# If you look at dual.jl, within ForwardDiff.jl, they do this for
# functions like `sincospi`.
```
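A minimal sketch of the ForwardDiff specialization mentioned in that closing comment, assuming ForwardDiff.jl's `Dual`/`value`/`partials` API. Here `fast_exp` is a hypothetical stand-in name, and plain `exp` is used as the scalar kernel so the snippet runs standalone; you would swap in `branchless_exp`:

```julia
# Hedged sketch -- not code from this thread. Since d/dx exp(x) = exp(x),
# a Dual specialization can reuse the scalar kernel for both the value
# and the derivative.
using ForwardDiff: Dual, value, partials

fast_exp(x::Float64) = exp(x)  # stand-in for branchless_exp

@inline function fast_exp(d::Dual{T}) where {T}
    v = fast_exp(value(d))
    # value is v; each partial is scaled by the derivative, also v
    Dual{T}(v, v * partials(d))
end
```

This mirrors how dual.jl defines unary functions, but treat the exact constructor usage as an assumption to verify against the ForwardDiff version you depend on.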
When you broadcast over a call to this function, Julia has a really easy time auto-vectorizing. It would take a little effort to write a version of this function for 32-bit floats.
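For readers puzzling over `COEFF_A` and `COEFF_B`: the trick from the linked paper relies on the IEEE-754 double layout, where the exponent field occupies bits 52–62 with a bias of 1023. A quick sanity check (my illustration, not from the original post):

```julia
# Writing the integer (1023 + k) * 2^52 into a Float64's bits sets the
# biased exponent field to 1023 + k (with a zero mantissa), which decodes
# to exactly 2.0^k. branchless_exp builds such an integer via
# COEFF_A * x + COEFF_B, with the fractional part of x spilling into the
# mantissa bits (hence the `correction` polynomial).
for k in (-2, 0, 1, 10)
    bits = Int64(1023 + k) << 52
    @assert reinterpret(Float64, bits) == 2.0^k
end
```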
$\exp(-\tau)$ is also a bottleneck in the radiative transfer calculations, which I have been looking at lately, so I'm returning to this now.
Dumb Q: why `x^2 < 709.7^2` and not `x < 702.9`?
That gitlab repo is down, but I think this is the same: https://gitlab.inria.fr/bramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp
I think the level of precision is kinda on the border of being adequate. Since these values get used in integrals over the (~50) atmospheric layers, you could conceivably accumulate another factor of 100 in error. Below the fold is a version with an additional coefficient that is accurate to $10^{-7}$ (for the values used in the timing tests below). It's about 15-20% slower than your posted version, as expected.
This is the more precise version with bounds-checking turned off. For the range of $\tau$ values that we get in the RT calculations, and in the voigt calculations, all errors are under 1e-7. This is on a MacBook Pro with an M2 Pro with Julia 1.10.
```julia
ntaus = -(10 .^ (-5:0.0001:2)) # values typical for RT
@btime branchless_exp.($ntaus) # 51.125 μs (2 allocations: 547.05 KiB)
@btime exp.($ntaus)            # 205.333 μs (2 allocations: 547.05 KiB)

v2s = 0:0.0001:5 # values typical for voigt
@btime branchless_exp.($v2s)   # 154.250 μs (2 allocations: 390.80 KiB)
@btime exp.($v2s)              # 197.917 μs (2 allocations: 390.80 KiB)
```
It seems like the gains are bigger for RT, but substantial for both. Am I missing any reason why we should not start using this?
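For reference, an accuracy spot-check of the kind described above might look like the following (my sketch; `crude_exp` is a deliberately bad stand-in just to show the shape of the check, and you would substitute `branchless_exp`):

```julia
# Maximum relative error of an approximation `f` against Base.exp over a
# grid of inputs.
max_rel_err(f, xs) = maximum(abs(f(x) - exp(x)) / abs(exp(x)) for x in xs)

# Example with a truncated Taylor series as the approximation under test:
crude_exp(x) = 1 + x + x^2 / 2 + x^3 / 6
err = max_rel_err(crude_exp, -0.1:0.001:0.1)
```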
(For posterity, because I had to google this: essentially everything is little-endian nowadays.)
I thought this was going to be a quick response, but it turns out I have a lot of thoughts on the matter. If the tone sounds at all negative, my apologies -- that's definitely not my intent. I was just trying to get all my thoughts down on the page as quickly as possible.
Additionally, when I started writing this up, I was writing it assuming that vectorized code was the ultimate goal. But as I've thought about this more, you probably won't achieve that unless you make a bunch of other changes (at least inside the voigt-profile). So in the short term, we are probably focused on performance improvements within non-vectorized code (which you may still achieve due to the removal of branches).
Let me also be clear: all of this is based on intuition. It's possible my intuition may be totally wrong (since it's been a while). Hopefully my response sounds somewhat coherent and is at least somewhat useful...
Dumb Q: why `x^2 < 709.7^2` and not `x < 702.9`?

I have no idea why I proposed 709.7 instead of 702.9... If nothing else, my version should probably be corrected to `x^2 < 702.9^2`.
I think there are essentially 3 logically equivalent options for implementing the logic I was going after (assuming that there are no `NaN` values):

1. `(-702.9 < x) & (x < 702.9)`: this is composed of 2 comparisons and a bitwise-and.
2. `abs(x) < 702.9`: this is composed of 1 comparison and 1 `abs` evaluation.
3. `x^2 < 702.9^2`: because of Julia JIT magic this is really just `x^2 < constant` at runtime, which is composed of 1 multiplication and 1 comparison.

Choosing between these options is a micro-optimization and the optimal choice is fuzzy. I was probably considering vectorization at the time. In that context, the best choice may differ based on the precise CPU architecture, and it may depend on the effectiveness of the LLVM auto-vectorizer.
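The three variants above can be sanity-checked for equivalence directly (my illustration, assuming finite inputs as the list does):

```julia
# The three logically-equivalent range checks from the list above.
check1(x) = (-702.9 < x) & (x < 702.9)  # 2 comparisons + bitwise-and
check2(x) = abs(x) < 702.9              # 1 comparison + 1 abs
check3(x) = x^2 < 702.9^2               # 1 multiply + 1 comparison
                                        # (702.9^2 is constant-folded)
for x in (-1000.0, -702.9, -1.5, 0.0, 3.25, 702.9, 1e6)
    @assert check1(x) == check2(x) == check3(x)
end
```

Inspecting each with `@code_llvm` would show which one the JIT lowers most cleanly on a given architecture.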
We can talk through this a little. For option 2, the compiler can probably turn `abs` into a vectorized bitwise operation. We may need to figure out how to tell it to explicitly assume that `x` is not `NaN` or `Inf` to achieve the desired result (I don't think a bitwise-and could be used to implement `abs(NaN)`, but I'm not sure). Even if we figured that out, it still may not do what we want (auto-vectorization is notoriously difficult).

All of this is to say that I chose `x^2 < 702.9^2` since it seemed like it would be most likely to work best on most architectures. But it's certainly possible that something else may be a little better (my intuition about all of this may be wrong).
If you knew that code-paths within the calling function ALWAYS ensured that the value was non-positive or non-negative, you could obviously get an additional speed-up.
This is the more precise version with bounds-checking turned off. For the range of values that we get in the RT calculations, and in the voigt calculations, all errors are under 1e-7. This is on a MacBook Pro with an M2 Pro with Julia 1.10. ... It seems like the gains are bigger for RT, but substantial for both. Am I missing any reason why we should not start using this?
I would have 2 concerns: correctness and benchmarking accuracy.
Your comment says that this will return `NaN` when `x > 702.9`.

- How completely sure are you about that? And what about `x < -702.9`?
- I'm not convinced it actually returns `NaN` (I could totally be wrong of course! I don't remember exactly what trick we're using).
- If it can fail to return `NaN`, I would be concerned about non-`NaN` values appearing and silently invalidating the calculation.

About the benchmarking itself:
My memory is a little fuzzy, but based on my past experience (with older Julia versions), I think I needed to provide more hints to convince the optimizer to auto-vectorize. I suspect your exp-approximation is probably not vectorized (which again is very encouraging if it's faster than the normal exponential -- since you could view it as a lower performance bound).
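For what it's worth, the usual way to give the optimizer those hints (a sketch of the idea, not something benchmarked in this thread) is an explicit loop with `@inbounds @simd` writing into a preallocated output; `fast_exp` below is a placeholder name for any branch-free scalar kernel like `branchless_exp`:

```julia
# Broadcasting allocates and may or may not vectorize; an explicit
# @inbounds @simd loop over a preallocated output gives LLVM the best
# shot at emitting SIMD instructions.
fast_exp(x) = exp(x)  # placeholder so this snippet runs standalone

function fast_exp!(out::Vector{Float64}, xs::Vector{Float64})
    @assert length(out) == length(xs)
    @inbounds @simd for i in eachindex(xs)
        out[i] = fast_exp(xs[i])
    end
    out
end
```

Checking the emitted code with `@code_llvm fast_exp!(out, xs)` is the only way to confirm the auto-vectorizer actually fired.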
(For posterity, because I had to google this: essentially everything is little-endian nowadays.)
For further posterity, I know PowerPC was big-endian... IBM Power9 is a related architecture (I'm not sure of its endianness) that's still around and used in clusters (like Summit), and I've actually used it quite a bit this past year. Plus I think certain architectures can be configured to work with either endianness, or mixed endianness?
I have to apologize, I had a dumb simultaneous thinko and typo that slightly derailed this. The `709.7` vs `702.9` thing was just me being bad at typing, and the question was really about squaring both sides of the inequality. My mind was so fixed on nonnegative physical quantities that I forgot about the possibility of negative numbers for a moment.
Regarding the comment saying that the function will return NaN when out of bounds: I'm not sure and will change it to say that it's undefined.
In the voigt profile context, the argument to $\exp$ is guaranteed always to be in the range $(-25, 0]$. In the RT context, it can in principle be an arbitrarily large nonnegative number. However, I think it may be possible to efficiently ensure that it is always less than some reasonable value. Cutting off RT calculations at $\tau=10$ or so will probably save enough operations to be worth potentially screwing up vectorization in some places.
Regarding the comment saying that the function will return NaN when out of bounds: I'm not sure and will change it to say that it's undefined.
Yeah, having just now spent a few minutes looking at the wikipedia article about double-precision representation, I have a slightly better appreciation. For posterity, let `transform(x) = COEFF_A * (x - correction(x - floor(x))) + COEFF_B`:

- When `x > 709.7`, the behavior depends a little on just how much bigger `x` is.
  - If `transform(x) < 2^63`, then a 64-bit signed integer can always represent the nearest integral value of `transform(x)` (i.e. `unsafe_trunc` has well defined behavior) and I believe that the output is always a `NaN`.
  - If `transform(x) >= 2^63`, `unsafe_trunc` returns an arbitrary value and all bets are off.
- When `x < -709.7`, I think you might end up with 0. But, I'm not quite sure.

In the voigt profile context, the argument to exp is guaranteed always to be in the range (−25, 0]. In the RT context, it can in principle be an arbitrarily large nonnegative number. However, I think it may be possible to efficiently ensure that it is always less than some reasonable value. Cutting off RT calculations at τ=10 or so will probably save enough operations to be worth potentially screwing up vectorization in some places.
Yeah, if you have hard guarantees about the arguments falling into a certain range, then by all means, drop the bounds-check. (I would just be very explicit about it in comments)
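As a footnote to the bit-representation discussion above: the all-ones exponent field (2047) is exactly the `Inf`/`NaN` region of the double-precision encoding, which is consistent with overflowed encodings decoding to non-finite values. A quick spot-check (my illustration, not from the thread):

```julia
# Exponent field 2047 with zero mantissa decodes to Inf; any nonzero
# mantissa decodes to NaN. So when COEFF_A * x + COEFF_B lands in the
# 2047 band, the reinterpret step yields a non-finite Float64.
@assert reinterpret(Float64, Int64(2047) << 52) == Inf
@assert isnan(reinterpret(Float64, (Int64(2047) << 52) | 1))
```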
As promised, I'm posting my faster exponential function. I'm curious to hear if using this inside of `voigt` buys you any speed.

Tonight, I also put together a branchless version of `voigt`. Unfortunately, all of the extra work that is being done by the branchless voigt makes it slower than the unvectorized version. I suspect that if someone spent a lot of time refactoring, they might be able to get it to be slightly faster. But I don't think that's worth the time. As you had suggested a while back, I think a better use of time would be to replace the current approximation with a better approach.

I did some googling and the consensus seems to be that we should leverage the relationship of the Voigt function with the real part of a complex error function. According to eqn 9 of Thompson (93), `V(u,a) = Re[exp(z²) erfc(z)]`, where `z = a + i*u`. I suspect that nobody will have time to pursue this for a while, so I'll briefly document what I read. I've seen 2 ways to do this (there are probably more):

1. I think that Turbospectrum's implementation is based on the same underlying approach as Kuntz (1997). (Interestingly, I think that Turbospectrum and the article might be suggesting that for `V(a,v)`, `z = a - i*v`...). Under this approach, I think that they directly approximate the whole expression `Re[exp(z²) erfc(z)]`, but it introduces at least some level of branching. I think this article talks a little about variants of this approach.

2. At a glance, I suspect it might potentially be slightly faster to adopt a strategy like the one used in this article (if we target a particular accuracy). They seemed to use an approximation for `exp(z²) erfc(z) ≈ (Σⱼᵖ aⱼ zʲ)/(z⁽ᵖ⁺¹⁾ + Σⱼᵖ bⱼ zʲ)` with coefficients from here, and directly computed `exp(z²)`. The benefit of this approach is that there is basically no branching. However, plugging `Re(exp(-1*(x + i*y)^2) * (a + i*b))` into Wolfram Alpha suggests that you also need to evaluate sin and cos (which I think boils down to 2 more polynomial evaluations)...

If either of these approaches are adopted, I believe some care is required to avoid using Julia's complex datatype (since I think ForwardDiff.jl is incompatible with complex numbers).
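The sin/cos observation in option 2 can be written out without Julia's `Complex` type (my sketch; symbols follow the Wolfram Alpha expression above):

```julia
# Re[exp(-(x + i*y)^2) * (a + i*b)] in purely real operations:
# exp(-(x+iy)^2) = exp(y^2 - x^2) * (cos(2xy) - i*sin(2xy)), so the
# real part of the product is exp(y^2 - x^2) * (a*cos(2xy) + b*sin(2xy)).
function re_exp_times(x, y, a, b)
    s, c = sincos(2 * x * y)  # one fused sin/cos evaluation
    exp(y^2 - x^2) * (a * c + b * s)
end
```

Keeping everything in real arithmetic like this is what would sidestep the ForwardDiff.jl incompatibility with complex numbers.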