ajwheeler / Korg.jl

fast 1D LTE stellar spectral synthesis
https://ajwheeler.github.io/Korg.jl/
BSD 3-Clause "New" or "Revised" License

Faster `voigt`, `exp` #80

Open mabruzzo opened 2 years ago

mabruzzo commented 2 years ago

As promised, I'm posting my faster exponential function. I'm curious to hear if using this inside of voigt buys you any speed.

Tonight, I also put together a branchless version of voigt. Unfortunately, all of the extra work that is being done by the branchless voigt makes it slower than the unvectorized version. I suspect that if someone spent a lot of time refactoring, they might be able to get it to be slightly faster. But I don't think that's worth the time. As you had suggested a while back, I think a better use of time would be to replace the current approximation with a better approach.

I did some googling and the consensus seems to be that we should leverage the relationship between the Voigt function and the complementary error function of a complex argument. According to eqn 9 of Thompson (93), V(u,a) = Re[exp(z²) erfc(z)], where z = a + i*u. I suspect that nobody will have time to pursue this for a while, so I'll briefly document what I read. I've seen 2 ways to do this (there are probably more):

  1. I think that Turbospectrum's implementation is based on the same underlying approach as Kuntz (1997). (Interestingly, I think that turbospectrum and the article might be suggesting that for V(a,v), z = a - iv...). Under this approach, I think that they directly approximate the whole expression Re[exp(z²) erfc(z)], but it introduces at least some level of branching. I think this article talks a little about variants of this approach.

  2. At a glance, I suspect it might be slightly faster to adopt a strategy like the one used in this article (if we target a particular accuracy). They seemed to use an approximation for exp(z²) erfc(z) ≈ (Σⱼᵖ aⱼ zʲ)/(z⁽ᵖ⁺¹⁾ + Σⱼᵖ bⱼ zʲ) with coefficients from here, and directly computed exp(z²). The benefit of this approach is that there is basically no branching. However, plugging Re(exp(-1*(x + i*y)^2) * (a + i b)) into wolfram alpha suggests that you also need to evaluate sin and cos (which I think boils down to 2 more polynomial evaluations)...

If either of these approaches is adopted, I believe some care is required to avoid using Julia's complex datatype (since I think ForwardDiff.jl is incompatible with complex numbers).
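For reference, the identity above can be checked directly against SpecialFunctions.jl, whose `erfcx` accepts complex arguments (it computes exp(z²)·erfc(z) via the Faddeeva function). This is only a correctness oracle, not a candidate implementation, since it goes through complex numbers; `voigt_ref` is a name I'm making up for this sketch:

```julia
using SpecialFunctions: erfcx  # erfcx(z) = exp(z^2) * erfc(z), complex-capable

# Reference Voigt-Hjerting function H(a, u) = Re[exp(z^2) erfc(z)]
# using the z = a - i*u convention (as in Kuntz 1997 / Turbospectrum).
voigt_ref(a, u) = real(erfcx(complex(a, -u)))
```

At a = 0 this reduces to exp(-u²), which makes for an easy sanity check.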

mabruzzo commented 2 years ago

Here's the code for the branchless exponential function:

# in the file where you define your branchless exponential function,
# you want to make sure that the machine is little endian. The following
# code does that check for you...
const LITTLE_ENDIAN = (ENDIAN_BOM == 0x04030201)
@assert LITTLE_ENDIAN

# polynomial coefficients in correction were taken from 
# https://gitlab.mpcdf.mpg.de/bbramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp
# (That above repository has an MIT license that needs to be noted somewhere if this is used,
# unless you derive them yourself)
# you can adjust the order of the polynomials to adjust accuracy (see the above link for
# more options).
#
# Some random spot-checks seemed to indicate that the following polynomial is accurate to
# better than 1e-5. But my checks were NOT rigorous 
@inline function correction(x::F) where {F<:AbstractFloat}
    coef_1, coef_2, coef_3, coef_4, coef_5 = (-3.70138142771437266806e-06,
                                               3.07033820309224325662e-01,
                                              -2.41638288055762540107e-01,
                                              -5.16904731562965388814e-02,
                                              -1.36976563343097993558e-02)
    (((((coef_5 * x + coef_4) * x + coef_3) * x) + coef_2) * x) + coef_1
end

# this whole approach is based on
# https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures
const COEFF_LOG2E = log2(ℯ)
const COEFF_A = 2^52
const COEFF_B = 2^52*1023

@inline my_ifelse(cond, a, b) = cond*a + (1-cond) * b

@inline function branchless_exp(x::Float64)
    # inrange and my_ifelse are present purely for error handling.
    # When |x| > 709.7, they make the function return -1
    # Note: This error handling makes the function ~50% slower
    # (but my benchmarks still suggest that it's ~3 times faster
    # than exp)

    inrange = (x^2 < 709.7^2)
    x = my_ifelse(inrange, x, 0) * COEFF_LOG2E
    fractional_part::Float64 = x - floor(x)

    x -= correction(fractional_part)
    casted_integer::Int64 = unsafe_trunc(Int64, COEFF_A * x + COEFF_B)
    my_ifelse(inrange, reinterpret(Float64, casted_integer),-1.0)
end

# keep in mind: to use this function with ForwardDiff, you'll have to 
# define a specialization of `branchless_exp` that takes a DualNumber.
# If you look at dual.jl, within ForwardDiff.jl, they do this for
# functions like `sincospi`

When you broadcast over a call to this function, Julia has a really easy time auto-vectorizing. It would take a little effort to write a version of this function for 32-bit floats.
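For anyone puzzling over the `reinterpret` trick: writing an integer k into the 11-bit exponent field of a Float64 (bit 52 upward) yields exactly 2^(k−1023). That is the core of the method: `COEFF_A * x + COEFF_B` builds exactly such a bit pattern, with the fractional part of x spilling into the mantissa as a piecewise-linear approximation of 2^x that `correction` then polishes. A minimal self-contained illustration (`pow2_bits` is just a name for this sketch):

```julia
# For -1022 <= n <= 1023, placing (n + 1023) in the Float64 exponent field
# (bits 52..62) with a zero mantissa produces exactly 2.0^n.
pow2_bits(n::Integer) = reinterpret(Float64, Int64(n + 1023) << 52)
```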

ajwheeler commented 1 month ago

$\exp(-\tau)$ is also a bottleneck in the radiative transfer calculations, which I have been looking at lately, so I'm returning to this now.

Dumb Q: why x^2 < 709.7^2 and not x < 702.9?

That gitlab repo is down, but I think this is the same: https://gitlab.inria.fr/bramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp

I think the level of precision is kinda on the border of being adequate. Since these values get used in integrals over the (~50) atmospheric layers, you could conceivably accumulate another factor of 100 in error. Below the fold is a version with an additional coefficient that is accurate to $10^{-7}$ (for the values used in the timing tests below). It's about 15-20% slower than your posted version, as expected.

```julia
# in the file where you define your branchless exponential function,
# you want to make sure that the machine is little endian. The following
# code does that check for you.
const LITTLE_ENDIAN = (ENDIAN_BOM == 0x04030201)
@assert LITTLE_ENDIAN

# polynomial coefficients in correction were taken from
# https://gitlab.inria.fr/bramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp
# That project's license is included here:
#
# Copyright (c) 2016 Berenger Bramas - MPCDF
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

@inline function correction(x::F) where {F<:AbstractFloat}
    c1, c2, c3, c4, c5, c6 = (1.06823753710239477000e-07,
                              3.06845249656632845792e-01,
                             -2.40139721982230797126e-01,
                             -5.58662282412822480682e-02,
                             -8.94283890931273951763e-03,
                             -1.89646052380707734290e-03)
    ((((((c6 * x + c5) * x + c4) * x + c3) * x) + c2) * x) + c1
end

# this whole approach is based on
# https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures
const COEFF_LOG2E = log2(ℯ)
const COEFF_A = 2^52
const COEFF_B = 2^52 * 1023

# this will return NaN when X > 702.9
@inline function branchless_exp(x::Float64)
    x *= COEFF_LOG2E
    fractional_part::Float64 = x - floor(x)
    x -= correction(fractional_part)
    casted_integer::Int64 = unsafe_trunc(Int64, COEFF_A * x + COEFF_B)
    reinterpret(Float64, casted_integer)
end

# TODO: method for duals
```
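Regarding the TODO: the derivative of exp is exp, so a Dual method can reuse the fast primal value for the partial. A minimal sketch with a hand-rolled dual type (`MiniDual` and `fast_exp` are illustrative stand-ins, not ForwardDiff API; with ForwardDiff you'd instead add a method for `ForwardDiff.Dual`, as its dual.jl does for functions like `sincospi`):

```julia
# Minimal dual-number sketch of the chain rule you'd register for the fast exp.
struct MiniDual
    val::Float64  # primal value
    der::Float64  # derivative with respect to the input
end

fast_exp(x::Float64) = exp(x)  # stand-in for branchless_exp

# d/dx exp(x) = exp(x): compute the primal once and reuse it for the partial
function fast_exp(d::MiniDual)
    v = fast_exp(d.val)
    MiniDual(v, v * d.der)
end
```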

This is the more precise version with bounds-checking turned off. For the range of $\tau$ values that we get in the RT calculations and in the voigt calculations, all errors are under 1e-7. This is on a MacBook Pro with an M2 Pro, running Julia 1.10.

```julia
ntaus = -(10 .^ (-5:0.0001:2))  # values typical for RT
@btime branchless_exp.($ntaus)  # 51.125 μs (2 allocations: 547.05 KiB)
@btime exp.($ntaus)             # 205.333 μs (2 allocations: 547.05 KiB)
v2s = 0:0.0001:5                # values typical for voigt
@btime branchless_exp.($v2s)    # 154.250 μs (2 allocations: 390.80 KiB)
@btime exp.($v2s)               # 197.917 μs (2 allocations: 390.80 KiB)
```

It seems like the gains are bigger for RT, but substantial for both. Am I missing any reason why we should not start using this?

(For posterity, because I had to google this: essentially everything is little-endian nowadays.)

mabruzzo commented 1 month ago

I thought this was going to be a quick response, but it turns out I have a lot of thoughts on the matter. If the tone sounds at all negative, my apologies -- that's definitely not my intent. I was just trying to get all my thoughts down on the page as quickly as possible.

Additionally, when I started writing this up, I was writing it assuming that vectorized code was the ultimate goal. But as I've thought about this more, you probably won't achieve that unless you make a bunch of other changes (at least inside the voigt profile). So in the short term, we are probably focused on performance improvements within non-vectorized code (which you may still achieve due to the removal of branches).

Let me also be clear, all of this is based on intuition. It's possible my intuition may be totally wrong (since it's been a while). Hopefully, my response sounds somewhat coherent and is at least somewhat useful...

Dumb Q: why x^2 < 709.7^2 and not x < 702.9?

I have no idea why I proposed 709.7 instead of 702.9... If nothing else, my version should probably be corrected to x^2 < 702.9^2.

I think there are essentially 3 logically equivalent options for implementing the logic I was going after (assuming that there are no NaN values):

  1. (-702.9 < x) & (x < 702.9): this is composed of 2 comparisons and a bitwise AND.
  2. abs(x) < 702.9: this is composed of 1 comparison and 1 abs evaluation.
  3. x^2 < 702.9^2: because the JIT constant-folds 702.9^2, this is really just x^2 < constant at runtime, which is composed of 1 multiplication and 1 comparison.
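A quick sketch confirming that the three checks agree (assuming no NaNs, and using the 702.9 bound from the list above; the function names are made up for this sketch):

```julia
inrange_and(x) = (-702.9 < x) & (x < 702.9)  # 2 comparisons + bitwise AND
inrange_abs(x) = abs(x) < 702.9              # 1 abs + 1 comparison
inrange_sq(x)  = x^2 < 702.9^2               # 1 multiply + 1 comparison

for x in (-1.0e6, -702.9, -1.0, 0.0, 1.0, 702.89, 702.9, 1.0e6)
    @assert inrange_and(x) == inrange_abs(x) == inrange_sq(x)
end
```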

Choosing between these options is a micro-optimization and the optimal choice is fuzzy. I was probably considering vectorization at the time. In that context, the best choice may differ based on the precise CPU architecture and may depend on the effectiveness of the LLVM auto-vectorizer.

We can talk through this a little:

All of this is to say that I chose x^2 < 702.9^2 since it seemed like it would be most likely to work well on most architectures. But it's certainly possible that something else may be a little better (my intuition about all of this may be wrong).

If you knew that code-paths within the calling function ALWAYS ensured that the value was non-positive or non-negative, you could obviously get an additional speed-up.

This is the more precise version with bounds-checking turned off. For the range of values that we get in the RT calculations and in the voigt calculations, all errors are under 1e-7. This is on a MacBook Pro with an M2 Pro, running Julia 1.10. ... It seems like the gains are bigger for RT, but substantial for both. Am I missing any reason why we should not start using this?

I would have 2 concerns: correctness and benchmarking accuracy.

Your comment says that this will return NaN when X > 702.9. How sure are you about that?

About the benchmarking itself:

My memory is a little fuzzy, but based on my past experience (with older Julia versions), I think I needed to provide more hints to convince the optimizer to auto-vectorize. I suspect your exp-approximation is probably not vectorized (which again is very encouraging if it's faster than the normal exponential -- since you could view it as a lower performance bound).
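For reference, the usual way to give the optimizer those hints is an explicit loop annotated with `@inbounds @simd` rather than a broadcast (a sketch; `vmap!` is a made-up helper name, and `f` stands in for `branchless_exp`):

```julia
# Apply f elementwise with annotations that encourage LLVM auto-vectorization:
# @inbounds drops bounds checks, @simd permits reordering of the loop body.
function vmap!(f, out::AbstractVector{Float64}, x::AbstractVector{Float64})
    @inbounds @simd for i in eachindex(out, x)
        out[i] = f(x[i])
    end
    out
end
```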

(For posterity, because I had to google this: essentially everything is little-endian nowadays.)

For further posterity, I know PowerPC was big-endian... IBM Power9 is a related architecture (I'm not sure of its endianness) that's still around and used in clusters (like Summit), and I've actually used it quite a bit this past year. Plus I think certain architectures can be configured to work with either endianness or mixed endianness?

ajwheeler commented 1 month ago

I have to apologize, I had a dumb simultaneous thinko and typo that slightly derailed this. The 709.7 vs 702.9 thing was just me being bad at typing, and the question was really about squaring both sides of the inequality. My mind was so fixed on nonnegative physical quantities that I forgot about the possibility of negative numbers for a moment.

ajwheeler commented 1 month ago

Regarding the comment saying that the function will return NaN when out of bounds: I'm not sure and will change it to say that it's undefined.

In the voigt profile context, the argument to $\exp$ is guaranteed always to be in the range $(-25, 0]$. In the RT context, it can in principle be an arbitrarily large nonnegative number. However, I think it may be possible to efficiently ensure that it is always less than some reasonable value. Cutting off RT calculations at $\tau=10$ or so will probably save enough operations to be worth potentially screwing up vectorization in some places.
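A sketch of that cutoff (`safe_transmission` is a name I'm making up here): clamping $\tau$ at 10 bounds the truncation error in the transmission by $e^{-10} \approx 4.5\times 10^{-5}$, and keeps the argument of the (fast) exp in a known range:

```julia
# Transmission exp(-tau), with tau clamped so the exp argument stays in a
# known range; the truncation error is at most exp(-10) ≈ 4.5e-5.
safe_transmission(τ::Float64) = exp(-min(τ, 10.0))
```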

mabruzzo commented 1 month ago

Regarding the comment saying that the function will return NaN when out of bounds: I'm not sure and will change it to say that it's undefined.

Yeah, having just now spent a few minutes looking at the Wikipedia article about the double-precision representation, I have a slightly better appreciation. For posterity:

In the voigt profile context, the argument to exp is guaranteed always to be in the range (−25, 0]. In the RT context, it can in principle be an arbitrarily large nonnegative number. However, I think it may be possible to efficiently ensure that it is always less than some reasonable value. Cutting off RT calculations at τ=10 or so will probably save enough operations to be worth potentially screwing up vectorization in some places.

Yeah, if you have hard guarantees about the arguments falling into a certain range, then by all means, drop the bounds-check. (I would just be very explicit about it in comments)