ajwheeler / Korg.jl

fast 1D LTE stellar spectral synthesis
https://ajwheeler.github.io/Korg.jl/
BSD 3-Clause "New" or "Revised" License

Faster `voigt`, `exp` #80

Open mabruzzo opened 2 years ago

mabruzzo commented 2 years ago

As promised, I'm posting my faster exponential function. I'm curious to hear if using this inside of voigt buys you any speed.

Tonight, I also put together a branchless version of voigt. Unfortunately, all of the extra work that is being done by the branchless voigt makes it slower than the unvectorized version. I suspect that if someone spent a lot of time refactoring, they might be able to get it to be slightly faster. But I don't think that's worth the time. As you had suggested a while back, I think a better use of time would be to replace the current approximation with a better approach.

I did some googling and the consensus seems to be that we should leverage the relationship between the Voigt function and the complementary error function of a complex argument. According to eqn 9 of Thompson (93), V(u,a) = Re[exp(z²) erfc(z)], where z = a + i*u. I suspect that nobody will have time to pursue this for a while, so I'll briefly document what I read. I've seen 2 ways to do this (there are probably more):

  1. I think that Turbospectrum's implementation is based on the same underlying approach as Kuntz (1997). (Interestingly, I think that turbospectrum and the article might be suggesting that for V(a,v), z = a - iv...). Under this approach, I think that they directly approximate the whole expression Re[exp(z²) erfc(z)], but it introduces at least some level of branching. I think this article talks a little about variants of this approach.

  2. At a glance, I suspect it might be slightly faster to adopt a strategy like the one used in this article (if we target a particular accuracy). They seemed to use an approximation for exp(z²) erfc(z) ≈ (Σⱼᵖ aⱼ zʲ)/(z⁽ᵖ⁺¹⁾ + Σⱼᵖ bⱼ zʲ) with coefficients from here, and directly computed exp(z²). The benefit of this approach is that there is basically no branching. However, plugging Re(exp(-1*(x + i*y)^2) * (a + i b)) into wolfram alpha suggests that you also need to evaluate sin and cos (which I think boils down to 2 more polynomial evaluations)...

If either of these approaches is adopted, I believe some care is required to avoid using Julia's complex datatype (since I think ForwardDiff.jl is incompatible with complex numbers).
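For reference, the identity above can be checked directly against SpecialFunctions.jl, whose `erfcx` accepts complex arguments (it computes exp(z²)·erfc(z) via the Faddeeva function). This is only a correctness oracle, not a candidate implementation, since it goes through complex numbers; `voigt_ref` is a name I'm making up for this sketch:

```julia
using SpecialFunctions: erfcx  # erfcx(z) = exp(z^2) * erfc(z), complex-capable

# Reference Voigt-Hjerting function H(a, u) = Re[exp(z^2) erfc(z)]
# using the z = a - i*u convention (as in Kuntz 1997 / Turbospectrum).
voigt_ref(a, u) = real(erfcx(complex(a, -u)))
```

At a = 0 this reduces to exp(-u²), which makes for an easy sanity check.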

mabruzzo commented 2 years ago

Here's the code for the branchless exponential function:

# in the file where you define your branchless exponential function,
# you want to make sure that the machine is little endian. The following
# code does that check for you...
const LITTLE_ENDIAN = (ENDIAN_BOM == 0x04030201)
@assert LITTLE_ENDIAN

# polynomial coefficients in correction were taken from 
# https://gitlab.mpcdf.mpg.de/bbramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp
# (That above repository has an MIT license that needs to be noted somewhere if this is used,
# unless you derive them yourself)
# you can adjust the order of the polynomials to adjust accuracy (see the above link for
# more options).
#
# Some random spot-checks seemed to indicate that the following polynomial is accurate to
# better than 1e-5. But my checks were NOT rigorous 
@inline function correction(x::F) where {F<:AbstractFloat}
    coef_1, coef_2, coef_3, coef_4, coef_5 = (-3.70138142771437266806e-06,
                                               3.07033820309224325662e-01,
                                              -2.41638288055762540107e-01,
                                              -5.16904731562965388814e-02,
                                              -1.36976563343097993558e-02)
    (((((coef_5 * x + coef_4) * x + coef_3) * x) + coef_2) * x) + coef_1
end

# this whole approach is based on
# https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures
const COEFF_LOG2E = log2(ℯ)
const COEFF_A = 2^52
const COEFF_B = 2^52*1023

@inline my_ifelse(cond, a, b) = cond*a + (1-cond) * b

@inline function branchless_exp(x::Float64)
    # inrange and my_ifelse are present purely for error handling.
    # When |x| > 709.7, they make the function return -1
    # Note: This error handling makes the function ~50% slower
    # (but my benchmarks still suggest that it's ~3 times faster
    # than exp)

    inrange = (x^2 < 709.7^2)
    x = my_ifelse(inrange, x, 0) * COEFF_LOG2E
    fractional_part::Float64 = x - floor(x)

    x -= correction(fractional_part)
    casted_integer::Int64 = unsafe_trunc(Int64, COEFF_A * x + COEFF_B)
    my_ifelse(inrange, reinterpret(Float64, casted_integer),-1.0)
end

# keep in mind: to use this function with ForwardDiff, you'll have to 
# define a specialization of `branchless_exp` that takes a DualNumber.
# If you look at dual.jl, within ForwardDiff.jl, they do this for
# functions like `sincospi`

When you broadcast over a call to this function, Julia has a really easy time auto-vectorizing. It would take a little effort to write a version of this function for 32-bit floats.
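For anyone puzzling over the `reinterpret` trick: writing an integer k into the 11-bit exponent field of a Float64 (bit 52 upward) yields exactly 2^(k−1023). That is the core of the method: `COEFF_A * x + COEFF_B` builds exactly such a bit pattern, with the fractional part of x spilling into the mantissa as a piecewise-linear approximation of 2^x that `correction` then polishes. A minimal self-contained illustration (`pow2_bits` is just a name for this sketch):

```julia
# For -1022 <= n <= 1023, placing (n + 1023) in the Float64 exponent field
# (bits 52..62) with a zero mantissa produces exactly 2.0^n.
pow2_bits(n::Integer) = reinterpret(Float64, Int64(n + 1023) << 52)
```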

ajwheeler commented 1 month ago

$\exp(-\tau)$ is also a bottleneck in the radiative transfer calculations, which I have been looking at lately, so I'm returning to this now.

Dumb Q: why x^2 < 709.7^2 and not x < 702.9?

That gitlab repo is down, but I think this is the same: https://gitlab.inria.fr/bramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp

I think the level of precision is kinda on the border of being adequate. Since these values get used in integrals over the (~50) atmospheric layers, you could conceivably accumulate another factor of 100 in error. Below the fold is a version with an additional coefficient that is accurate to $10^{-7}$ (for the values used in the timing tests below). It's about 15-20% slower than your posted version, as expected.

```julia
# in the file where you define your branchless exponential function,
# you want to make sure that the machine is little endian. The following
# code does that check for you.
const LITTLE_ENDIAN = (ENDIAN_BOM == 0x04030201)
@assert LITTLE_ENDIAN

# polynomial coefficients in correction were taken from
# https://gitlab.inria.fr/bramas/inastemp/-/blob/master/Src/Common/InaFastExp.hpp
# That project's license is included here:
#
# Copyright (c) 2016 Berenger Bramas - MPCDF
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

@inline function correction(x::F) where {F<:AbstractFloat}
    c1, c2, c3, c4, c5, c6 = (1.06823753710239477000e-07,
                              3.06845249656632845792e-01,
                             -2.40139721982230797126e-01,
                             -5.58662282412822480682e-02,
                             -8.94283890931273951763e-03,
                             -1.89646052380707734290e-03)
    ((((((c6 * x + c5) * x + c4) * x + c3) * x) + c2) * x) + c1
end

# this whole approach is based on
# https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures
const COEFF_LOG2E = log2(ℯ)
const COEFF_A = 2^52
const COEFF_B = 2^52 * 1023

# this will return NaN when X > 702.9
@inline function branchless_exp(x::Float64)
    x *= COEFF_LOG2E
    fractional_part::Float64 = x - floor(x)
    x -= correction(fractional_part)
    casted_integer::Int64 = unsafe_trunc(Int64, COEFF_A * x + COEFF_B)
    reinterpret(Float64, casted_integer)
end

# TODO: method for duals
```
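Regarding the TODO: the derivative of exp is exp, so a Dual method can reuse the fast primal value for the partial. A minimal sketch with a hand-rolled dual type (`MiniDual` and `fast_exp` are illustrative stand-ins, not ForwardDiff API; with ForwardDiff you'd instead add a method for `ForwardDiff.Dual`, as its dual.jl does for functions like `sincospi`):

```julia
# Minimal dual-number sketch of the chain rule you'd register for the fast exp.
struct MiniDual
    val::Float64  # primal value
    der::Float64  # derivative with respect to the input
end

fast_exp(x::Float64) = exp(x)  # stand-in for branchless_exp

# d/dx exp(x) = exp(x): compute the primal once and reuse it for the partial
function fast_exp(d::MiniDual)
    v = fast_exp(d.val)
    MiniDual(v, v * d.der)
end
```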

This is the more precise version with bounds-checking turned off. For the range of $\tau$ values that we get in the RT calculations and in the voigt calculations, all errors are under 1e-7. This is on a MacBook Pro with an M2 Pro, running Julia 1.10.

```julia
ntaus = -(10 .^ (-5:0.0001:2))  # values typical for RT
@btime branchless_exp.($ntaus)  # 51.125 μs (2 allocations: 547.05 KiB)
@btime exp.($ntaus)             # 205.333 μs (2 allocations: 547.05 KiB)
v2s = 0:0.0001:5                # values typical for voigt
@btime branchless_exp.($v2s)    # 154.250 μs (2 allocations: 390.80 KiB)
@btime exp.($v2s)               # 197.917 μs (2 allocations: 390.80 KiB)
```

It seems like the gains are bigger for RT, but substantial for both. Am I missing any reason why we should not start using this?

(For posterity, because I had to google this: essentially everything is little-endian nowadays.)

mabruzzo commented 1 month ago

I thought this was going to be a quick response, but it turns out I have a lot of thoughts on the matter. If the tone sounds at all negative, my apologies -- that's definitely not my intent. I was just trying to get all my thoughts down on the page as quickly as possible.

Additionally, when I started writing this up, I was writing it assuming that vectorized code was the ultimate goal. But as I've thought about this more, you probably won't achieve that unless you make a bunch of other changes (at least inside the voigt profile). So in the short term, we are probably focused on performance improvements within non-vectorized code (which you may still achieve due to the removal of branches).

Let me also be clear, all of this is based on intuition. It's possible my intuition may be totally wrong (since it's been a while). Hopefully, my response sounds somewhat coherent and is at least somewhat useful...

Dumb Q: why x^2 < 709.7^2 and not x < 702.9?

I have no idea why I proposed 709.7 instead of 702.9... If nothing else, my version should probably be corrected to x^2 < 702.9^2.

I think there are essentially 3 logically equivalent options for implementing the logic I was going after (assuming that there are no NaN values):

  1. (-702.9 < x) & (x < 702.9): this is composed of 2 comparisons and a bitwise AND.
  2. abs(x) < 702.9: this is composed of 1 comparison and 1 abs evaluation.
  3. x^2 < 702.9^2: because the JIT constant-folds 702.9^2, this is really just x^2 < constant at runtime, which is composed of 1 multiplication and 1 comparison.
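A quick sketch confirming that the three checks agree (assuming no NaNs, and using the 702.9 bound from the list above; the function names are made up for this sketch):

```julia
inrange_and(x) = (-702.9 < x) & (x < 702.9)  # 2 comparisons + bitwise AND
inrange_abs(x) = abs(x) < 702.9              # 1 abs + 1 comparison
inrange_sq(x)  = x^2 < 702.9^2               # 1 multiply + 1 comparison

for x in (-1.0e6, -702.9, -1.0, 0.0, 1.0, 702.89, 702.9, 1.0e6)
    @assert inrange_and(x) == inrange_abs(x) == inrange_sq(x)
end
```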

Choosing between these options is a micro-optimization and the optimal choice is fuzzy. I was probably considering vectorization at the time. In that context, the best choice may differ based on the precise CPU architecture and may depend on the effectiveness of the LLVM auto-vectorizer.

We can talk through this a little:

All of this is to say that I chose x^2 < 702.9^2 since it seemed like it would be most likely to work well on most architectures. But it's certainly possible that something else may be a little better (my intuition about all of this may be wrong).

If you knew that code-paths within the calling function ALWAYS ensured that the value was non-positive or non-negative, you could obviously get an additional speed-up.

This is the more precise version with bounds-checking turned off. For the range of values that we get in the RT calculations and in the voigt calculations, all errors are under 1e-7. This is on a MacBook Pro with an M2 Pro, running Julia 1.10. ... It seems like the gains are bigger for RT, but substantial for both. Am I missing any reason why we should not start using this?

I would have 2 concerns: correctness and benchmarking accuracy.

Your comment says that this will return NaN when X > 702.9. How sure are you about that?

About the benchmarking itself:

My memory is a little fuzzy, but based on my past experience (with older Julia versions), I think I needed to provide more hints to convince the optimizer to auto-vectorize. I suspect your exp-approximation is probably not vectorized (which again is very encouraging if it's faster than the normal exponential -- since you could view it as a lower performance bound).
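For reference, the usual way to give the optimizer those hints is an explicit loop annotated with `@inbounds @simd` rather than a broadcast (a sketch; `vmap!` is a made-up helper name, and `f` stands in for `branchless_exp`):

```julia
# Apply f elementwise with annotations that encourage LLVM auto-vectorization:
# @inbounds drops bounds checks, @simd permits reordering of the loop body.
function vmap!(f, out::AbstractVector{Float64}, x::AbstractVector{Float64})
    @inbounds @simd for i in eachindex(out, x)
        out[i] = f(x[i])
    end
    out
end
```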

(For posterity, because I had to google this: essentially everything is little-endian nowadays.)

For further posterity, I know PowerPC was big-endian... IBM Power9 is a related architecture (I'm not sure of its endianness) that's still around and used in clusters (like Summit), and I've actually used it quite a bit this past year. Plus I think certain architectures can be configured to work with either endianness or mixed endianness?

ajwheeler commented 1 month ago

I have to apologize, I had a dumb simultaneous thinko and typo that slightly derailed this. The 709.7 vs 702.9 thing was just me being bad at typing, and the question was really about squaring both sides of the inequality. My mind was so fixed on nonnegative physical quantities that I forgot about the possibility of negative numbers for a moment.

ajwheeler commented 1 month ago

Regarding the comment saying that the function will return NaN when out of bounds: I'm not sure and will change it to say that it's undefined.

In the voigt profile context, the argument to $\exp$ is guaranteed always to be in the range $(-25, 0]$. In the RT context, it can in principle be an arbitrarily large nonnegative number. However, I think it may be possible to efficiently ensure that it is always less than some reasonable value. Cutting off RT calculations at $\tau=10$ or so will probably save enough operations to be worth potentially screwing up vectorization in some places.
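A sketch of that cutoff (`safe_transmission` is a name I'm making up here): clamping $\tau$ at 10 bounds the truncation error in the transmission by $e^{-10} \approx 4.5\times 10^{-5}$, and keeps the argument of the (fast) exp in a known range:

```julia
# Transmission exp(-tau), with tau clamped so the exp argument stays in a
# known range; the truncation error is at most exp(-10) ≈ 4.5e-5.
safe_transmission(τ::Float64) = exp(-min(τ, 10.0))
```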

mabruzzo commented 1 month ago

Regarding the comment saying that the function will return NaN when out of bounds: I'm not sure and will change it to say that it's undefined.

Yeah, having just now spent a few minutes looking at the Wikipedia article about the double-precision representation, I have a slightly better appreciation. For posterity:

In the voigt profile context, the argument to exp is guaranteed always to be in the range (−25, 0]. In the RT context, it can in principle be an arbitrarily large nonnegative number. However, I think it may be possible to efficiently ensure that it is always less than some reasonable value. Cutting off RT calculations at τ=10 or so will probably save enough operations to be worth potentially screwing up vectorization in some places.

Yeah, if you have hard guarantees about the arguments falling into a certain range, then by all means, drop the bounds-check. (I would just be very explicit about it in comments)