JuliaMath / DoubleFloats.jl

math with more good bits
MIT License

`exp` and `log` are much slower than they should be. #135

Open oscardssmith opened 2 years ago

oscardssmith commented 2 years ago

Both are around 100x slower than Float64. `exp` in particular should be able to be done relatively quickly, since exp(hi + lo) = exp(hi) * exp(lo), and since hi can easily be reduced to [1, 2), exp(lo) only needs 2 terms.
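A minimal sketch of that identity (not the package's exp; the sample value and the two-term truncation of exp(lo) are just for illustration, assuming hi is of order 1 so |lo| is below ~2^-52):

using DoubleFloats

d = sqrt(Double64(2.0))               # d.hi + d.lo carries ~106 bits of sqrt(2)
hi, lo = d.hi, d.lo                   # with hi of order 1, |lo| is tiny (~1e-17 here)

exp_lo = 1 + Double64(lo)             # two Taylor terms: exp(lo) ≈ 1 + lo, truncation error ~lo^2/2
approx = exp(Double64(hi)) * exp_lo   # exp(hi + lo) == exp(hi) * exp(lo); the hard part is exp(hi)

approx - exp(d)                       # agrees with the direct evaluation to roughly Double64 precision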

I have fewer ideas for log, but there should be something better than the current behavior.

Also, it's worth focusing on the Double64 versions of these functions, because the Double32 versions can be done really quickly using Float64.
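A hypothetical sketch of that Double32-via-Float64 shortcut (the function name is made up, not part of the package): a Double32 carries roughly 48 significand bits, which fit inside Float64's 53, so a single Float64 evaluation suffices.

using DoubleFloats

function exp_via_float64(x::Double32)
    y  = exp(Float64(x.hi) + Float64(x.lo))   # one Float64 exp covers all of Double32's precision
    hi = Float32(y)                           # split the Float64 result back into hi/lo words
    lo = Float32(y - hi)
    return Double32(hi, lo)
end

exp_via_float64(Double32(Float32(1.5)))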

JeffreySarnoff commented 2 years ago

Yes .. the Double32 type was an early artifact of hoping to redouble Double64s to compute Double128 accurately. At the time, Julia did not provide the mechanism for that sort of involutive constructor to just work [or to work at all]. The next large reorganization of DoubleFloats just drops Double32 and, just as you say, recommends the use of Float64 as the doubled Float32 precision.

I had not worked with exp and log in quite a while. The low performance is unacceptable .. and was not always the case ... thank you for bringing this to light.

JeffreySarnoff commented 2 years ago

There are at least a few approaches to implementing exp and log. One gives each its own algorithmic specialization and whatever specific computational advantages are to be had. Another works hard to speed up exp for its own sake and uses it within log to refine the result. At some point I replaced one of those approaches with the other -- and the accuracy suffered some. To undo the revisions exactly, reduplicating the earlier implementations, was an ask too murky for me at the time (my comfort with GitHub has increased slowly).

oscardssmith commented 2 years ago

How much error are you willing to accept for an exp implementation?

JeffreySarnoff commented 2 years ago

If you sample deeply through all of the arithmetic and each elementary function, and then do the same with other implementations of double-double, we get at least as many and often more accurate bits in our results. exp is, or may be, used in constructing other functions that are defined in terms of exp and other things. So there is less leeway with the arithmetic ops, exp, and some others. I like that we do the Float106 arithmetic in a way that better protects the accuracy of the lower-order bits [not always the lowest-order bit].

oscardssmith commented 2 years ago

How would you feel about 4 ulps (or so) in the low part of the output? I think that will be roughly the threshold that is achievable without significant usage of triple compensated arithmetic.
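To make that target concrete, here is one way such an error could be measured against a BigFloat reference (a sketch only; the helper name and the choice of eps(hi) * 2^-53 as "one ulp of the low word" are assumptions, not the package's definition):

using DoubleFloats

# how many "low-word ulps" a Double64 result of f is away from a BigFloat reference
function ulps_low(f, x::Double64)
    y   = f(x)
    ref = f(BigFloat(x.hi) + BigFloat(x.lo))
    err = abs((BigFloat(y.hi) + BigFloat(y.lo)) - ref)
    return Float64(err / (eps(y.hi) * 2.0^-53))   # one ulp of the low component, by this convention
end

maximum(ulps_low(exp, Double64(x)) for x in range(-10, 10; length=1001))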

JeffreySarnoff commented 2 years ago

I need to find the version of that number I had decided upon some time ago. I do recall there being some threshold of cooperation throughout the functional realizations. OTOH, rather than go full triple-double inside exp, using triple-doubles for the first two-ish coefficients may be the middle ground.
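For illustration, a hypothetical splitting of just the leading coefficient into three Float64 words, while the remaining coefficients stay as hi/lo pairs:

c    = log(big(2))                # leading Taylor coefficient of 2^x - 1
c_hi = Float64(c)
c_md = Float64(c - c_hi)
c_lo = Float64(c - c_hi - c_md)   # a third word, kept only for the first coefficient or two

(c_hi, c_md, c_lo)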

oscardssmith commented 2 years ago

How does this look? It's still missing the overflow/underflow checks, but I'm seeing it bench roughly 10x faster than the current implementation, and I think the error should be acceptable (I haven't fully tested it though).

using DoubleFloats, ProfileView, Plots   # ProfileView and Plots are not used by the code below
setprecision(BigFloat, 2048)             # high-precision reference for the coefficients and tables

# error-free transformation: returns (hi, lo) with hi + lo == a + b exactly
@inline function two_sum(a, b)
    hi = a + b
    a1 = hi - b
    lo = (a - a1) + (b - (hi - a1))
    return hi, lo
end

# compensated Horner evaluation: returns the leading value and an accumulated correction term
@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)
        hi, err2 = two_sum(pi,prod)
        lo = fma(lo, x, err1 + err2)
    end
    return hi, lo
end

# Taylor coefficients of 2^x - 1 = sum over n >= 1 of (x*log(2))^n / n!, split into hi/lo Float64 parts
const coefs = Tuple(log(big(2))^n/factorial(big(n)) for n in 1:10)
const coefs_hi =  Float64.(coefs)
const coefs_lo = Float64.(coefs .- coefs_hi)

# evaluates 2^x - 1 for a reduced argument x, returning the result as a Double64
function exp_kernel(x::Float64)
    hi, lo = exthorner(x, coefs_hi)
    lo2 = evalpoly(x, coefs_lo)
    #lo2 = Float64(evalpoly(x, coefs.-coefs_hi))
    hix = hi*x
    lo_all = fma(lo, x, fma(lo2, x, fma(hi, x, -hix)))
    return Double64(hix, lo_all)
end

# lookup tables: T1[j] = 2^((j-1)/16) and T2[j] = 2^((j-1)/256), stored as Double64
function make_table(size, n=1)
    t_array = zeros(Double64, size)
    for j in 1:size
        val = 2.0^(BigFloat(j-1)/(16*n))
        t_array[j] = val
    end
    return Tuple(t_array)
end
const T1 = make_table(16)
const T2 = make_table(16, 16)

function double_exp2(x::Float64)
    # reduce x = k + i/16 + j/256 + r with |r| <= 1/512, so that
    # 2^x = 2^k * T1[i+1] * T2[j+1] * 2^r
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = exp_kernel(muladd((-1/256), N, x))    # 2^(reduced argument) - 1
    return fma(j1*j2, r, j1*j2) * exp2(k)     # reattach the leading 1: j1*j2*(1 + r)*2^k
end

function double_exp2(a::Double64)
    # same reduction as above on the high word; the low word is handled separately,
    # using 2^(hi + lo) = 2^hi * 2^lo
    x = a.hi
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = fma((-1/256), N, x)                        # reduced high word, |r| <= 1/512
    poly = exp_kernel(r)                           # 2^r - 1
    e2k = exp2(k)
    poly_lo = exp_kernel(a.lo)                     # 2^(a.lo) - 1
    lo_part = fma(poly, poly_lo, poly_lo) + poly   # (1 + poly)*(1 + poly_lo) - 1
    ans = fma(j1*j2, lo_part, j1*j2)
    return e2k*ans
end
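A quick way to check how many significant bits this actually delivers against a BigFloat reference (a test helper made up for illustration, not part of the proposal):

# how many correct bits does the sketch deliver at x?
function correct_bits(x::Float64)
    got = double_exp2(Double64(x))
    ref = BigFloat(2)^BigFloat(x)
    err = abs((BigFloat(got.hi) + BigFloat(got.lo)) - ref) / ref
    return iszero(err) ? Inf : -log2(Float64(err))
end

minimum(correct_bits, range(-10, 10; length=1001))   # worst case over a sample of [-10, 10]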

Edit: This still has a bug somewhere that is causing it to not give nearly as many sig figs as it should.

JeffreySarnoff commented 2 years ago

It looks good, much more useful than what was available an hour ago! I need to write a harness with more focused test regions.

oscardssmith commented 2 years ago

I just updated the code. I'm seeing about half the error of the current implementation with 10x speed improvement. Now I just need to make a PR and add back the overflow/underflow checks.

JeffreySarnoff commented 2 years ago

super good, smiling bits

JeffreySarnoff commented 2 years ago

let me know when you want to merge something

oscardssmith commented 2 years ago

Can you merge https://github.com/oscardssmith/DoubleFloats.jl/tree/oscardssmith-faster-square? It's a smaller change, but should be ready to merge. This PR needs a small amount of extra work to avoid an accuracy regression (unless you are OK with a minor accuracy regression).

JeffreySarnoff commented 2 years ago

Merged the faster-square branch. I strongly prefer maintaining accuracy. Is the regression confined to some lesser-used region? Please give me a few examples that regress accuracy.

JeffreySarnoff commented 2 years ago

I just ran some rand * scale vectors of Double64 and Float128 through exp and log. Double64 outperforms Float128 by 2x .. 3.5x. That is rather impressive, considering all the person-years of development and the many more person-years of use and reporting that libquadmath has going for it.

@oscardssmith is there some corner of the domain that you found troublesome, or did merging your earlier improvement work quite well?

oscardssmith commented 2 years ago

what do you mean?

JeffreySarnoff commented 2 years ago

The title on this issue seems outdated. Before changing it, I am confirming with you that the performance of exp and log has improved over these past months. It appears it has.

oscardssmith commented 2 years ago

How have they changed? The source looks identical.

JeffreySarnoff commented 2 years ago

Limited benchmarking does not show slow performance relative to Float128; allowing some proportional time adjustment for Float128's 7 extra significand bits does not appear to matter.

I used 1024 rands, multiplied them by 2, and subtracted 1 to have both positive and negative values. These signed floats were scaled, giving them these extrema: (34.8, -34.9). A second test vector was the first one translated for log; the second vector's extrema: (0.17, 69.9).
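Something along these lines reproduces test vectors of that shape (the exact scale and shift are guesses; Float128 is from Quadmath.jl):

using DoubleFloats, Quadmath, BenchmarkTools

signed = 35 .* (2 .* rand(1024) .- 1)    # roughly uniform in (-35, 35)
double64        = Double64.(signed)
float128        = Float128.(signed)
double64_nonneg = double64 .+ 35         # translated to be positive, for log
float128_nonneg = float128 .+ 35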

julia> (@btime map(exp,xs) setup=(xs=float128));
  1.513 ms (1 allocation: 16.12 KiB)

julia> (@btime map(exp,xs) setup=(xs=double64));
  326.600 μs (1 allocation: 16.12 KiB)

julia> (@btime map(log,xs) setup=(xs=float128_nonneg));
  665.800 μs (1 allocation: 16.12 KiB)

julia> (@btime map(log,xs) setup=(xs=double64_nonneg));
  388.900 μs (1 allocation: 16.12 KiB)

oscardssmith commented 2 years ago

Yeah, it's not slow compared to Float128, but that is still roughly 100x slower than Float64.

photor commented 8 months ago

> Yeah, it's not slow compared to Float128, but that is still roughly 100x slower than Float64.

Any progress since then?

oscardssmith commented 8 months ago

No, I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable, but it is one where you need to be careful to get all the errors to line up properly. It would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements, since if the last couple of bits matter, you are likely better off with MultiFloats' Float64x3.
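For reference, a minimal sketch of the MultiFloats type mentioned above (arithmetic only; whether it covers exp/log is a separate question, raised below):

using MultiFloats

a = Float64x3(1.0) / Float64x3(3.0)   # roughly triple-Float64 (~160-bit) precision
Float64x3(3.0) * a - Float64x3(1.0)   # residual far smaller than a Double64 could represent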

photor commented 8 months ago

> No, I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable, but it is one where you need to be careful to get all the errors to line up properly. It would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements, since if the last couple of bits matter, you are likely better off with MultiFloats' Float64x3.

Thanks, I was not aware of the MultiFloats package before. But its webpage says it doesn't support functions like exp and log? Another question: can MultiFloats be used on a GPU? DoubleFloats can have impressive performance with matrix ops on a GPU, though I failed to make it work for exp.

JeffreySarnoff commented 8 months ago

This is not an easy fix; the time taken is a result of optimizing for accuracy over speed. There is a less accurate implementation (as I recall) that is faster. I could add that as exp_unsafe -- let me see about tracking it down and benchmarking that.

JeffreySarnoff commented 8 months ago

I just spent time on this -- the current exp benchmarks 10x slower than Float64 -- no longer 100x slower. @oscardssmith @photor

julia> using DoubleFloats

julia> x = log(2.0)
0.6931471805599453

julia> dx = log(Double64(2.0))
6.93147180559945309417232121458179079e-01

julia> @b exp(x)
35.361 ns (2.01 allocs: 32.264 bytes)

julia> @b exp(dx)
359.211 ns (2.11 allocs: 67.158 bytes)

julia> x = log(37.0)
3.6109179126442243

julia> dx = log(Double64(37.0))
3.61091791264422444436809567103144955

julia> @b exp(x)
34.032 ns (2.01 allocs: 32.256 bytes)

julia> @b exp(dx)
374.667 ns (2.11 allocs: 67.200 bytes)

Julia Version 1.10.2 Commit bd47eca2c8 (2024-03-01 10:14 UTC)

oscardssmith commented 8 months ago

What computer are you on? 35ns for Float64 exp is way too slow. I get ~5ns on an 11th gen laptop on battery.

julia> @btime exp($(Ref(x)[]))
  4.885 ns (0 allocations: 0 bytes)
2.0

JeffreySarnoff commented 8 months ago

Thanks for that. It seems that using Chairmarks for this messed up my results. Back to the mill.

photor commented 8 months ago

> This is not an easy fix; the time taken is a result of optimizing for accuracy over speed. There is a less accurate implementation (as I recall) that is faster. I could add that as exp_unsafe -- let me see about tracking it down and benchmarking that.

Any chance to make it work on GPU?

julia> using DoubleFloats, Adapt, CUDA

julia> a=randn(Double64,1000,1000);

julia> gpua=adapt(CuArray, a);

julia> gpub=similar(gpua);

julia> function gpu_exp(y,x)
       for i=1:length(y)
       @inbounds y[i]=exp(x[i])
       end
       return nothing
       end
gpu_exp (generic function with 1 method)

julia> @cuda gpu_exp(gpub,gpua);
ERROR: InvalidIRError: compiling MethodInstance for gpu_exp(::CuDeviceMatrix{Double64, 1}, ::CuDeviceMatrix{Double64, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
 [1] modf
   @ .\math.jl:902
 [2] modf
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:170
 [3] calc_exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
 [4] exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
 [5] gpu_exp
   @ .\REPL[7]:3
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
 [1] modf
   @ .\math.jl:902
 [2] modf
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:171
 [3] calc_exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
 [4] exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
 [5] gpu_exp
   @ .\REPL[7]:3
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
 [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
   @ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\validation.jl:149
 [2] macro expansion
   @ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:415 [inlined]
 [3] macro expansion
   @ C:\Users\DELL\.julia\packages\TimerOutputs\RsWnF\src\TimerOutput.jl:253 [inlined]
 [4] macro expansion
   @ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:414 [inlined]
 [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
   @ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\utils.jl:89

julia>
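For what it's worth, the stacktrace points at `modf` (reached via `calc_exp`) as the call the GPU compiler rejects. A minimal repro sketch, assuming the same CUDA setup as above, would be expected to hit the same error without the rest of exp:

# per the stacktrace, `modf` on Double64 is the unsupported call
function gpu_modf(y, x)
    @inbounds y[1] = modf(x[1])[1]   # fractional part of a Double64
    return nothing
end

# @cuda gpu_modf(gpub, gpua)         # expected to raise the same InvalidIRError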