oscardssmith opened this issue 2 years ago
Yes .. the Double32 type was an early artifact of hoping to redouble Double64s to compute Double128 accurately. At the time, Julia did not provide the mechanism for that sort of involutive constructor to just work [or to work at all]. A next large reorganization of DoubleFloats just drops Double32, and just as you say .. recommends the use of Float64 as the doubled Float32 precision.
I had not worked with exp and log in quite a while. The low performance is unacceptable .. and was not always the case ... thank you for bringing this to light.
There are at least a few approaches to implementing exp and log. One gives each their own algorithmic specialization and whatever specific computational advantages are to be had. One works hard to speed exp for its own sake and uses it within log to refine the result. At some point I replaced one of those approaches with the other -- and the accuracy suffered some. To undo the revisions exactly, reduplicating the earlier implementations, was an ask too murky for me at the time (my comfort with GitHub has increased slowly).
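To make the second approach concrete, here is a minimal sketch (not the package's actual code) of refining log via exp with one Newton step on f(y) = exp(y) - x; the name log_refined is hypothetical.

using DoubleFloats

# Sketch only: refine log(x) for a Double64 by one Newton step on
# f(y) = exp(y) - x, starting from the Float64 log of the high word.
# Assumes x > 0; log_refined is an illustrative name, not the package's API.
function log_refined(x::Double64)
    y0 = Double64(log(x.hi))        # cheap first approximation, good to ~53 bits
    return y0 + x * exp(-y0) - 1    # Newton step: y1 = y0 + x*exp(-y0) - 1
end

One Newton step roughly squares the error of the starting guess, so one Double64 exp is enough to get from ~53 to ~106 accurate bits.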
How much error are you willing to accept for an exp implementation?
If you sample deeply through all of the arithmetic and each elementary function, and then do the same with other implementations of double-double, you will find we get at least as many, and often more, accurate bits in our results.
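For illustration, a minimal sketch of that kind of sampling harness; the function name, sampling range, and the choice of 2^-106 as the error unit are illustrative, not the package's own test setup.

using DoubleFloats

setprecision(BigFloat, 512)

# Compare Double64 results against a BigFloat reference and report the worst
# relative error in units of 2^-106 (the nominal Double64 precision).
function worst_relerr_2p106(f, xs)
    worst = big(0.0)
    for x in xs
        d   = f(Double64(x))
        ref = f(big(x))
        rel = abs((big(d.hi) + big(d.lo)) - ref) / abs(ref)
        worst = max(worst, rel)
    end
    return Float64(worst / big(2.0)^-106)   # ≈ "ulps" at Double64 precision
end

worst_relerr_2p106(exp, 40 .* rand(1_000) .- 20)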
exp is or may be used in constructing other functions that are defined in terms of exp and other things. So, there is less leeway with the arithmetic ops and exp and some others. I like that we do Float106 in a way that better protects the accuracy of the lower order bits [not always the lowest order bit].
How would you feel about 4 ulps (or so) in the low part of the output? I think that will be roughly the threshold that is achievable without significant usage of triple compensated arithmetic.
I need to find the version of that number I had decided upon some time ago. I do recall there being some threshold of cooperation throughout the functional realizations. OTOH, rather than go full triple-double inside exp, using triple-doubles for the first two-ish coeffs may be the happy medium.
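As a rough illustration of carrying a few leading coefficients beyond double-double precision, one could split each BigFloat coefficient into two or three non-overlapping Float64 limbs; split3 here is just an illustrative helper, not existing package code.

# Split a BigFloat coefficient into three Float64 limbs, giving roughly
# 150 bits for the few coefficients that need the extra headroom.
function split3(c::BigFloat)
    c1 = Float64(c)
    c2 = Float64(c - c1)
    c3 = Float64(c - c1 - c2)
    return (c1, c2, c3)
end

split3(log(big(2)))   # three non-overlapping pieces of log(2)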
How does this look? It's still missing the overflow/underflow checks, but I'm seeing this as benching roughly 10x faster than the current implementation, and I think the error should be acceptable (I haven't fully tested it though)
using DoubleFloats, ProfileView, Plots
setprecision(BigFloat, 2048)

# error-free transformation: hi + lo == a + b exactly
@inline function two_sum(a, b)
    hi = a + b
    a1 = hi - b
    lo = (a - a1) + (b - (hi - a1))
    return hi, lo
end

# compensated Horner evaluation: returns the Float64 value hi plus a
# correction lo that accumulates the rounding error of each step
@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)
        hi, err2 = two_sum(pi, prod)
        lo = fma(lo, x, err1 + err2)
    end
    return hi, lo
end

# Taylor coefficients of 2^x - 1 = sum_{n>=1} (x*log(2))^n / n!,
# split into high and low Float64 parts
const coefs = Tuple(log(big(2))^n/factorial(big(n)) for n in 1:10)
const coefs_hi = Float64.(coefs)
const coefs_lo = Float64.(coefs .- coefs_hi)

# kernel: approximately 2^x - 1 for small x, returned as a Double64
function exp_kernel(x::Float64)
    hi, lo = exthorner(x, coefs_hi)
    lo2 = evalpoly(x, coefs_lo)
    #lo2 = Float64(evalpoly(x, coefs.-coefs_hi))
    hix = hi*x
    lo_all = fma(lo, x, fma(lo2, x, fma(hi, x, -hix)))
    return Double64(hix, lo_all)
end

# tables of 2^(j/16) and 2^(j/256), j = 0..15, rounded to Double64
function make_table(size, n=1)
    t_array = zeros(Double64, size)
    for j in 1:size
        val = 2.0^(BigFloat(j-1)/(16*n))
        t_array[j] = val
    end
    return Tuple(t_array)
end
const T1 = make_table(16)
const T2 = make_table(16, 16)

function double_exp2(x::Float64)
    N = round(Int, 256*x)                     # nearest multiple of 1/256
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]                   # together, j1*j2*2^k == 2^(N/256)
    j2 = T2[N&15 + 1]
    r = exp_kernel(muladd((-1/256), N, x))    # 2^(x - N/256) - 1, with |x - N/256| <= 1/512
    return r * j1 * j2 * exp2(k)
end

function double_exp2(a::Double64)
    x = a.hi
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = fma((-1/256), N, x)
    poly = exp_kernel(r)                      # 2^r - 1
    e2k = exp2(k)
    poly_lo = exp_kernel(a.lo)                # 2^(a.lo) - 1
    lo_part = fma(poly, poly_lo, poly_lo) + poly   # 2^(r + a.lo) - 1
    ans = fma(j1*j2, lo_part, j1*j2)               # j1*j2 * 2^(r + a.lo)
    return e2k*ans
end
Edit: This still has a bug somewhere that is causing it to not give nearly as many sig figs as it should.
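One way to sanity-check a kernel like this against a BigFloat reference (assuming the definitions above have been evaluated first):

x = Double64(1.2345)
approx = double_exp2(x)
ref = big(2.0)^(big(x.hi) + big(x.lo))                       # reference 2^x
Float64(abs((big(approx.hi) + big(approx.lo)) - ref) / ref)  # relative error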
It looks good, much more useful than was available an hour ago! I need to write a test harness with more focused test regions.
I just updated the code. I'm seeing about half the error of the current implementation with 10x speed improvement. Now I just need to make a PR and add back the overflow/underflow checks.
super good, smiling bits
let me know when you want to merge something
Can you merge https://github.com/oscardssmith/DoubleFloats.jl/tree/oscardssmith-faster-square? it's a smaller change, but should be ready to merge. this PR needs a small amount of extra work to avoid an accuracy regression (unless you are OK with a minor accuracy regression).
Merged the faster-square. I strongly prefer maintaining accuracy. Is the regression confined to some lesser-used region? Please give me a few examples that regress accuracy.
I just ran some rand * scale vectors of Double64 and Float128 through exp and log. Double64 outperforms Float128 by 2x .. 3.5x; that is rather impressive, considering all the person-years of development and all the more person-years of use and reporting that libquadmath has going for it.
@oscardssmith is there some corner of the domain that you found troublesome, or did merging your earlier improvement work quite well?
what do you mean?
The title on this issue seems outdated. Before changing it, I am confirming with you that the performance of exp and log has improved over these past months. It appears it has.
how have they changed? the source looks identical
Limited benchmarking does not show slow performance relative to Float128; allowing some proportional time adjustment for Float128's 7 extra sigbits does not appear to matter.
I used 1024 rands, multiplied them by 2, and subtracted 1 to have both positive and negative values. These signed floats were scaled, giving them these extrema: (34.8, -34.9). A second test vector was the first one translated for log; the second vector's extrema: (0.17, 69.9).
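For reference, a sketch of how such test vectors might be built; the variable names are chosen to match the benchmarks below, and the scale and shift of 35 are approximations picked to reproduce the quoted extrema.

using DoubleFloats, Quadmath, BenchmarkTools

signed = 35 .* (2 .* rand(1024) .- 1)            # roughly (-35, 35)
double64        = Double64.(signed)
float128        = Float128.(signed)
double64_nonneg = Double64.(signed .+ 35)        # shifted positive, for log
float128_nonneg = Float128.(signed .+ 35)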
julia> (@btime map(exp,xs) setup=(xs=float128));
1.513 ms (1 allocation: 16.12 KiB)
julia> (@btime map(exp,xs) setup=(xs=double64));
326.600 μs (1 allocation: 16.12 KiB)
julia> (@btime map(log,xs) setup=(xs=float128_nonneg));
665.800 μs (1 allocation: 16.12 KiB)
julia> (@btime map(log,xs) setup=(xs=double64_nonneg));
388.900 μs (1 allocation: 16.12 KiB)
yeah, it's not slow compared to float128, but that still is roughly 100x slower than Float64.
Any progress since then?
no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3.
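For anyone unfamiliar with it, a minimal sketch of what MultiFloats.jl's Float64x3 looks like for the basic arithmetic; as the next comment notes, the package does not itself provide exp/log.

using MultiFloats

x = Float64x3(2.0)      # three Float64 limbs, roughly 160 significand bits
y = sqrt(x)             # +, -, *, / and sqrt are supported
y * y - x               # tiny residual, far below Float64 precision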
Thanks, I was not aware of the MultiFloats package before. But its webpages say it doesn't support functions like exp and log? Another question: can MultiFloats be used on a GPU? DoubleFloats can have impressive performance with matrix ops on GPU, though I failed to make it work for exp.
This is not an easy fix; the time taken is a result of optimizing for accuracy over speed. There is a less accurate implementation (as I recall) that is faster. I could add that as exp_unsafe -- let me see about tracking it down and benchmarking that.
I just spent time on this -- the current exp benchmarks 10x slower than Float64 -- no longer 100x slower. @oscardssmith @photor
julia> using DoubleFloats
julia> x = log(2.0)
0.6931471805599453
julia> dx = log(Double64(2.0))
6.93147180559945309417232121458179079e-01
julia> @b exp(x)
35.361 ns (2.01 allocs: 32.264 bytes)
julia> @b exp(dx)
359.211 ns (2.11 allocs: 67.158 bytes)
julia> x = log(37.0)
3.6109179126442243
julia> dx = log(Double64(37.0))
3.61091791264422444436809567103144955
julia> @b exp(x)
34.032 ns (2.01 allocs: 32.256 bytes)
julia> @b exp(dx)
374.667 ns (2.11 allocs: 67.200 bytes)
Julia Version 1.10.2 Commit bd47eca2c8 (2024-03-01 10:14 UTC)
What computer are you on? 35ns for Float64 exp is way too slow. I get ~5ns on an 11th gen laptop on battery.
julia> @btime exp($(Ref(x)[]))
4.885 ns (0 allocations: 0 bytes)
2.0
Thanks for that. It seems that using Chairmarks for this messed up my results. Back to the mill.
Any chance to make it work on GPU?
julia> using DoubleFloats, Adapt, CUDA
julia> a=randn(Double64,1000,1000);
julia> gpua=adapt(CuArray, a);
julia> gpub=similar(gpua);
julia> function gpu_exp(y,x)
           for i=1:length(y)
               @inbounds y[i]=exp(x[i])
           end
           return nothing
       end
gpu_exp (generic function with 1 method)
julia> @cuda gpu_exp(gpub,gpua);
ERROR: InvalidIRError: compiling MethodInstance for gpu_exp(::CuDeviceMatrix{Double64, 1}, ::CuDeviceMatrix{Double64, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
[1] modf
@ .\math.jl:902
[2] modf
@ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:170
[3] calc_exp
@ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
[4] exp
@ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
[5] gpu_exp
@ .\REPL[7]:3
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
[1] modf
@ .\math.jl:902
[2] modf
@ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:171
[3] calc_exp
@ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
[4] exp
@ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
[5] gpu_exp
@ .\REPL[7]:3
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
@ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\validation.jl:149
[2] macro expansion
@ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:415 [inlined]
[3] macro expansion
@ C:\Users\DELL\.julia\packages\TimerOutputs\RsWnF\src\TimerOutput.jl:253 [inlined]
[4] macro expansion
@ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:414 [inlined]
[5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
@ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\utils.jl:89
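The stack trace points at Base.modf inside DoubleFloats' calc_exp, which GPUCompiler rejects as a call through a literal pointer. Purely as an untested sketch, a ccall-free modf could be built from trunc; whether substituting something like this inside DoubleFloats would be enough for exp to compile under CUDA has not been verified.

# Untested sketch: a modf without library calls. Edge cases such as Inf are
# not handled the way Base.modf handles them.
function modf_nocall(x::Float64)
    ipart = trunc(x)
    return (x - ipart, ipart)    # (fractional part, integral part)
end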
both are around 100x slower than float64. exp in particular should be able to be done relatively quickly since exp(hi+lo) = exp(hi)*exp(lo), and since hi can easily be reduced to [1,2), exp(lo) only needs 2 terms. I have fewer ideas for log, but there should be something better than the current behavior. also, it's worth figuring on double64 versions of these functions because the double32 versions can be done really quickly using float64.
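For concreteness, a minimal sketch of that decomposition (illustrative only, not the package's implementation): since |lo| is at most about half an ulp of hi, a short series suffices for exp(lo).

using DoubleFloats

# Illustration of exp(hi + lo) = exp(hi) * exp(lo); a real implementation
# would carry the exp(lo) correction in double-double rather than a bare Float64.
function exp_split(a::Double64)
    ehi = exp(Double64(a.hi))       # the expensive part
    elo = 1 + a.lo + a.lo^2/2       # exp(a.lo) ≈ first terms, since |a.lo| is tiny
    return ehi * elo
end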