stevengj opened 8 years ago
Will the new algorithm still be optimal when used in a vectorized loop?
This may be relevant for the new libm.jl work. Cc @musm
@yuyichao, probably? Even if the compiler inlines a special-function call like `erfinv`, I'm skeptical that compilers are powerful enough to vectorize the polynomial evaluation. e.g. I don't see any evidence of this in:
```julia
julia> function foo(a)
           s = zero(eltype(a))
           @simd for i in eachindex(a)
               s += @inbounds @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
           end
           return s
       end
foo (generic function with 1 method)

julia> @code_native foo(rand(10))
```
if I'm reading the asm output correctly.
I'm not worrying about inlining special functions that use `@evalpoly`; I'm worrying about the case where `@evalpoly` is used directly in the loop.
The code you posted has two problems.
1. It's accumulating `nothing`.
2. Until https://github.com/JuliaLang/julia/pull/15558 is merged, you need to put the `@inbounds` on the loop or before the assignment.
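For example, a corrected loop applying both points might look like this (a sketch):

```julia
# Sketch of the fixed reduction: @inbounds now wraps the whole assignment,
# so `s` actually accumulates the polynomial values.
function foo(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
    end
    return s
end
```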
It does seem that the loop body is somehow too complicated for LLVM to vectorize in this case (might be worth investigating). However, if you are doing purely elementwise operations, that's a very valid use case IMHO, and is even more relevant now with the broadcast syntax, i.e.
```julia
julia> function foo2(a)
           s = zero(eltype(a))
           @simd for i in eachindex(a)
               @inbounds a[i] = @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
           end
           return s
       end
```
Then LLVM vectorizes it without any problem (LLVM 3.8.1 at least).
P.S. I strongly recommend checking `code_llvm`, even for people who know assembly well. It's much easier to notice the problem with `@inbounds` in the LLVM IR.
Ah, thanks. Even so, it might be nice to have a SIMD variant of `@evalpoly` for use in special functions.
Or in general for cases not meant to (or not possible to) be vectorized in a loop; I agree. It would be nice if the new algorithm doesn't regress in the vectorized-loop case, but I guess the answer is that it might?
Estrin's method ... I got bored in class ~~but the following is already more efficient than `@horner`~~

Only works for odd polynomials for now. I don't think SIMD instructions are actually being used, however; could anyone confirm?
```julia
# only for odd polynomials
# c in descending order
using BenchmarkTools

x = 2.0
c = collect(Float64, 8:-1:1)

function estrin(x, c)
    n = length(c)
    m = n ÷ 2
    ap = copy(c)    # copy so the caller's coefficients aren't clobbered
    a = similar(c)
    X = x
    while n > 1
        @simd for i = 1:m
            @inbounds a[m+1-i] = muladd(ap[n+1-(2i-1)], X, ap[n+1-(2i-1)-1])
        end
        ap[:] = a[:]
        X = X*X
        n = m
        m = m ÷ 2
    end
    return ap[1]
end

f(x, c) = @evalpoly x c[1] c[2] c[3] c[4] c[5] c[6] c[7] c[8]

@benchmark estrin(x, c)
@benchmark f(x, c)
```
There are a lot of other opportunities for parallelism in simple polynomial evaluation, though maybe simpler is better; e.g. all the `X*X` squarings for the various powers can also be computed in parallel.
What are you trying to achieve with globals?
My bad, fixed, thanks!
There's also unnecessary copy/allocation. `ap` should be updated in place. Or even better, IIUC, you can just swap the two buffers.
I don't think you can get vectorization with non-unit stride (there's a trick to kind of get it working, though).
I just realized this is the evalpoly issue (I thought it was one of the special/libm function ones...). In this case any allocation is already a huge loss. `@evalpoly` generates an expression at parse time and the replacement should do that too.
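For reference, on real arguments `@evalpoly` unrolls at parse time into a fixed `muladd` chain, roughly like this (a sketch; Base's actual expansion differs in details):

```julia
# Roughly what @evalpoly(x, 1.0, 2.0, 3.0) expands to: no loop, no array.
muladd(x, muladd(x, 3.0, 2.0), 1.0)
```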
Would `@inbounds ap[1:m] = a[1:m]` swap the buffers? Or am I misunderstanding what you are saying?
```julia
ap, a = a, ap
```

`ap[1:m] = a[1:m]` is still allocating a new array (for the right-hand-side slice) and copying the data. `ap[:] = a` won't allocate, but will still copy.
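If an actual copy is ever needed, there's also a non-allocating form (0.5-era name; `copyto!` in later versions):

```julia
copy!(ap, 1, a, 1, m)   # copies a[1:m] into ap[1:m] without a temporary array
```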
@musm, rather than being "more efficient", in a quick benchmark I find your `estrin` function to be about 50x slower than the inlined polynomial evaluation of `@evalpoly` for this degree-7 polynomial. This is not surprising, because the difference between inlined polynomial evaluation and any kind of runtime loop is vast (in addition to things like inadvertent array allocation that handicap your implementation).

This issue is about improving the inlined `@evalpoly`; let's not side-track it into a discussion of how to optimize some runtime-loop code.
apologies
No worries @musm, I just wanted to nip the digression in the bud.
On the topic of improving `@evalpoly`:

> However, it seems likely to be faster to try and exploit SIMD instructions for these polynomial evaluations, e.g. with Estrin's algorithm or some other technique.
I have a decent implementation of Estrin's algorithm that's a regular function (for odd polynomials, still). I have two versions: one without SIMD, which is 10x slower than `@horner`, and one that takes advantage of SIMD, which is 40x slower, based on benchmarks.
It's not exactly clear how one could turn this algorithm into a macro that would expand to avoid runtime loops and take advantage of SIMD operations. If I understand things, would it be possible to expand the macro to something like the following pseudocode? (EDIT: yes it is, and I have the code for this.)
```
# evaluate: c8*x^7 + c7*x^6 + ... + c1
@simd begin
    a1 = c2*x + c1
    a2 = c4*x + c3
    a3 = c6*x + c5
    a4 = c8*x + c7
end
x = x*x
@simd begin
    a1 = a2*x + a1
    a2 = a4*x + a3
end
x = x*x
@simd begin
    a1 = a2*x + a1
end
return a1
```
`@simd` used this way is a no-op. The vectorization here requires writing it in a way that can be recognized by the LLVM passes we add with `-O3`, or explicitly using `VecElement`.
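For concreteness, a homogeneous tuple of `VecElement`s is what maps to an LLVM vector type (a minimal illustration):

```julia
# An NTuple of VecElement corresponds to an LLVM vector (<2 x double> here),
# which is what lets LLVM emit packed instructions for it.
v = (Base.VecElement(1.0), Base.VecElement(2.0))
```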
The SIMD package provides the respective operations, based on `Base.VecElement`:
```julia
using SIMD

function evalpoly{T}(c1, c2, c3, c4, c5, c6, c7, c8, x::T)
    c2468 = Vec{4,T}((c2, c4, c6, c8))
    c1357 = Vec{4,T}((c1, c3, c5, c7))
    a1234 = c2468 * x + c1357
    x = x^2
    a24 = Vec{2,T}((a1234[2], a1234[4]))
    a13 = Vec{2,T}((a1234[1], a1234[3]))
    a12 = a24 * x + a13
    x = x^2
    a2 = Vec{1,T}((a12[2],))
    a1 = Vec{1,T}((a12[1],))
    r = a2 * x + a1
    r[1]
end

evalpoly(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.5)
@code_native evalpoly(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.5)
```
If you look at the generated code, you'll see immediately why this turns out to be very slow:
```
julia> @code_native evalpoly(1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0, 0.5)
	.section	__TEXT,__text,regular,pure_instructions
Filename: REPL[20]
	pushq	%rbp
	movq	%rsp, %rbp
	vmovsd	16(%rbp), %xmm8         ## xmm8 = mem[0],zero
Source line: 121
	vunpcklpd	%xmm7, %xmm5, %xmm5     ## xmm5 = xmm5[0],xmm7[0]
	vunpcklpd	%xmm3, %xmm1, %xmm1     ## xmm1 = xmm1[0],xmm3[0]
	vinsertf128	$1, %xmm5, %ymm1, %ymm1
Source line: 121
	vunpcklpd	%xmm6, %xmm4, %xmm3     ## xmm3 = xmm4[0],xmm6[0]
	vunpcklpd	%xmm2, %xmm0, %xmm0     ## xmm0 = xmm0[0],xmm2[0]
	vinsertf128	$1, %xmm3, %ymm0, %ymm0
Source line: 114
	vbroadcastsd	%xmm8, %ymm2
Source line: 567
	vmulpd	%ymm2, %ymm1, %ymm1
Source line: 567
	vaddpd	%ymm1, %ymm0, %ymm0
Source line: 5
	vmulsd	%xmm8, %xmm8, %xmm1
Source line: 121
	vextractf128	$1, %ymm0, %xmm2
Source line: 121
	vunpckhpd	%xmm2, %xmm0, %xmm3     ## xmm3 = xmm0[1],xmm2[1]
Source line: 121
	vunpcklpd	%xmm2, %xmm0, %xmm0     ## xmm0 = xmm0[0],xmm2[0]
Source line: 114
	vmovddup	%xmm1, %xmm2            ## xmm2 = xmm1[0,0]
Source line: 567
	vmulpd	%xmm3, %xmm2, %xmm2
Source line: 567
	vaddpd	%xmm2, %xmm0, %xmm0
Source line: 9
	vmulsd	%xmm1, %xmm1, %xmm1
Source line: 121
	vpermilpd	$1, %xmm0, %xmm2        ## xmm2 = xmm0[1,0]
Source line: 567
	vmulsd	%xmm2, %xmm1, %xmm1
Source line: 567
	vaddsd	%xmm1, %xmm0, %xmm0
Source line: 13
	popq	%rbp
	vzeroupper
	retq
```
There are fewer actual arithmetic operations (`vmul`, `vadd`), but there are many additional operations to move data between vector lanes.

The first six operations (`vunpcklpd` and `vinsertf128`) are merely setting up the constants `c1`...`c8`; these would normally be known at compile time, making those instructions superfluous. The other `vextract`, `vunpck`, `vmov`, and `vperm` instructions remain necessary.
Thanks @eschnett. I created a macro that expands out the operations in Estrin's algorithm to explore possible benefits of vectorization for polynomial evaluation using your SIMD package. See: https://github.com/musm/Estrin.jl (I might rename it later)
@musm Nice!
The expensive part is currently switching between different vector sizes. The following might lead to a faster algorithm: keep `x` in a vector register as well, so that it isn't cast from scalar to vector multiple times; this means squaring it as a vector. The goal would be to get things down to a single vector permute operation in between the different iterations.

However, if this `evalpoly` is called inside a loop, then I expect things to be much faster if `evalpoly` is kept scalar, and the loop is vectorized.
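A rough sketch of the vector-register idea, adapting the `evalpoly` above (hypothetical name; the cross-size shuffles remain, so this only removes the scalar-to-vector broadcasts):

```julia
using SIMD

# Sketch: x lives in a vector register throughout, so squaring it is a
# vector op and it is never re-broadcast from scalar.
function evalpoly_vecx{T}(c1, c2, c3, c4, c5, c6, c7, c8, x::T)
    x4 = Vec{4,T}((x, x, x, x))
    a  = Vec{4,T}((c2, c4, c6, c8)) * x4 + Vec{4,T}((c1, c3, c5, c7))
    x4 = x4 * x4                      # x^2 in every lane
    b  = Vec{2,T}((a[2], a[4])) * Vec{2,T}((x4[1], x4[2])) + Vec{2,T}((a[1], a[3]))
    x2 = x4[1] * x4[1]                # x^4 for the final combine
    b[2] * x2 + b[1]
end
```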
@eschnett I'll try to take those points into consideration in an improved implementation.
Another very simple option for parallelism in `@evalpoly` is to do an even/odd split of the polynomial, at the cost of one additional multiplication:
```julia
using BenchmarkTools
using Base.Math.@horner

c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10 = (1:11...)
x = 2.0

function f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
    x2 = x*x
    begin # ideally we should be able to tell the compiler to run these two in parallel; is threads the right approach here, i.e. run these two on separate threads?
        cc1 = @horner x2 c1 c3 c5 c7 c9  # the odds
        cc2 = @horner x2 c2 c4 c6 c8 c10 # the evens
    end
    c0 + x*cc1 + x2*cc2
end

function f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
    @horner x c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
end

g1(x) = f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
g2(x) = f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)

xx = 5.0*randn(1_000_000)
@show @benchmark g1.(xx)
@show @benchmark g2.(xx)
```
On my computer with `-O3` this gives:
```
julia> @show @benchmark f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
BenchmarkTools.Trial:
  samples:          36
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  53.41 mb
  allocs estimate:  3000055
  minimum time:     136.63 ms (3.22% GC)
  median time:      140.42 ms (3.49% GC)
  mean time:        141.52 ms (3.78% GC)
  maximum time:     155.14 ms (4.54% GC)

julia> @show @benchmark f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
BenchmarkTools.Trial:
  samples:          13
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  328.07 mb
  allocs estimate:  21000055
  minimum time:     391.01 ms (7.11% GC)
  median time:      399.62 ms (7.22% GC)
  mean time:        415.16 ms (10.21% GC)
  maximum time:     499.06 ms (24.32% GC)
```
This is about 3x faster. Is the compiler automatically able to determine that the two Horner evaluations can be run concurrently in this example?
Julia will not yet use multiple threads for concurrent evaluation, but perhaps this benefits from better codegen.
It's very unlikely cheap operations like this will benefit from threading at all.
@musm The time difference seems to be due partially to much higher GC. Can you either call `gc()` before each run, or do `gc_disable()` and run the tests?
Also, I wonder if having all the variables in global scope affects this. Perhaps wrap everything in a function?
`g1.($xx)` instead of `g1.(xx)` should get rid of the global variable, although I doubt it matters much.
And it's better not to use the broadcast form, since, as mentioned before, it's the form that can easily be loop-vectorized even with the old/scalar algorithm. Using the broadcast form also means that you are sensitive to many unrelated factors, including memory allocation, garbage collection, and memory bandwidth.
> although I doubt it matters much.

I take that back; I didn't realize that all the `c*`s are in global scope. Yeah, that's terrible...
If you see that you are allocating 300 MB in a benchmark that should just evaluate a polynomial, then you should look closer at how you are benchmarking. The absolute timings you are getting are also unreasonable given some napkin math on CPU frequency and the number of ops required.
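(For scale, under rough assumptions: a degree-10 Horner chain is ten dependent fused multiply-adds; at ~4 cycles of latency each on a ~3 GHz core, that's about 40 cycles ≈ 13 ns per evaluation, nowhere near the hundreds of milliseconds reported above.)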
Because the coefficients are known at parse time in the macro, it should be able to decide on the optimal vector size at parse time and/or compile time. There should be no runtime switching. (There can be checks like `isa(z, Float64)`, but my understanding is that this check ends up getting evaluated at compile time and there is no runtime cost.)
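A minimal illustration of why such a check is free (an assumed toy example, not code from this thread):

```julia
# For a concrete x, type inference resolves the isa branch, so @code_llvm
# shows only one of the two arms surviving in the compiled code.
evalpoly2(x) = isa(x, Float64) ? muladd(x, 2.0, 1.0) : 1 + 2x
```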
(@musm, this kind of low-level performance hacking is probably not a good introductory issue, because it requires a lot of understanding of Julia internals.)
An even/odd splitting like this is likely to increase performance. Modern CPUs can execute instructions out-of-order, and given the number of execution ports they have, it is likely that both streams can be executed simultaneously.
Alternatively, you can execute the two streams simultaneously by using 2-element SIMD vectors. That's very similar to what I suggested earlier, except with a vector size of 2 instead of 4.
As others said, multi-threading will not help here. Threading introduces an overhead of about 1 µs (roughly), whereas instructions are executed in fractions of a nanosecond. Thus you'll need tens of thousands of instructions to recoup the threading overhead.
Sorry about the lack of rigorous benchmarking, I should've known better, but it was late :P
To test this I implemented a macro that does an even/odd split. You can do `Pkg.clone("https://github.com/musm/ParPoly.jl.git")`.
```julia
using Benchmarks, PyPlot
using Base.Math.@horner, ParPoly

x = 2.0
nn = 4:25

t_horner_split = []
t_horner = []
t_horner_split_m = []
t_horner_m = []

for n in nn
    println("running bench for order $(n-1)")
    c = randn(Float64, n)
    @gensym f1
    @gensym f2
    @eval begin
        $f1(x) = @horner_split x $(c...)
        $f2(x) = @horner x $(c...)
        t_hs = @benchmark $f1(x)
        t_h = @benchmark $f2(x)
    end
    push!(t_horner_split_m, Benchmarks.SummaryStatistics(t_hs).elapsed_time_center)
    push!(t_horner_m, Benchmarks.SummaryStatistics(t_h).elapsed_time_center)
end

plot(nn-1, t_horner_split_m, label="horner_split")
plot(nn-1, t_horner_m, label="horner")
legend(loc="upper left")
xlabel("polynomial order")
ylabel("time (ns)")
```
> Alternatively, you can execute the two streams simultaneously by using 2-element SIMD vectors. That's very similar to what I suggested earlier, except with a vector size of 2 instead of 4.

If I understand correctly, do you mean something like the following?
```julia
using SIMD
# c0, ..., c10 as defined above

x1 = 2.0
x2 = x1*x1
x1x2 = Vec{2,Float64}((x1, x2))
xx = Vec{2,Float64}((x2, x2))
cc1 = Vec{2,Float64}((c1, c2))
cc2 = Vec{2,Float64}((c3, c4))
cc3 = Vec{2,Float64}((c5, c6))
cc4 = Vec{2,Float64}((c7, c8))
cc5 = Vec{2,Float64}((c9, c10))
C = cc1 + xx*(cc2 + xx*(cc3 + xx*(cc4 + xx*cc5)))
val = c0 + sum(x1x2*C)
```
This would need to allocate a lot of new variables (I'm not sure such a large number of allocations would outweigh the possible increase in execution speed). I would think that a more parallelized version, à la Estrin's scheme or similar, would benefit more strongly from such an approach.
Again, don't use the broadcast to benchmark; use the scalar form directly. Also, those `Vec`s don't allocate at all.
`@benchmark` is good enough, don't write your own loop...
I also believe that you should be seeing the function-overwrite warning, and you should not ignore it. You are most likely benchmarking only a single function, which is why the scaling in your result doesn't make sense.
@yuyichao Does that look better? I do see the warnings, the methods get overwritten; what's the problem then? (Should probably be a semilogy plot...)
> what's the problem then

The time for the low-order polynomials is way too long. You shouldn't get tens of ns for that unless you are using a really slow CPU. Try:
```julia
for n in nn
    c = randn(Float64, n)
    @gensym f1
    @gensym f2
    @eval begin
        $f1(x) = @horner_split x $(c...)
        $f2(x) = @horner x $(c...)
        t_hs = @benchmark $f1($x)
        t_h = @benchmark $f2($x)
    end
    push!(t_horner_split, median(t_hs).time)
    push!(t_horner, median(t_h).time)
end
```
@musm, good investigation – benchmarking this kind of stuff is tricky and this is how you learn :)
It does seem that you might hit https://github.com/JuliaCI/BenchmarkTools.jl/issues/7, which is why I still use `Benchmarks.jl` instead of `BenchmarkTools.jl` for small functions like this.
Anecdotally, using Benchmarks.jl, `@horner_split_simd` edges out `@horner` and `@horner_split`, especially for larger polynomial sizes. I can't, however, seem to get `@benchmark` working in an `@eval` using Benchmarks.jl (this is after I fixed a different bug in the Benchmarks.jl package).
EDIT: I also implemented a `@horner_split_simd` macro, which does the horner-split method along the lines described above and eschnett's recommendations.
Code: https://github.com/musm/ParPoly.jl/blob/master/bench/bench.jl

@yuyichao, I know you recommended against rolling my own loops, but this was the only way I could coerce BenchmarkTools to work reasonably well (borrowing the trick from simonbyrne).
I've been doing some more benchmarking, and I see very good results with the simple `@horner_split`: in various test cases it's usually 2x faster than the `@horner` method in Base. This can probably be attributed to the fact that the CPU will just execute the even and odd chains out of order. Interestingly, the `@horner_split_simd` macro, according to my tests, shows up as being consistently slower than the `@horner_split` method and the version in Base. Perhaps this can be attributed to the overhead of loading and unloading the vectors in the vector types?

In any case, using the horner_split macro in the libm.jl package, I see very nice perf gains.
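For reference, the even/odd split for a degree-5 polynomial expands to something like this (a hypothetical sketch; ParPoly's actual expansion may differ):

```julia
# Two independent Horner chains in x^2 (even and odd coefficients), merged
# with one extra multiply at the end; the chains can retire out of order.
function horner_split_deg5(x, c0, c1, c2, c3, c4, c5)
    x2 = x * x
    even = muladd(x2, muladd(x2, c4, c2), c0)  # c0 + c2*x^2 + c4*x^4
    odd  = muladd(x2, muladd(x2, c5, c3), c1)  # c1 + c3*x^2 + c5*x^4
    return muladd(x, odd, even)
end
```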
cf. the SIMD version:
```llvm
julia> @code_llvm g(x)

; Function Attrs: uwtable
define double @julia_g_66617(double) #0 {
top:
  %1 = fmul double %0, %0
  %2 = insertelement <2 x double> undef, double %0, i32 0
  %3 = insertelement <2 x double> %2, double %1, i32 1
  %4 = insertelement <2 x double> undef, double %1, i32 0
  %5 = insertelement <2 x double> %4, double %1, i32 1
  %res.i = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> <double 1.400000e+01, double 1.400000e+01>, <2 x double> <double 1.200000e+01, double 1.300000e+01>)
  %res.i.1 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i, <2 x double> <double 1.000000e+01, double 1.100000e+01>)
  %res.i.2 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.1, <2 x double> <double 8.000000e+00, double 9.000000e+00>)
  %res.i.3 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.2, <2 x double> <double 6.000000e+00, double 7.000000e+00>)
  %res.i.4 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.3, <2 x double> <double 4.000000e+00, double 5.000000e+00>)
  %res.i.5 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.4, <2 x double> <double 2.000000e+00, double 3.000000e+00>)
  %res.i.6 = fmul <2 x double> %3, %res.i.5
  %vec_1_1.i = shufflevector <2 x double> %res.i.6, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %res.i.6, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i.7 = extractelement <1 x double> %vec_1.i, i32 0
  %6 = fadd double %res.i.7, 1.000000e+00
  ret double %6
}
```
Serial version of the split macro:
```llvm
julia> @code_llvm f(x)

; Function Attrs: uwtable
define double @julia_f_66618(double) #0 {
top:
  %1 = fmul double %0, %0
  %2 = call double @llvm.fmuladd.f64(double %1, double 1.400000e+01, double 1.200000e+01)
  %3 = call double @llvm.fmuladd.f64(double %1, double %2, double 1.000000e+01)
  %4 = call double @llvm.fmuladd.f64(double %1, double %3, double 8.000000e+00)
  %5 = call double @llvm.fmuladd.f64(double %1, double %4, double 6.000000e+00)
  %6 = call double @llvm.fmuladd.f64(double %1, double %5, double 4.000000e+00)
  %7 = call double @llvm.fmuladd.f64(double %1, double %6, double 2.000000e+00)
  %8 = call double @llvm.fmuladd.f64(double %1, double 1.400000e+01, double 1.300000e+01)
  %9 = call double @llvm.fmuladd.f64(double %1, double %8, double 1.100000e+01)
  %10 = call double @llvm.fmuladd.f64(double %1, double %9, double 9.000000e+00)
  %11 = call double @llvm.fmuladd.f64(double %1, double %10, double 7.000000e+00)
  %12 = call double @llvm.fmuladd.f64(double %1, double %11, double 5.000000e+00)
  %13 = call double @llvm.fmuladd.f64(double %1, double %12, double 3.000000e+00)
  %14 = fmul double %7, %0
  %15 = fadd double %14, 1.000000e+00
  %16 = fmul double %1, %13
  %17 = fadd double %15, %16
  ret double %17
}
```
Seems like we should update the `@horner` in Base with the new one. Thoughts from others?
> Seems like we should update the `@horner` in Base with the new one. Thoughts from others?
Well, technically it would no longer be a Horner scheme (there is a chance that accuracy guarantees are predicated on it being evaluated in this manner).
I think Viral meant updating `@evalpoly`...
Sometime soon I'm going to update the Estrin scheme code (it's not efficient yet) and compare (but I suspect the simple split scheme may still perform best).
For help with this sort of low-level optimisation, you might want to check out https://github.com/carnaval/IACA.jl
Edit: sensible benchmark results posted above. Looks like the overhead of setting up SIMD doesn't pay off until the polynomial is order 18, and I'm guessing very few people are evaluating polynomials greater than order 20... I'd suggest `@horner` for order less than 5, and `@horner_split` for the rest...
Did you try setting up more than 2 independent instruction streams? E.g. using 4 streams might be even faster, given that multiplication and addition have several cycles of latency.
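Concretely, a 4-stream split for degree 7 might look like this (a hypothetical sketch, not code from ParPoly):

```julia
# Four independent Horner chains in x^4, one per residue of the exponent
# mod 4; the final combine costs two extra muladds.
function horner_split4_deg7(x, c0, c1, c2, c3, c4, c5, c6, c7)
    x2 = x * x
    x4 = x2 * x2
    s0 = muladd(x4, c4, c0)   # c0 + c4*x^4
    s1 = muladd(x4, c5, c1)   # c1 + c5*x^4
    s2 = muladd(x4, c6, c2)   # c2 + c6*x^4
    s3 = muladd(x4, c7, c3)   # c3 + c7*x^4
    return muladd(x2, muladd(x, s3, s2), muladd(x, s1, s0))
end
```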
Right now, `@evalpoly(x, coefs...)` uses Horner's rule for real `x`, and a more complicated algorithm for complex `x`, and is performance-critical for evaluation of special functions. However, it seems likely to be faster to try and exploit SIMD instructions for these polynomial evaluations, e.g. with Estrin's algorithm or some other technique.

(The nice thing about code generation / macros is that we can investigate fancy methods for this sort of thing, protected by an `isa(x, Float64)` check that will get optimized out at compile-time.)