JuliaLang / julia

The Julia Programming Language
https://julialang.org/

optimize @evalpoly for SIMD? #18410

Open stevengj opened 8 years ago

stevengj commented 8 years ago

Right now, @evalpoly(x, coefs...) uses Horner's rule for real x, and a more complicated algorithm for complex x, and is performance-critical for evaluation of special functions.

However, it seems likely to be faster to try and exploit SIMD instructions for these polynomial evaluations, e.g. with Estrin's algorithm or some other technique.
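For illustration, roughly what the two schemes look like for a cubic (hypothetical helper names, not the actual Base expansions):

# Horner: one serial chain of fused multiply-adds, essentially what @evalpoly emits for real x today.
horner3(x, p0, p1, p2, p3) = muladd(x, muladd(x, muladd(x, p3, p2), p1), p0)

# Estrin: pair up coefficients so the first two muladds are independent of each other
# and could, in principle, be done as one 2-wide SIMD operation.
function estrin3(x, p0, p1, p2, p3)
    a = muladd(x, p1, p0)   # p0 + p1*x
    b = muladd(x, p3, p2)   # p2 + p3*x
    muladd(x*x, b, a)       # a + b*x^2
end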

(The nice thing about code generation / macros is that we can investigate fancy methods for this sort of thing, protected by an isa(x,Float64) check that will get optimized out at compile-time.)

yuyichao commented 8 years ago

Will the new algorithm still be optimal when used in a vectorized loop?

ViralBShah commented 8 years ago

This may be relevant for the new libm.jl work. Cc @musm

stevengj commented 8 years ago

@yuyichao, probably? Even if the compiler inlines a special-function call like erfinv, I'm skeptical that compilers are powerful enough to vectorize the polynomial evaluation. e.g. I don't see any evidence of this in:

julia> function foo(a)
          s = zero(eltype(a))
          @simd for i in eachindex(a)
              s += @inbounds @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
          end
          return s
       end
foo (generic function with 1 method)

julia> @code_native foo(rand(10))

if I'm reading the asm output correctly.

yuyichao commented 8 years ago

I'm not worrying about inlining special functions that use @evalpoly, I'm worrying about the case where @evalpoly is used directly in the loop.

The code you posted has two problems.

  1. It's accumulating nothing.

    Until https://github.com/JuliaLang/julia/pull/15558 is merged, you need to put the @inbounds on the loop or before the assignment (see the sketch below this list).

  2. It does seem that the loop body is somehow too complicated for LLVM to vectorize in this case (might be worth investigating). However, if you are doing purely elementwise operations (a very valid use case IMHO, and even more relevant now with the broadcast syntax), i.e.

    julia> function foo2(a)
              s = zero(eltype(a))
              @simd for i in eachindex(a)
                  @inbounds a[i] = @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
              end
              return s
          end

    Then LLVM vectorizes it without any problem (3.8.1 at least)
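For reference, the fix for problem 1 is just to move @inbounds before the assignment (a sketch with a hypothetical name; as noted in point 2, this form may still not auto-vectorize):

function foo_fixed(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += @evalpoly(a[i], 2.3, 4.2, 1.6, 8.9, 0.8)
    end
    return s
end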

P.S. I strongly recommend checking code_llvm, even for people who know assembly well. It's much easier to notice the problem with @inbounds in the LLVM IR.

stevengj commented 8 years ago

Ah, thanks. Even so, it might be nice to have a SIMD variant of @evalpoly for use in special functions.

yuyichao commented 8 years ago

for use in special functions.

Or, more generally, cases that are not meant to (or not possible to) be vectorized in a loop; I agree. It would be nice if the new algorithm didn't regress in the vectorized-loop case, but I guess the answer is that it might?

musm commented 8 years ago

Estrin's method ... I got bored in class ~~but the following is already more efficient than @horner~~

Only works for odd-degree polynomials for now. I don't think SIMD is actually being applied, however; could anyone confirm?

#only for odd polynomials
#c in descending order
using BenchmarkTools

x = 2.0
c = collect(Float64, 8:-1:1)
function estrin(x,c)
    n = length(c)
    m = n÷2
    ap = c
    a = similar(c)
    X = x
    while n > 1
        @simd for i = 1:m
            @inbounds a[m+1-i] = muladd(ap[n+1-(2i-1)], X, ap[n+1-(2i-1)-1])
        end
        ap[:] = a[:]
        X = X*X
        n = m
        m = m÷2
    end
    return ap[1]
end

f(x,c) = (@evalpoly x c[1] c[2] c[3] c[4] c[5] c[6] c[7] c[8])

@benchmark estrin(x,c)
@benchmark f(x,c)

There are a lot of other opportunities for parallelism in simple polynomial evaluation, though maybe simpler is better; e.g. all the X*X for the various powers can also be computed in parallel.

yuyichao commented 8 years ago

What are you trying to achieve with globals?

musm commented 8 years ago

My bad, fixed, thanks!

yuyichao commented 8 years ago

There's also unnecessary copy/allocation. ap should be updated in place.

yuyichao commented 8 years ago

Or even better, IIUC, you can just swap the two buffers.

yuyichao commented 8 years ago

I don't think you can get vectorization with non-unit stride (there's a trick to kind of get it working, though).

I just realized this is the evalpoly issue (I thought it was one of the special/libm function ones...). In this case any allocation is already a huge loss: @evalpoly generates an expression at parse time, and the replacement should do that too.
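For reference, a rough sketch of what "generates an expression at parse time" means (@horner_sketch is a hypothetical name, not the actual Base implementation):

# The macro walks the coefficient list once at macro-expansion time and emits a fixed,
# loop-free nest of muladds, so there is no runtime loop and nothing to allocate.
macro horner_sketch(x, p...)
    ex = esc(p[end])
    for i in length(p)-1:-1:1
        ex = :(muladd(t, $ex, $(esc(p[i]))))
    end
    :(let t = $(esc(x)); $ex; end)
end

# @horner_sketch(y, 2.3, 4.2, 1.6) expands to roughly
#     let t = y; muladd(t, muladd(t, 1.6, 4.2), 2.3); end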

musm commented 8 years ago

Would

    @inbounds ap[1:m] = a[1:m]

swap the buffers? Or am I misunderstanding what you are saying?

yuyichao commented 8 years ago

ap, a = a, ap

ap[1:m] = a[1:m] still allocates a new array and copies the data. ap[:] = a won't allocate, but will still copy.
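To spell out the difference (illustrative; assumes a and ap are plain Vectors of the same length):

ap[1:m] = a[1:m]   # allocates a temporary for the slice a[1:m], then copies it into ap
ap[:]   = a        # no allocation, but still copies every element
ap, a   = a, ap    # just swaps the two bindings: no allocation, no copy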

stevengj commented 8 years ago

@musm, rather than being "more efficient", in a quick benchmark I find your estrin function to be about 50x slower than the inlined polynomial evaluation of @evalpoly for this degree-7 polynomial. This is not surprising, because the difference between inlined polynomial evaluation and any kind of runtime loop is vast (in addition to things like inadvertent array allocation that handicap your implementation).

This issue is about improving the inlined @evalpoly; let's not side-track it into a discussion of how to optimize some runtime-loop code.

musm commented 8 years ago

apologies

stevengj commented 8 years ago

No worries @musm, I just wanted to nip the digression in the bud.

musm commented 8 years ago

On the topic of improving @evalpoly

However, it seems likely to be faster to try and exploit SIMD instructions for these polynomial evaluations, e.g. with Estrin's algorithm or some other technique.

I have a decent implementation of Estrin's algorithm as a regular function (for odd-degree polynomials, still). I have two versions: one without SIMD, which is 10x slower than @horner, and one that takes advantage of SIMD, which is 40x slower, based on benchmarks.

EDIT: It's not exactly clear how one could turn this algorithm into a macro that would expand to avoid runtime loops and take advantage of SIMD operations. If I understand things, would it be possible to expand the macro to something like the following pseudocode? Yes it is, and I have the code for this:

evaluate : c8 x^7 + c7 x^6 + ... + c1

@simd begin
a1 = c2*x + c1 
a2 = c4*x + c3
a3 = c6*x + c5
a4 = c8*x + c7
end
x = x*x
@simd begin
a1 = a2*x + a1 
a2 = a4*x + a3
end
x = x*x
@simd begin
a1 = a2*x + a1 
end
return a1
yuyichao commented 8 years ago

@simd used this way is a no-op. Getting vectorization here requires writing the code in a way that can be recognized by the LLVM passes we add with -O3, or using VecElement explicitly.

eschnett commented 8 years ago

The SIMD package provides the respective operations, based on Base.VecElement:

using SIMD
function evalpoly{T}(c1,c2,c3,c4,c5,c6,c7,c8, x::T)
    c2468 = Vec{4,T}((c2,c4,c6,c8))
    c1357 = Vec{4,T}((c1,c3,c5,c7))
    a1234 = c2468 * x + c1357
    x = x^2
    a24 = Vec{2,T}((a1234[2], a1234[4]))
    a13 = Vec{2,T}((a1234[1], a1234[3]))
    a12 = a24 * x + a13
    x = x^2
    a2 = Vec{1,T}((a12[2],))
    a1 = Vec{1,T}((a12[1],))
    r = a2 * x + a1
    r[1]
end

evalpoly(1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0, 0.5)
@code_native evalpoly(1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0, 0.5)

If you look at the generated code, you'll see immediately why this turns out to be very slow:

julia> @code_native evalpoly(1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0, 0.5)
        .section        __TEXT,__text,regular,pure_instructions
Filename: REPL[20]
        pushq   %rbp
        movq    %rsp, %rbp
        vmovsd  16(%rbp), %xmm8         ## xmm8 = mem[0],zero
Source line: 121
        vunpcklpd       %xmm7, %xmm5, %xmm5 ## xmm5 = xmm5[0],xmm7[0]
        vunpcklpd       %xmm3, %xmm1, %xmm1 ## xmm1 = xmm1[0],xmm3[0]
        vinsertf128     $1, %xmm5, %ymm1, %ymm1
Source line: 121
        vunpcklpd       %xmm6, %xmm4, %xmm3 ## xmm3 = xmm4[0],xmm6[0]
        vunpcklpd       %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0],xmm2[0]
        vinsertf128     $1, %xmm3, %ymm0, %ymm0
Source line: 114
        vbroadcastsd    %xmm8, %ymm2
Source line: 567
        vmulpd  %ymm2, %ymm1, %ymm1
Source line: 567
        vaddpd  %ymm1, %ymm0, %ymm0
Source line: 5
        vmulsd  %xmm8, %xmm8, %xmm1
Source line: 121
        vextractf128    $1, %ymm0, %xmm2
Source line: 121
        vunpckhpd       %xmm2, %xmm0, %xmm3 ## xmm3 = xmm0[1],xmm2[1]
Source line: 121
        vunpcklpd       %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0],xmm2[0]
Source line: 114
        vmovddup        %xmm1, %xmm2    ## xmm2 = xmm1[0,0]
Source line: 567
        vmulpd  %xmm3, %xmm2, %xmm2
Source line: 567
        vaddpd  %xmm2, %xmm0, %xmm0
Source line: 9
        vmulsd  %xmm1, %xmm1, %xmm1
Source line: 121
        vpermilpd       $1, %xmm0, %xmm2 ## xmm2 = xmm0[1,0]
Source line: 567
        vmulsd  %xmm2, %xmm1, %xmm1
Source line: 567
        vaddsd  %xmm1, %xmm0, %xmm0
Source line: 13
        popq    %rbp
        vzeroupper
        retq

There are fewer actual arithmetic operations (vmul, vadd), but there are many additional operations to move data between vector lanes.

The first six operations (vunpcklpd and vinsertf128) are merely setting up the constants c1...c8; these would normally be known at compile time, making those instructions superfluous. The other vextract, vunpck, vmov, and vperm instructions remain necessary.
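To illustrate that last point, here is the same evaluation with the coefficients written as compile-time literals (a sketch assuming the SIMD.jl package; evalpoly_const is a hypothetical name). The c1...c8 setup shuffles should fold away, while the cross-lane moves between levels remain:

using SIMD
function evalpoly_const(x::Float64)
    a1234 = Vec{4,Float64}((2.0,4.0,6.0,8.0)) * x + Vec{4,Float64}((1.0,3.0,5.0,7.0))
    x2 = x*x
    a12 = Vec{2,Float64}((a1234[2], a1234[4])) * x2 + Vec{2,Float64}((a1234[1], a1234[3]))
    muladd(x2*x2, a12[2], a12[1])   # evaluates 1 + 2x + 3x^2 + ... + 8x^7
end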

musm commented 8 years ago

Thanks @eschnett. I created a macro that expands out the operations in Estrin's algorithm to explore the possible benefits of vectorization for polynomial evaluation using your SIMD package. See: https://github.com/musm/Estrin.jl (I might rename it later)

eschnett commented 8 years ago

@musm Nice!

The expensive part is currently switching between different vector sizes. The following might lead to a faster algorithm:

However, if this evalpoly is called inside a loop, then I expect things to be much faster if evalpoly is kept scalar, and the loop is vectorized.

musm commented 8 years ago

@eschnett I'll try to take those points into consideration in an improved implementation.

Another very simple option for parallelism in @evalpoly is to do an even/odd split of the polynomial, at the cost of one additional multiplication:


using BenchmarkTools
using Base.Math.@horner

c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10 = (1:11...)
x = 2.0

function f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
    x2 = x*x
    begin # ideally we should be able to tell the compiler to run these two in parallel; is threads the right approach here? I.e. run these two on separate threads
        cc1 = @horner x2 c1 c3 c5 c7 c9    # the odds
        cc2 = @horner x2 c2 c4 c6 c8 c10  # the evens
    end
    c0 + x*cc1 + x2 * cc2
end

function f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
    @horner x c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
end

g1(x) = f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)
g2(x) = f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)

xx = 5.0*randn(1_000_000)
@show @benchmark g1.(xx)
@show @benchmark g2.(xx)

On my computer with -O3 this gives

julia> @show @benchmark f1(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)

BenchmarkTools.Trial:
  samples:          36
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  53.41 mb
  allocs estimate:  3000055
  minimum time:     136.63 ms (3.22% GC)
  median time:      140.42 ms (3.49% GC)
  mean time:        141.52 ms (3.78% GC)
  maximum time:     155.14 ms (4.54% GC)

julia> @show @benchmark f2(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10)

BenchmarkTools.Trial:
  samples:          13
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  328.07 mb
  allocs estimate:  21000055
  minimum time:     391.01 ms (7.11% GC)
  median time:      399.62 ms (7.22% GC)
  mean time:        415.16 ms (10.21% GC)
  maximum time:     499.06 ms (24.32% GC)

This is about 3x faster.

Is the compiler automatically able to determine that the two horner evaluations can be run concurrently in this example?

ViralBShah commented 8 years ago

Julia will not yet use multiple threads for concurrent evaluation, but perhaps this is benefiting due to better codegen.

yuyichao commented 8 years ago

It's very unlikely cheap operations like this will benefit from threading at all.

ViralBShah commented 8 years ago

@musm The time difference seems to be due partially to much higher gc. Can you either call gc() before each run, or do gc_disable() and run the tests?

ViralBShah commented 8 years ago

Also, I wonder if having all the variables in global scope affects this. Perhaps wrap everything in a function?

yuyichao commented 8 years ago

g1.($xx) instead of g1.(xx) should get rid of the global variable, although I doubt it matters much. And it's better not to use the broadcast form, since, as mentioned before, that's the form that can already be loop-vectorized easily with the old/scalar form. Using the broadcast form also means that you are sensitive to many unrelated factors, including memory allocation, garbage collection, and memory bandwidth.

yuyichao commented 8 years ago

although I doubt it matters much.

I take that back, I didn't realize that all the c*s are in global scope. Yeah, that's terrible.....

KristofferC commented 8 years ago

If you see that you are allocating 300 MB in a benchmark that should just evaluate a polynomial, then you should look more closely at how you are benchmarking. The absolute timings you are getting are also unreasonable given some napkin math on CPU frequency and the number of ops required.

stevengj commented 8 years ago

Because the coefficients are known at parse time in the macro, it should be able to decide on the optimal vector size at parse time and/or compile-time. There should be no runtime switching.

(There can be checks like isa(z, Float64), but my understanding is that this check ends up getting evaluated at compile-time and there is no runtime cost.)
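A minimal sketch of that guard pattern (hypothetical names; the two branches are trivial stand-ins for the two expansions):

fast_path(x)    = 2x   # stand-in for a SIMD-friendly expansion
generic_path(x) = 2x   # stand-in for the plain Horner expansion
evalpick(x) = isa(x, Float64) ? fast_path(x) : generic_path(x)

# Once evalpick is specialized on x::Float64, the isa test is a compile-time constant;
# @code_llvm evalpick(1.0) should show straight-line code with no runtime branch.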

stevengj commented 8 years ago

(@musm, this kind of low-level performance hacking is probably not a good introductory issue, because it requires a lot of understanding of Julia internals.)

eschnett commented 8 years ago

An even/odd splitting like this is likely to increase performance. Modern CPUs can execute instructions out-of-order, and given the number of execution ports they have, it is likely that both streams can be executed simultaneously.

Alternatively, you can execute the two streams simultaneously by using 2-element SIMD vectors. That's very similar to what I suggested earlier, except with a vector size of 2 instead of 4.

As others said, multi-threading will not help here. Threading introduces an overhead of about 1 us (roughly), whereas instructions are executed in fractions of a nanosecond. Thus you'll need tens of thousands of instructions to recoup the threading overhead.

musm commented 8 years ago

Sorry about the lack of rigorous benchmarking, I should've known better, but it was late :P

To test this I implemented a macro that does an even/odd split. You can do Pkg.clone("https://github.com/musm/ParPoly.jl.git").

using Benchmarks, PyPlot
using Base.Math.@horner, ParPoly

x = 2.0
nn = 4:25
t_horner_split = []
t_horner = []
xt_horner_split = []
xt_horner = []

t_horner_split_m = []
t_horner_m = []
for n in nn
    println("running bench for order $(n-1)")
    c = randn(Float64,n)

    @gensym f1
    @gensym f2
    @eval begin
        $f1(x) = @horner_split x $(c...)
        $f2(x) = @horner x $(c...)
        t_hs   = @benchmark $f1(x)
        t_h    = @benchmark $f2(x)
    end
    push!(t_horner_split_m, Benchmarks.SummaryStatistics(t_hs).elapsed_time_center)
    push!(t_horner_m, Benchmarks.SummaryStatistics(t_h).elapsed_time_center)

end
plot(nn-1, t_horner_split_m,label="horner_split")
plot(nn-1, t_horner_m,label="horner")
legend(loc="upper left")
xlabel("polynomial order")
ylabel("time (ns)")

Alternatively, you can execute the two streams simultaneously by using 2-element SIMD vectors. That's very similar to what I suggested earlier, except with a vector size of 2 instead of 4.

If I understand correctly, do you mean something like the following?


using SIMD   # needed for Vec; c0..c10 are assumed to be the coefficients from the example above

x1 = 2.0
x2 = x1*x1
x1x2 = Vec{2,Float64}((x1,x2))
xx   = Vec{2,Float64}((x2,x2))
cc1  = Vec{2,Float64}((c1,c2))
cc2  = Vec{2,Float64}((c3,c4))
cc3  = Vec{2,Float64}((c5,c6))
cc4  = Vec{2,Float64}((c7,c8))
cc5  = Vec{2,Float64}((c9,c10))
C = cc1 + xx*(cc2 + xx*(cc3 + xx*(cc4 + xx*cc5)))
val = c0 + sum(x1x2*C)

This would need to allocate a lot of new variables (I'm not sure whether such a large number of allocations would outweigh the possible increase in execution speed). I would think that a more parallelized version, à la Estrin's scheme or similar, would benefit more strongly from such an approach.

yuyichao commented 8 years ago

Again, don't use the broadcast form to benchmark; use the scalar form directly. Also, those Vecs don't allocate at all.

yuyichao commented 8 years ago

@benchmark is good enough, don't write your own loop..................................................

yuyichao commented 8 years ago

I also believe that you should be seeing the function overwrite warning and you should not ignore it. You are most likely benchmarking only a single function which is why the scaling in your result doesn't make sense.

musm commented 8 years ago

@yuyichao Does that look better? I do see them, the methods get overwritten, what's the problem then? Should probably be semilogy plot...

yuyichao commented 8 years ago

what's the problem then

#265

yuyichao commented 8 years ago

The time for the low-order polynomials is way too long. You shouldn't get tens of ns for that unless you are using a really slow CPU. Try:

for n in nn
    c = randn(Float64,n)
    @gensym f1
    @gensym f2
    @eval begin
        $f1(x) = @horner_split x $(c...)
        $f2(x) = @horner x $(c...)
        t_hs = @benchmark $f1($x)
        t_h  = @benchmark $f2($x)
    end
    push!(t_horner_split, median(t_hs).time)
    push!(t_horner, median(t_h).time)
end
StefanKarpinski commented 8 years ago

@musm, good investigation – benchmarking this kind of stuff is tricky and this is how you learn :)

yuyichao commented 8 years ago

It does seem that you might hit https://github.com/JuliaCI/BenchmarkTools.jl/issues/7, which is why I still use Benchmarks.jl instead of BenchmarkTools.jl for small functions like this...

musm commented 8 years ago

Anecdotally, using Benchmarks.jl, @horner_split_simd edges out @horner and @horner_split, especially for larger polynomial sizes. I can't, however, seem to get @benchmark working in an @eval with Benchmarks.jl (this is after I fixed a different bug in the Benchmarks.jl package).

EDIT: I also implemented a @horner_split_simd macro, which does the horner_split method along the lines described above, following eschnett's recommendations.

Code: https://github.com/musm/ParPoly.jl/blob/master/bench/bench.jl. @yuyichao, I know you recommended against rolling my own loops, but this was the only way I could coerce BenchmarkTools to work reasonably well (borrowing the trick from simonbyrne).

(benchmark plot)

musm commented 8 years ago

I've been doing some more benchmarking and I see very good results with the simple @horner_split: in various test cases it's usually 2x faster than the @horner method in Base. This can probably be attributed to the fact that the CPU will just execute the even and odd chains out of order. Interestingly, the @horner_simd macro, according to my tests, shows up as consistently slower than both the @horner_split method and the version in Base. Perhaps this can be attributed to the overhead of loading data into and out of the vector types?

In any case, using the @horner_split macro in the libm.jl package, I see very nice perf gains.

cf. the SIMD version:

julia> @code_llvm g(x)

; Function Attrs: uwtable
define double @julia_g_66617(double) #0 {
top:
  %1 = fmul double %0, %0
  %2 = insertelement <2 x double> undef, double %0, i32 0
  %3 = insertelement <2 x double> %2, double %1, i32 1
  %4 = insertelement <2 x double> undef, double %1, i32 0
  %5 = insertelement <2 x double> %4, double %1, i32 1
  %res.i = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> <double 1.400000e+01, double 1.400000e+01>, <2 x double> <double 1.200000e+01, double 1.300000e+01>)
  %res.i.1 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i, <2 x double> <double 1.000000e+01, double 1.100000e+01>)
  %res.i.2 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.1, <2 x double> <double 8.000000e+00, double 9.000000e+00>)
  %res.i.3 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.2, <2 x double> <double 6.000000e+00, double 7.000000e+00>)
  %res.i.4 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.3, <2 x double> <double 4.000000e+00, double 5.000000e+00>)
  %res.i.5 = call <2 x double> @llvm.fmuladd.v2f64(<2 x double> %5, <2 x double> %res.i.4, <2 x double> <double 2.000000e+00, double 3.000000e+00>)
  %res.i.6 = fmul <2 x double> %3, %res.i.5
  %vec_1_1.i = shufflevector <2 x double> %res.i.6, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %res.i.6, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i.7 = extractelement <1 x double> %vec_1.i, i32 0
  %6 = fadd double %res.i.7, 1.000000e+00
  ret double %6
}

Serial version of the split macro

julia> @code_llvm f(x)

; Function Attrs: uwtable
define double @julia_f_66618(double) #0 {
top:
  %1 = fmul double %0, %0
  %2 = call double @llvm.fmuladd.f64(double %1, double 1.400000e+01, double 1.200000e+01)
  %3 = call double @llvm.fmuladd.f64(double %1, double %2, double 1.000000e+01)
  %4 = call double @llvm.fmuladd.f64(double %1, double %3, double 8.000000e+00)
  %5 = call double @llvm.fmuladd.f64(double %1, double %4, double 6.000000e+00)
  %6 = call double @llvm.fmuladd.f64(double %1, double %5, double 4.000000e+00)
  %7 = call double @llvm.fmuladd.f64(double %1, double %6, double 2.000000e+00)
  %8 = call double @llvm.fmuladd.f64(double %1, double 1.400000e+01, double 1.300000e+01)
  %9 = call double @llvm.fmuladd.f64(double %1, double %8, double 1.100000e+01)
  %10 = call double @llvm.fmuladd.f64(double %1, double %9, double 9.000000e+00)
  %11 = call double @llvm.fmuladd.f64(double %1, double %10, double 7.000000e+00)
  %12 = call double @llvm.fmuladd.f64(double %1, double %11, double 5.000000e+00)
  %13 = call double @llvm.fmuladd.f64(double %1, double %12, double 3.000000e+00)
  %14 = fmul double %7, %0
  %15 = fadd double %14, 1.000000e+00
  %16 = fmul double %1, %13
  %17 = fadd double %15, %16
  ret double %17
}
ViralBShah commented 8 years ago

Seems like we should update the @horner in Base with the new one. Thoughts from others?

simonbyrne commented 8 years ago

Seems like we should update the @horner in Base with the new one. Thoughts from others?

Well, technically it would no longer be a Horner scheme (there is a chance that accuracy guarantees are predicated on it being evaluated in this manner)

musm commented 8 years ago

I think Viral meant update @evalpoly... Sometime soon I'm going to update the Estrin scheme code (it's not efficient yet) and compare (but I suspect the simple split scheme may still perform best).

simonbyrne commented 8 years ago

For help with these sort of low-level optimisations, you might want to check out https://github.com/carnaval/IACA.jl

musm commented 8 years ago

Edit: sensible benchmark results posted above. Looks like the overhead of setting up SIMD doesn't pay off until the polynomial is order 18, and I'm guessing very few people are evaluating polynomials greater than order 20... I'd suggest @horner for orders less than 5, and @horner_split for the rest.

eschnett commented 8 years ago

Did you try setting up more than 2 independent instruction streams? E.g. using 4 streams might be even faster, given that multiplication and addition have several cycles of latency.
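For what it's worth, a hand-written 4-stream version for a degree-11 polynomial might look like the sketch below (horner_split4 is a hypothetical name, in the spirit of the 2-stream @horner_split; not part of ParPoly or Base):

function horner_split4(x, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11)
    x2 = x*x; x4 = x2*x2
    # four independent Horner chains in x^4, one per residue class of the exponent mod 4
    p0 = muladd(x4, muladd(x4, c8,  c4), c0)
    p1 = muladd(x4, muladd(x4, c9,  c5), c1)
    p2 = muladd(x4, muladd(x4, c10, c6), c2)
    p3 = muladd(x4, muladd(x4, c11, c7), c3)
    # recombine: p0 + p1*x + p2*x^2 + p3*x^3
    muladd(x2, muladd(x, p3, p2), muladd(x, p1, p0))
end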