andreasnoack / FastPolynomialRoots.jl

Fast and backward stable computation of roots of polynomials in Julia
MIT License

Use LAPACK for polynomials ≤ degree ~450 (real), ~250 (complex)? #3

Open dlfivefifty opened 9 years ago

dlfivefifty commented 9 years ago

In my tests, LAPACK's zhseqr/dhseqr is faster unless the polynomials are of high degree, using my wrapper

https://github.com/ApproxFun/ApproxFun.jl/blob/master/src/LinearAlgebra/hesseneigs.jl

Maybe it's worth wrapping these in AMVW for such cases? Unless I'm missing something.

andreasnoack commented 9 years ago

I think it depends on the scope of this package. For a roots function in a general polynomial package I completely agree with you, but this package was mainly meant as a wrapper for the code in the paper. I think the best thing would be to let the Polynomials package depend on this package and use the AMVW solver when the degree is very high.
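A minimal sketch of that kind of degree-based dispatch, assuming the names used later in this thread (`rootshess` for the LAPACK companion-matrix route, `AMVW.rootsAMVW` for the solver wrapped here); the cutoff value is an assumption taken from the issue title, not a measured optimum:

```julia
# Hypothetical hybrid roots function: LAPACK's Hessenberg eigensolver on the
# companion matrix for low degrees, the AMVW solver for high degrees.
# The default cutoff is an assumption based on the numbers in this thread.
function roots_hybrid(c::Vector{Float64}; cutoff::Int = 450)
    if length(c) - 1 <= cutoff
        rootshess(c)        # LAPACK dhseqr on the companion matrix
    else
        AMVW.rootsAMVW(c)   # backward stable AMVW solver
    end
end
```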

thomasmach commented 9 years ago

I am confused by this observation. In my tests comparing the Fortran code of AMVW with Fortran LAPACK, the LAPACK code is only faster for degrees smaller than 20. Why do the tests in Fortran and in Julia differ so much? And do your LAPACK tests include the time for assembling the matrix?

dlfivefifty commented 9 years ago
Hmm, I reran the tests and am getting very different numbers; now the critical point seems to be about 75. I suspect my laptop was in low-battery mode yesterday, which can be unpredictable. The LAPACK routine also has a weird jump in time at n~75, where it goes from being about 30% faster to 30% slower. I suspect this is an OpenBLAS quirk where it switches modes too early, but setting blas_set_num_threads(1) didn't seem to have an effect.

```julia
gc_disable()
p = [1.0:80]

@time for k = 1:30 rootsjul(p) end
@time for k = 1:30 rootshess(p) end
@time for k = 1:30 AMVW.rootsAMVW(p) end
```

```julia
import Base.BLAS: blas_int, BlasChar, BlasInt

for (hseqr, elty) in ((:zhseqr, :Complex128),)
    @eval function hesseneigvals(M::Matrix{$elty})
        A = vec(M)
        N = size(M, 1)
        info = blas_int(0)

        ilo = blas_int(1); ihi = blas_int(N)
        ldh = blas_int(N); ldz = blas_int(N); lwork = blas_int(N)

        z    = zero($elty)
        work = Array($elty, N*N)
        w    = Array($elty, N)

        Ec = 'E'  # eigenvalues only
        Nc = 'N'  # no Schur vectors
        ccall(($(BLAS.blasfunc(hseqr)), LAPACK.liblapack),
              Void,
              (Ptr{BlasChar}, Ptr{BlasChar},
               Ptr{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty}, # A
               Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty},                 # z
               Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt}, Ptr{BlasInt}),
              &Ec, &Nc, &N, &ilo, &ihi, A, &ldh, w, &z, &ldz, work, &lwork, &info)
        w
    end
end
```

```julia
for (hseqr, elty) in ((:dhseqr, :Float64),)
    @eval function hesseneigvals(M::Matrix{$elty})
        A = vec(M)
        N = size(M, 1)
        info = blas_int(0)

        ilo = blas_int(1); ihi = blas_int(N)
        ldh = blas_int(N); ldz = blas_int(N)

        lwork = blas_int(-1)  # first pass is a workspace query

        z    = zero($elty)
        work = Array($elty, 1)
        wr   = Array($elty, N)
        wi   = Array($elty, N)

        Ec = 'E'  # eigenvalues only
        Nc = 'N'  # no Schur vectors
        for i = 1:2
            ccall(($(BLAS.blasfunc(hseqr)), LAPACK.liblapack),
                  Void,
                  (Ptr{BlasChar}, Ptr{BlasChar},
                   Ptr{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty}, # A
                   Ptr{BlasInt}, Ptr{$elty}, Ptr{$elty}, Ptr{$elty},     # z
                   Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt}, Ptr{BlasInt}),
                  &Ec, &Nc, &N, &ilo, &ihi, A, &ldh, wr, wi, &z, &ldz, work, &lwork, &info)

            if lwork < 0  # allocate the optimal workspace, then run for real
                lwork = blas_int(real(work[1]))
                work  = Array($elty, lwork)
            end
        end

        wr + im*wi
    end
end
```

```julia
function companion_matrix{T}(c::Vector{T})
    n = length(c) - 1
    if n == 0
        T[]
    else
        A = zeros(T, n, n)
        @simd for k = 1:n
            @inbounds A[k,end] = -c[k]/c[end]
        end
        @simd for k = 2:n
            @inbounds A[k,k-1] = one(T)
        end
        A
    end
end

rootshess(c) = hesseneigvals(companion_matrix(c))
rootsjul(c)  = eigvals(companion_matrix(c))
```
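As a quick sanity check of the wrappers above (a sketch, using the `rootshess`/`rootsjul` names just defined), the roots of a polynomial with known zeros can be compared; coefficients are in ascending order, matching `companion_matrix`:

```julia
# p(x) = (x - 1)(x - 2) = 2 - 3x + x^2, coefficients in ascending order
p = [2.0, -3.0, 1.0]
r1 = sort(real(rootshess(p)))   # LAPACK dhseqr route
r2 = sort(real(rootsjul(p)))    # generic eigvals route
# both should be approximately [1.0, 2.0]
```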


andreasnoack commented 9 years ago

I just tried this on my machine and I also get a crossover at 75. This was the same for OpenBLAS and MKL (which is not surprising because the QR algorithm is not that BLAS heavy).

[plot: roottimings]

The dhseqr routine is pretty wild and it appears to change algorithm at exactly n=75. See the "further details" section. From the graph, I conjecture that the crossover point could be moved quite a bit (log scale) to the right if the point of algorithm change is modified.

andreasnoack commented 9 years ago

I've tried to change the default in LAPACK's dhseqr so that it uses the small-problem algorithm until the size is 500. This gives the following plot

[plot: roottimings2]

and the crossover is somewhere around 120-130.

@thomasmach The timings include the construction of the companion matrix.

alanedelman commented 9 years ago

Encouragement: glad you guys are looking into this. While you're at it, maybe see whether the answers agree as well as the numerics would lead you to expect?
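A sketch of such an agreement check, assuming the `rootshess`/`rootsjul` definitions from earlier in the thread; the sort key and the tolerance are assumptions, and pairing sorted complex roots this way is only reliable when the roots are reasonably well separated:

```julia
# Compare the two solvers' root sets after sorting by (real, imag).
# The tolerance is an assumption, not a rigorous backward-error bound.
function roots_agree(c::Vector{Float64}; tol = 1e-8)
    r1 = sort(rootshess(c), by = z -> (real(z), imag(z)))
    r2 = sort(rootsjul(c),  by = z -> (real(z), imag(z)))
    maximum(abs(r1 - r2)) < tol
end
```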

thomasmach commented 9 years ago

For some reason the LAPACK timings differ by a relative factor of two from my Fortran tests for AMVW. I don't know why and will have a look into this next week.

andreasnoack commented 9 years ago

The scaling coefficients of the last plots are

thomasmach commented 9 years ago

@andreasnoack Does your last comment mean that dhseqr is not in O(n^3)?

For the complex case the crossover point is much smaller, since AMVW's complex single shift version is only slightly slower than its real version, while zhseqr is almost 3 times slower than dhseqr.

andreasnoack commented 9 years ago

@thomasmach No, I don't think so. The asymptotic behavior just takes a little longer to kick in. There is a significant second-order effect in the fit of the dhseqr line, and if I extend the degree of the polynomial, the slope approaches 3.

You are also completely right about the complex case. The plot for the complex case looks like this:

[plot: roottimings3]

and the crossover is at 40. Why is your real solver relatively slower than LAPACK's?

thomasmach commented 9 years ago

Why is the complex code of AMVW almost as fast as the real code?

  1. LAPACK's complex arithmetic is about twice as expensive as the real arithmetic. We, however, use complex rotations represented by 3 reals and real rotations represented by 2 reals. Thus, in our case the complex arithmetic is, (very) roughly speaking, 1.5 times as expensive as the real arithmetic.
  2. The real double shift code requires 7 turnovers compared to 3 turnovers in the complex single shift code. This means 1/7 more arithmetic operations.

Additionally, the real double shift code needs slightly more iterations to converge.