Closed staticfloat closed 12 years ago
I agree that we should want to use Accelerate. However, the right thing may be to only use BLAS from Accelerate, and download LAPACK. LAPACK is still under active development, and it would be nice to have all our users on the same version of LAPACK.
cc: @dmbates
It would not be a big deal to revert to using `dsytri` rather than `dsytri2`. In the long run it would be good to keep up with developments in LAPACK, but it is not worthwhile losing access to Accelerate for this one, infrequently used, subroutine.
I'm not sure why a comparison of results of `peakflops` would be relevant. That just measures the speed of `dgemm` calculations.
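For context, a `peakflops`-style number is just a timed `dgemm`: time one large matrix multiply and divide the operation count by the elapsed time. A rough sketch of that measurement (in Python with NumPy rather than Julia, assuming NumPy is linked against an optimized BLAS) would be:

```python
import time
import numpy as np

def approx_peakflops(n=2000, trials=3):
    """Estimate sustained GFLOPS by timing an n-by-n matrix multiply,
    which NumPy dispatches to the underlying BLAS dgemm."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b                      # BLAS dgemm call
        best = min(best, time.perf_counter() - t0)
    # a dense n-by-n matmul performs roughly 2*n^3 floating-point operations
    return 2 * n**3 / best / 1e9

print(f"~{approx_peakflops():.1f} GFLOPS")
```

Because the whole measurement reduces to one `dgemm`, it says nothing about the rest of LAPACK, which is the point being made above.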
`peakflops` is just what I had seen used previously. Do we have a better method of testing overall LAPACK performance?
@staticfloat Perhaps I misunderstood the purpose. A call to `peakflops` measures the speed of one of the BLAS routines. I thought you wanted to compare `dsytri` and `dsytri2`. I really don't expect there to be much of a difference but, as shown in the discussion of #1312, you don't really know until you check.
If Accelerate is not substantially faster than OpenBLAS on OS X, is there a reason for using Accelerate? I am not opposed to Accelerate - I'm just asking.
Honestly, I don't know if there is any speed difference between the two. Since I have experience building Julia on OS X, I was going to try to arrange a linear algebra shootout, but then discovered this incompatibility, which crept in a few weeks ago.
I think my last sentence above was written poorly; it should have been: "I propose not littering Julia source with #ifdefs to retain compatibility with two versions of LAPACK, therefore I propose hanging Accelerate out to dry".
However, I am going to hold off on saying that until we can prove that OpenBLAS is at least comparable to Accelerate in terms of performance. @dmbates, do you have a list of lapack functions that I can run that we should concern ourselves with, performance-wise? I'll do the shootout if you tell me how.
There are several Lapack-based decompositions and operations tested in test/lapack.jl. It would not be worthwhile timing these operations on 10 by 10 matrices as they don't take sufficient time to provide meaningful results. Perhaps by bumping the sizes to 1000 by 1000 for square matrices and 10000 by 100 for rectangular it would become more meaningful.
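A minimal timing harness along those lines, with the larger sizes suggested above, could be sketched as follows (in Python/SciPy rather than against test/lapack.jl; the set of decompositions is an illustrative subset):

```python
import time
import numpy as np
import scipy.linalg as la

def time_op(f, *args, trials=3):
    """Return the best-of-N wall-clock time for f(*args)."""
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        f(*args)
        best = min(best, time.perf_counter() - t0)
    return best

square = np.random.rand(1000, 1000)
spd = square @ square.T + 1000 * np.eye(1000)  # symmetric positive definite
tall = np.random.rand(10000, 100)

for name, f, arg in [("lu", la.lu, square),
                     ("cholesky", la.cholesky, spd),
                     ("qr (square)", la.qr, square),
                     ("qr (tall)", la.qr, tall)]:
    print(f"{name:12s} {time_op(f, arg):.3f} s")
```

Running the same script against builds linked to different BLAS/LAPACK libraries gives directly comparable numbers for each decomposition.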
Alright. I'll take a look at this tonight, see if I can put something together. -E
On my machine at least (the "official" benchmark machine), Accelerate and MKL (via Matlab) used to both be significantly faster than OpenBLAS, but recent versions of OpenBLAS have closed and even reversed that gap — and will probably keep getting better. I think it's desirable to continue to support Accelerate easily, however, if it's not too difficult, which it doesn't seem to be.
@StefanKarpinski I know that there has been discussion previously about how difficult making the entire performance suite run on different machines would be, but for the specific test of LAPACK/BLAS performance, how did you put that together?
Not sure what you mean — which LAPACK/BLAS performance test? I was mainly just talking about DGEMM performance, as measured by the rand_mat_mul benchmark, although OpenBLAS performance has also improved on other primitives too.
Oh, I assumed that when you said "Accelerate and MKL (via Matlab) used to both be significantly faster than OpenBLAS" you had some method of testing more than just `dgemm`. I don't know which BLAS routines are the most important to test; I'm assuming we should test more than just `dgemm` to really measure the difference between two implementations?
It was kind of an anecdotal observation, definitely not broadly tested. test/lapack.jl is probably a decent performance suite as well, although the problems may not be big enough.
OpenBLAS was slower than Accelerate because it did not have Sandy Bridge support until recently. In general, OpenBLAS will always be slower on newer architectures as it catches up, while MKL will have support for new architectures from day one.
Much of LAPACK performance depends on DGEMM. If we were to set up a benchmark, I would try Accelerate (BLAS + LAPACK) vs. OpenBLAS (BLAS + LAPACK) vs. Accelerate BLAS + Netlib LAPACK. My guess is that the performance difference from using Accelerate or MKL LAPACK is not noticeable, except perhaps in a few cases. We can even check the Apple opensource repositories to see if they have changed anything in LAPACK.
What would be interesting to see is performance of matrices with weird aspect ratios (tall and skinny, short and fat), etc. BLAS performance varies dramatically. Also try decompositions of weird shapes.
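To illustrate the aspect-ratio point, one could time `dgemm` across shapes like so (a Python/NumPy sketch; the particular shapes are arbitrary examples of the tall-skinny and short-fat cases):

```python
import time
import numpy as np

def gemm_gflops(m, k, n, trials=3):
    """Best-of-N GFLOPS for an (m x k) @ (k x n) multiply via BLAS dgemm."""
    a, b = np.random.rand(m, k), np.random.rand(k, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2 * m * k * n / best / 1e9   # ~2*m*k*n flops per multiply

for shape in [(1000, 1000, 1000),   # square
              (100000, 100, 100),   # tall and skinny
              (100, 100, 100000)]:  # short and fat
    print(shape, f"{gemm_gflops(*shape):.1f} GFLOPS")
```

Even with the same total operation count, the achieved GFLOPS can differ substantially between shapes, which is why a square-only benchmark can be misleading.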
@dmbates: Would it be possible for you to point me in the right direction for substituting `dsytri` for `dsytri2`? I must admit, I am a little lost.
This will change after @ViralBShah commits his reorganization of the linear algebra code. For the moment, the source file base/lapack.jl has
```julia
for (syconv, syev, sysv, sytrf, sytri, sytrs, elty) in
    ((:dsyconv_,:dsyev_,:dsysv_,:dsytrf_,:dsytri2_,:dsytrs_,:Float64),
     (:ssyconv_,:ssyev_,:ssysv_,:ssytrf_,:ssytri2_,:ssytrs_,:Float32),
     (:zheconv_,:zheev_,:zhesv_,:zhetrf_,:zhetri2_,:zhetrs_,:Complex128),
     (:checonv_,:cheev_,:chesv_,:chetrf_,:chetri2_,:chetrs_,:Complex64))
    @eval begin
```
in the (SY) section. Just delete the 2's in those symbols.
@ViralBShah I'll let you decide if this should be changed in the master branch or left as is so as not to complicate your merging your branch.
Thanks - I can make that change on my branch. Will this have a performance impact?
-viral
@ViralBShah I doubt there will be a noticeable performance impact. The only time that `dsytri2` will be substantially faster than `dsytri` is when taking the inverse of a large symmetric non-positive-definite matrix, which is not an operation that is done frequently. Statisticians do use the inverse of a positive-definite symmetric matrix in calculating the variance-covariance matrix of parameter estimates in linear models, etc., but that is done with `dpotri`.
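As a side note, the `dpotri` path mentioned here (inverting a positive-definite matrix from its Cholesky factor) can be sketched through SciPy's raw LAPACK wrappers; this is an illustrative Python analogue, not code from the Julia source under discussion:

```python
import numpy as np
from scipy.linalg.lapack import dpotrf, dpotri

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 5))
xtx = x.T @ x                          # X'X from a linear model: positive definite

c, info = dpotrf(xtx, lower=0)         # Cholesky factor in the upper triangle
assert info == 0
inv, info = dpotri(c, lower=0)         # inverse computed from the factor
assert info == 0
# dpotri fills only the requested triangle, so symmetrize the result
inv = np.triu(inv) + np.triu(inv, 1).T

assert np.allclose(inv, np.linalg.inv(xtx))
```

The factor-then-invert sequence mirrors how a variance-covariance matrix is typically computed from `X'X`.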
I think that the argument `lwork` is only in `xsytri2`, and hence the fix is slightly more than deleting the 2's.
@andreasnoackjensen You're correct. The function should be changed to
```julia
function sytri!(uplo::LapackChar, A::StridedMatrix{$elty}, ipiv::Vector{Int32})
    chkstride1(A)
    chksquare(A)
    n = size(A,1)
    work = Array($elty, n)
    info = Array(Int32, 1)
    ccall(dlsym(Base.liblapack, $(string(sytri))), Void,
          (Ptr{Uint8}, Ptr{Int32}, Ptr{$elty}, Ptr{Int32},
           Ptr{Int32}, Ptr{$elty}, Ptr{Int32}, Ptr{Int32}),
          &uplo, &n, A, &stride(A,2), ipiv, work, info)
    if info[1] != 0 throw(LapackException(info[1])) end
    A
end
```
In other words, remove `lwork`, change the dimension of `work` to `n`, and delete the `for i in 1:2` loop.
Thanks @dmbates. Exactly what I was looking for. To anyone else attempting this, you should note that you need to remove the `Ptr{Int32}` element corresponding to the `lwork` term as well; otherwise `ccall()` thinks you haven't passed enough parameters in.
@staticfloat Right again. Thanks for catching that.
When I specify `USE_SYSTEM_LAPACK=1 USE_SYSTEM_BLAS=1` on my make command line, everything compiles fine, but when I run `make test` I get the now-familiar failure on the `lapack` step. This is because `dsytri2` does not exist in the version of LAPACK that is in Accelerate. (I'm running 10.8.2, so I doubt any other versions of OS X have it either.) For a quick-and-dirty performance comparison, I ran `mean([peakflops() for x=1:10])` on a Julia compile with OpenBLAS and on a Julia compile with Accelerate.
There's no appreciable difference, in my opinion, so for the sake of focused effort, I propose not littering the Julia source with #ifdefs and hanging Accelerate out to dry.