Closed staticfloat closed 12 years ago
I agree that we should want to use Accelerate. However, the right thing may be to only use BLAS from Accelerate, and download LAPACK. LAPACK is still under active development, and it would be nice to have all our users on the same version of LAPACK.
cc: @dmbates
It would not be a big deal to revert to using `dsytri` rather than `dsytri2`. In the long run it would be good to keep up with developments in LAPACK, but it is not worthwhile losing access to Accelerate for this one, infrequently used, subroutine.
I'm not sure why a comparison of results of `peakflops` would be relevant. That just measures the speed of `dgemm` calculations.
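For context, a `peakflops`-style number is just a timed `dgemm`: time one large matrix multiply and divide the operation count by the elapsed time. A rough sketch of that measurement (in Python with NumPy rather than Julia, assuming NumPy is linked against an optimized BLAS) would be:

```python
import time
import numpy as np

def approx_peakflops(n=2000, trials=3):
    """Estimate sustained GFLOPS by timing an n-by-n matrix multiply,
    which NumPy dispatches to the underlying BLAS dgemm."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b                      # BLAS dgemm call
        best = min(best, time.perf_counter() - t0)
    # a dense n-by-n matmul performs roughly 2*n^3 floating-point operations
    return 2 * n**3 / best / 1e9

print(f"~{approx_peakflops():.1f} GFLOPS")
```

Because the whole measurement reduces to one `dgemm`, it says nothing about the rest of LAPACK, which is the point being made above.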
`peakflops` is just what I had seen used previously. Do we have a better method of testing overall LAPACK performance?
@staticfloat Perhaps I misunderstood the purpose. A call to `peakflops` measures the speed of one of the BLAS routines. I thought you wanted to compare `dsytri` and `dsytri2`. I really don't expect there to be much of a difference but, as shown in the discussion of #1312, you don't really know until you check.
If Accelerate is not substantially faster than OpenBLAS on OS X, is there a reason for using Accelerate? I am not opposed to Accelerate - I'm just asking.
Honestly, I don't know if there is any speed difference between the two. Since I have experience building Julia on OS X, I was going to try to arrange a linear algebra shootout, but then discovered this incompatibility, which crept in a few weeks ago.
I think my last sentence above was written poorly; it should have been: "I propose not littering Julia source with #ifdefs to retain compatibility with two versions of LAPACK, therefore I propose hanging Accelerate out to dry".
However, I am going to hold off on saying that until we can prove that OpenBLAS is at least comparable to Accelerate in terms of performance. @dmbates, do you have a list of lapack functions that I can run that we should concern ourselves with, performance-wise? I'll do the shootout if you tell me how.
There are several Lapack-based decompositions and operations tested in test/lapack.jl. It would not be worthwhile timing these operations on 10 by 10 matrices as they don't take sufficient time to provide meaningful results. Perhaps by bumping the sizes to 1000 by 1000 for square matrices and 10000 by 100 for rectangular it would become more meaningful.
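A minimal timing harness along those lines, with the larger sizes suggested above, could be sketched as follows (in Python/SciPy rather than against test/lapack.jl; the set of decompositions is an illustrative subset):

```python
import time
import numpy as np
import scipy.linalg as la

def time_op(f, *args, trials=3):
    """Return the best-of-N wall-clock time for f(*args)."""
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        f(*args)
        best = min(best, time.perf_counter() - t0)
    return best

square = np.random.rand(1000, 1000)
spd = square @ square.T + 1000 * np.eye(1000)  # symmetric positive definite
tall = np.random.rand(10000, 100)

for name, f, arg in [("lu", la.lu, square),
                     ("cholesky", la.cholesky, spd),
                     ("qr (square)", la.qr, square),
                     ("qr (tall)", la.qr, tall)]:
    print(f"{name:12s} {time_op(f, arg):.3f} s")
```

Running the same script against builds linked to different BLAS/LAPACK libraries gives directly comparable numbers for each decomposition.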
Alright. I'll take a look at this tonight, see if I can put something together. -E
On my machine at least (the "official" benchmark machine), Accelerate and MKL (via Matlab) used to both be significantly faster than OpenBLAS, but recent versions of OpenBLAS have closed and even reversed that gap — and will probably keep getting better. I think it's desirable to continue to support Accelerate easily, however, if it's not too difficult, which it doesn't seem to be.
@StefanKarpinski I know that there has been discussion previously about how difficult making the entire performance suite run on different machines would be, but for the specific test of LAPACK/BLAS performance, how did you put that together?
Not sure what you mean — which LAPACK/BLAS performance test? I was mainly just talking about DGEMM performance, as measured by the rand_mat_mul benchmark, although OpenBLAS performance has also improved on other primitives too.
Oh, I assumed that when you said "Accelerate and MKL (via Matlab) used to both be significantly faster than OpenBLAS" you had some method of testing more than just `dgemm`. I don't know which BLAS routines are the most important to test; I'm assuming we should test more than just `dgemm` to really measure the difference between two implementations?
It was kind of an anecdotal observation, definitely not broadly tested. test/lapack.jl is probably a decent performance suite as well, although the problems may not be big enough.
OpenBLAS was slower than Accelerate because it did not have Sandy Bridge support until recently. In general, OpenBLAS will always be slower on newer architectures as it catches up, while MKL will have support for new architectures from day one.
Much of LAPACK performance depends on DGEMM. If we were to set up a benchmark, I would try Accelerate (BLAS + LAPACK) vs. OpenBLAS (BLAS + LAPACK) vs. Accelerate BLAS + Netlib LAPACK. My guess is that the performance difference from using Accelerate or MKL LAPACK is not noticeable, except perhaps in a few cases. We can even check the Apple opensource repositories to see if they have changed anything in LAPACK.
What would be interesting to see is performance of matrices with weird aspect ratios (tall and skinny, short and fat), etc. BLAS performance varies dramatically. Also try decompositions of weird shapes.
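To illustrate the aspect-ratio point, one could time `dgemm` across shapes like so (a Python/NumPy sketch; the particular shapes are arbitrary examples of the tall-skinny and short-fat cases):

```python
import time
import numpy as np

def gemm_gflops(m, k, n, trials=3):
    """Best-of-N GFLOPS for an (m x k) @ (k x n) multiply via BLAS dgemm."""
    a, b = np.random.rand(m, k), np.random.rand(k, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2 * m * k * n / best / 1e9   # ~2*m*k*n flops per multiply

for shape in [(1000, 1000, 1000),   # square
              (100000, 100, 100),   # tall and skinny
              (100, 100, 100000)]:  # short and fat
    print(shape, f"{gemm_gflops(*shape):.1f} GFLOPS")
```

Even with the same total operation count, the achieved GFLOPS can differ substantially between shapes, which is why a square-only benchmark can be misleading.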
@dmbates: Would it be possible for you to point me in the right direction for substituting `dsytri` for `dsytri2`? I must admit, I am a little lost.
This will change after @ViralBShah commits his reorganization of the linear algebra code. For the moment, the source file base/lapack.jl has
```julia
for (syconv, syev, sysv, sytrf, sytri, sytrs, elty) in
    ((:dsyconv_,:dsyev_,:dsysv_,:dsytrf_,:dsytri2_,:dsytrs_,:Float64),
     (:ssyconv_,:ssyev_,:ssysv_,:ssytrf_,:ssytri2_,:ssytrs_,:Float32),
     (:zheconv_,:zheev_,:zhesv_,:zhetrf_,:zhetri2_,:zhetrs_,:Complex128),
     (:checonv_,:cheev_,:chesv_,:chetrf_,:chetri2_,:chetrs_,:Complex64))
    @eval begin
```
in the (SY) section. Just delete the 2's in those symbols.
@ViralBShah I'll let you decide if this should be changed in the master branch or left as is so as not to complicate your merging your branch.
Thanks - I can make that change on my branch. Will this have a performance impact?
-viral
@ViralBShah I doubt there will be a noticeable performance impact. The only time that `dsytri2` will be substantially faster than `dsytri` is when taking the inverse of a large symmetric non-positive-definite matrix, which is not an operation that is done frequently. Statisticians do use the inverse of a positive-definite symmetric matrix in calculating the variance-covariance matrix of parameter estimates in linear models, etc., but that is done with `dpotri`.
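As a side note, the `dpotri` path mentioned here (inverting a positive-definite matrix from its Cholesky factor) can be sketched through SciPy's raw LAPACK wrappers; this is an illustrative Python analogue, not code from the Julia source under discussion:

```python
import numpy as np
from scipy.linalg.lapack import dpotrf, dpotri

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 5))
xtx = x.T @ x                          # X'X from a linear model: positive definite

c, info = dpotrf(xtx, lower=0)         # Cholesky factor in the upper triangle
assert info == 0
inv, info = dpotri(c, lower=0)         # inverse computed from the factor
assert info == 0
# dpotri fills only the requested triangle, so symmetrize the result
inv = np.triu(inv) + np.triu(inv, 1).T

assert np.allclose(inv, np.linalg.inv(xtx))
```

The factor-then-invert sequence mirrors how a variance-covariance matrix is typically computed from `X'X`.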
I think that the argument `lwork` is only in `xsytri2`, and hence the fix is slightly more than deleting the 2's.
@andreasnoackjensen You're correct. The function should be changed to
```julia
function sytri!(uplo::LapackChar, A::StridedMatrix{$elty}, ipiv::Vector{Int32})
    chkstride1(A)
    chksquare(A)
    n = size(A,1)
    work = Array($elty, n)
    info = Array(Int32, 1)
    ccall(dlsym(Base.liblapack, $(string(sytri))), Void,
          (Ptr{Uint8}, Ptr{Int32}, Ptr{$elty}, Ptr{Int32},
           Ptr{Int32}, Ptr{$elty}, Ptr{Int32}, Ptr{Int32}),
          &uplo, &n, A, &stride(A,2), ipiv, work, info)
    if info[1] != 0 throw(LapackException(info[1])) end
    A
end
```
In other words, remove `lwork`, change the dimension of `work` to `n`, and delete the `for i in 1:2` loop.
Thanks @dmbates. Exactly what I was looking for. To anyone else attempting this, you should note that you need to remove the `Ptr{Int32}` element corresponding to the `lwork` term as well; otherwise `ccall()` thinks you haven't passed enough parameters in.
@staticfloat Right again. Thanks for catching that.
When I specify `USE_SYSTEM_LAPACK=1 USE_SYSTEM_BLAS=1` on my make command line, everything compiles fine, but when I run `make test` I get the now-familiar failure on the `lapack` step. This is because `dsytri2` does not exist in the version of LAPACK that is in Accelerate. (I'm running 10.8.2, so I doubt any other versions of OS X have it either.) For a quick-and-dirty performance comparison, I ran `mean([peakflops() for x=1:10])` on a Julia compile with OpenBLAS and on a Julia compile with Accelerate.
There's no appreciable difference, in my opinion, so for the sake of focused effort, I propose not littering the Julia source with #ifdefs and hanging Accelerate out to dry.