Closed andreasnoack closed 11 years ago
There is an approximate equality test in Julia (isapprox
, hiding in https://github.com/JuliaLang/julia/blob/master/extras/nearequal.jl for legacy reasons). I doubt that's the right answer here, but I'll let y'all hash that one out.
Personally, I think it is the right answer here. Even something like:
julia> a = randn(); b = randn(); c = randn(); c*(a*b) - b*(a*c)
-1.1102230246251565e-16
Shows that the noise floor of even the simplest floating-point operations is around 10e-16
. ==
certainly shouldn't do anything except test exact equality, but issymetric
and ishermitian
probably should, as the only time they will be exactly equal is in a deliberately constructed case.
Yes, these functions should check for exact symmetry. One can always have an approximate symmetry test, but that is often application specific, and should be a 5 liner to implement in any application that needs it.
So that's a "won't fix"?
One thing to bear in mind is the possibility of defining types for symmetric/hermitian matrices, including just storing one triangle of the matrix, and triangular matrices. In Eigen (eigen.tuxfamily.org) they provide symmetric/hermitian and triangular "views" of matrices which may be a way to go. That way you don't need to keep performing the same tests checking for special structure when operating on the matrix.
One thing to bear in mind is the possibility of defining types for symmetric/hermitian matrices, including just storing one triangle of the matrix, and triangular matrices
I like that, (It would be especially useful for e.g. optimization work, where Hermitian and especially Positive-Semidefinite/Definite matrices can give serious bonuses) it's a much more elegant solution (perhaps this is why MATLAB doesn't have an issymmetric()
) but it seems like a lot of work. :)
I thought of won't fix, but I thought it was not broken...
@dmbates We could easily do the same, but then, those symmetric triangular matrices would have to interact with other regular matrices for every operation, and need their own indexing and such. If we can think of a design that works well with the rest of the stuff - either implement the full combinatorial explosion or a meaningful subset, it would definitely help a lot of codes.
@andreasnoackjensen You are right that the symmetry tests are likely to slow down various solves. The \
polyalgorithm has not been performance tested at all.
It is interesting that the threshold is what is killing the symmetry of A'.*A, whereas syrk
does the right thing. Perhaps the issue here is to figure this out, rather than change the behaviour of issymmetric
. The threshold is in place simply based on some simple performance tests.
I am reopening this issue, and changing the title to reflect this.
@dmbates Maybe an intermediate solution could be just to store some Bools indicating symmetry, definiteness etc. together with the matrix but keep the full storage. Then the usual methods apply to the matrix and you could take advantage of better solvers.
@ViralBShah On my machine gemm_wrapper
is only faster for very small problems where the computation time is tiny. The result of gemm
depends on the machine and the BLAS which I guess you already know. If I use vecLib the matrix becomes exactly symmetric. On my desktop with Ubuntu OpenBLAS gives exact symmetric matrices but MKL does not. Finally, with OpenBLAS on my Mac it is only when the number of columns is odd, e.g. I get this consistently
julia> for i = 1:10;mA=randn(10,i);println(i, " ", issym(mA'mA));end
1 true
2 true
3 true
4 true
5 false
6 true
7 false
8 true
9 false
10 true
I think it would be a good thing if Julia used symmetric solver for symmetric problems as default, but right now it never happens on my machine which motivated this issue. For eig
you also have the problem that the eigenvalues are ordered for the symmetric solver but not for the general solver.
Right now issym
is only used one place in arpack.jl and ishermitian
is only used in \
, eig
and ^(Matrix)
. Functions that very often would be working on floating point arrays, for which ishermitian
is unpredictable. Maybe I am asking silly questions, but in which situations do you need to check a floating point matrix for exact hermitianity?
@andreasnoackjensen I think the representation of Array
s is so deeply embedded in the Julia language that adding a few Boos at that level would be extremely difficult. That is why I suggested having symmetric/hermitian/triangular views of matrices. In other words, create a composite type, say
type SymmetricMatrix{T} <: AbstractMatrix{T}
A::Matrix{T}
upper::Bool
end
SymmetricMatrix{T}(A::Matrix{T}) = SymmetricMatrix{T}(A, true)
convert{T}(::Type{Array{T,2}}, A::SymmetricMatrix{T}) = symmetrize!(copy(A.A), A.upper)
size(A::SymmetricMatrix) = size(A.A)
size(A::SymmetricMatrix, d) = size(A.A, d)
issym(A::SymmetricMatrix) = true
eltype{T}(A::SymmetricMatrix{T}) = T
for (fname, elty) in ((:dsymv_,:Float64), (:ssymv_,:Float32),
(:zsymv_,:Complex128), (:csymv_,:Complex64))
@eval begin
# SUBROUTINE DSYMV(UPLO,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
# .. Scalar Arguments ..
# DOUBLE PRECISION ALPHA,BETA
# INTEGER INCX,INCY,LDA,N
# CHARACTER UPLO
# .. Array Arguments ..
# DOUBLE PRECISION A(LDA,*),X(*),Y(*)
function _jl_symv!(uplo, alpha::($elty), A::StridedMatrix{$elty}, X::StridedVector{$elty},
beta::($elty), Y::StridedVector{$elty})
m, n = size(A)
if m != n error("symv!: matrix A must be square") end
ccall(dlsym(Base.libblas, $(string(fname))), Void,
(Ptr{Uint8}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32},
Ptr{$elty}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32}),
&uplo, &n, &alpha, A, &stride(A,2), X, &stride(X,1), &beta, Y, &stride(Y,1))
Y
end
function _jl_symv(uplo, alpha::($elty), A::StridedMatrix{$elty}, X::StridedVector{$elty})
m, n = size(A)
if m != n error("symv: matrix A must be square") end
Y = Array($elty, n)
ccall(dlsym(Base.libblas, $(string(fname))), Void,
(Ptr{Uint8}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32},
Ptr{$elty}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32}),
&uplo, &n, &alpha, A, &stride(A,2), X, &stride(X,1), &0., Y, &stride(Y,1))
Y
end
end
end
function (*){T<:LapackType}(A::SymmetricMatrix{T}, B::StridedVecOrMatrix{T})
if size(B,1) != size(A,2)
error("Dimension mismatch: size(A,2) = $(size(A,2)) and size(B,1) = $(size(B,1))")
end
if isVector{B} return _jl_symv(A.uplo ? 'U' : 'L', A.A, 1., B) end
_jl_symm(...)
end
The problem with this approach is that you must then define all the arithmetic operations, as is done in sparse.jl
, which produces a combinatorial explosion of the number of methods needed. In some languages you can get around this by having an automatic coercion operator, which in this case would convert the SymmetricMatrix{T} to an Array{T,2} and then invoke the method for that class.
It may be worthwhile playing with the idea and seeing how far it can be pushed. The underlying BLAS and Lapack functions already exist but would need to be added to blas.jl
and lapack.jl
. The worst case that can happen for the user if the methods are not defined is to get an error message.
@andreasnoackjensen Regarding issym(mA'mA)
, I thought that the product mA'mA
would be evaluated with the Ac_mul_B
function which would detect that the A
and B
arguments are identical and end up calling Blas.syrk
followed by symmetrize!
. In other words, I would have though that it was impossible to get an answer that failed issym
. Can anyone correct me on why this isn't so?
Okay, I went back and looked at the code and I see what @andreasnoackjensen was referring to about syrk_wrapper versus gemm_wrapper. For some versions of the Blas, such as openBlas, it turns out that gemm is actually faster than syrk even though gemm calculates twice as many values. This is because the gemm operation is highly optimized. Nevertheless, I think that it would be a good idea to symmetrize the result in At_mul_B when is(A, B)
. One could argue about exactly how to symmetrize - the simple version copies the upper triangle to the lower triangle. As a statistician I probably should advocate for taking the average of the two values but it seems that it would be best to go with the simple version and save time.
My timings on a four-core processor with openBlas indicate that there is not much of an overhead in calling symmetrize!
after Blas.gemm
and that combination seems faster than Blas.syrk
followed by symmetrize!
@andreasnoackjensen Blas.syrk
does not need to be followed by symmetrize
, since it does the copying from the upper to the lower part. We just need to do it in the case we are not using syrk
, and instead using gemm
for performance.
http://www.netlib.org/blas/dsyrk.f
@xianyi Could you tell us why syrk
is not as fast as gemm
? Is this something that can be addressed quickly?
@dmbates I do like the suggestion of trying to push SymmetricMatrix
and see how far it can be taken.
It would be easier to do things like SymmetricMatrix
if the array operations code was more generic. There is no reason it can't be - but we do not want to do anything if it affects performance.
Let's do a new branch to work on this, if you guys are keen to try it out. I am all for it, and I am sure others will pitch in as well.
Hi @ViralBShah ,
We didn't modified the implementation of syrk from GotoBLAS. I just read Goto's paper (High-Performance Implementation of the Level-3 BLAS) and codes. I think this functions is already optimized.
However, syrk is not as fast as gemm. I think the reason is that syrk only accessed the lower or upper triangle matrix in matrix C. Besides calling GEMM kernel, it also calls a kernel to deal with the boundary.
Thank you
Xianyi
@ViralBShah I don't think you are correct about syrk symmetrizing the result. In fact the docs state explicitly that if uplo is 'U' then the lower triangle of C is not referenced and vice-versa for 'L'. You can also try it.
Not sure if I mentioned it but I am travelling for the next week and may not be able to participate actively in discussions.
Lets try out SymmetricMatrix
and see how it works. I can put some work into it.
I agree with @dmbates that syrk
needs symmetrize!
and also that an easy fix could be to call symmetrize!
after gemm
.
I did some benchmarking on the operation A'A
for square matrices and on my machine gemm
is only faster for matrices smaller than 20 (and the difference is tiny) so maybe a cut of at 500 is too large.
@andreasnoackjensen Could you post your perf test code. Perhaps all of us can run it, and then just use a new threshold, which takes symmetrize!
into account.
@xianyi Thanks for the note on syrk
.
Here is what I did:
simvals = 10:500
resmat=zeros(length(simvals), 3)
for i = 1:length(simvals)
mA = randn(simvals[i],simvals[i])
resmat[i,1]=simvals[i]
resmat[i,2]=median([@elapsed Base.gemm_wrapper('T','N',mA,mA) for j = 1:100])
resmat[i,3]=median([@elapsed Base.syrk_wrapper('T',mA) for j = 1:100])
end
The timings are very fast and heavily distorted when something else happened. Therefore I used the median in stead the mean.
I usually go even further with benchmarks and use the min, although I've sometimes thought that perhaps that's a bit too generous.
Yes, I agree - I tend to use min for this kind of a thing.
Just to be safe, I think one should do this before running the benchmark.
ENV["OPENBLAS_NUM_THREADS"]="1"
Here's a slightly modified version I used. It turns out that gemm + symmetrize loses any real gain. I think we are better off just calling syrk
.
for i = 1:length(simvals)
mA = randn(simvals[i],simvals[i])
resmat[i,1]=simvals[i]
resmat[i,2]=min([@elapsed (Base.gemm_wrapper('T','N',mA,mA); symmetrize!(mA)) for j = 1:5])
resmat[i,3]=min([@elapsed Base.syrk_wrapper('T',mA) for j = 1:5])
end
@andreasnoackjensen If we just make A'*A
always just call syrk
or herk
- we should be able to avoid the need to call symmetrize!
. Could you verify that?
@ViralBShah Blas.syrk
and Blas.herk
only computes a triangle but the wrapper functions do a (conj) symmetrize!
to store the full matrix. Just to avoid confusion
julia> mA=randn(2,2)
2x2 Float64 Array:
0.338197 -0.608599
0.0702311 0.560439
julia> Blas.syrk('U','T',1.0,mA)
2x2 Float64 Array:
0.119309 -0.166466
2.13043e-314 0.684485
julia> Base.syrk_wrapper('T',mA)
2x2 Float64 Array:
0.119309 -0.166466
-0.166466 0.684485
Using min
and running single threaded, I get that Blas.syrk
is always faster than Blas.gemm
but that Blas.syrk+symmetrize!
is faster than Blas.gemm
only when n>~40. However the difference is so small that I think Blas.syrk+symmetrize!
is the better solution because it ensures symmetry.
By the way. I am unable to make OpenBLAS run single threaded when I set ENV["OPENBLAS_NUM_THREADS"]="1"
. I need to export OPENBLAS_NUM_THREADS=1
in the terminal before launching Julia.
By the way. I am unable to make OpenBLAS run single threaded when I set ENV["OPENBLAS_NUM_THREADS"]="1". I need to export OPENBLAS_NUM_THREADS=1 in the terminal before launching Julia.
Yeah, this seems to be checked at load-time by OpenBLAS, so setting it later has no effect. @xianyi?
I have tried to continue the SymmetricMatrix
type that @dmbates laid out earlier. It is at the symmat branch of https://github.com/andreasnoackjensen/julia. As an example
julia> mA=randn(5,5)
5x5 Float64 Array:
-0.607973 -1.65633 -0.954783 -0.212449 -0.0746644
1.05804 0.825478 -0.0415001 -1.04721 -0.680162
0.0692246 -1.34573 -0.244366 0.0101776 -0.367226
-0.197326 1.33238 0.266142 -0.794161 -1.24781
0.112272 0.276713 1.36922 -0.184204 0.920406
julia> mAs=mA'mA
5x5 Float64 SymmetricMatrix:
1.54542 1.55539 0.620866 -0.8421 -0.350108
1.55539 7.08763 2.60951 -1.63536 -1.35147
0.620866 2.60951 2.91865 -0.219761 1.1174
-0.8421 -1.63536 -0.219761 1.80652 1.54582
-0.350108 -1.35147 1.1174 1.54582 3.00722
julia> eig(mAs)
([0.295804, 0.832826, 1.36709, 4.49762, 9.37209],
5x5 Float64 Array:
-0.272328 0.581043 0.729841 -0.0367931 -0.23282
0.203881 0.263756 -0.402038 0.0628342 -0.85046
-0.437102 -0.572683 0.212512 0.565214 -0.341096
-0.592851 0.464507 -0.490319 0.353324 0.259828
0.584585 0.221561 0.141857 0.741885 0.196608)
julia> mAs[3,2]
2.609510896179169
julia> mAs[3]
0.6208655135640481
julia> mA=randn(51,51);mAs1=mA'mA;mAs2=Base.gemm_wrapper('T','N',mA,mA);
julia> issym(mAs2)
false
julia> min([@elapsed eig(mAs1) for i = 1:10])
0.0006330013275146484
julia> min([@elapsed eig(mAs2) for i = 1:10])
0.0016489028930664062
I wondered if it is necessary to have the choice of storing the lower triangle. Many cases could be removed from the code if it wasn't.
@andreasnoackjensen Let's move the SymmetricMatrix discussion to a new issue. Alternatively, you could also make it a pull request, while it is being developed, so that people can comment/follow/etc.
@StefanKarpinski, You are right. OpenBLAS only check OPENBLAS_NUM_THREADS at load-time.
If you want to control the number of the threads on runtime, please try to call
void openblas_set_num_threads(int );
Xianyi
@xianyi I noticed that CPU usage is 370% on my Mac. Does that mean that openblas is using 4 threads on my dual core CPU (hyperthreading perhaps?) Does this affect performance?
@ViralBShah; if you're seeing 370% it means you have at least 4 logical cores on your machine. Hyperthreading will indeed double your logical cores, you can use something like your Activity Monitor to see how many threads are running.
If a BLAS is doing its job, hyperthreading should not work well since a well-tuned BLAS will have each CPU's pipeline full with no bubble for the hyperthreading multiplexer to exploit.
This is something that we can test by exporting OPENBLAS_NUM_THREADS=X
for X = 1,2,4 and checking peakflops
(or something more useful as a measuring rod of performance) as I don't think tools like top
or the Activity Monitor are going to give useful results here. (I would, but my machines don't have HT)
Hi @ViralBShah ,
OpenBLAS can avoid using HT on Linux, which means don't run the thread on the logical core. However, it didn't support this feature on Mac. Thus, you must set the number of the thread.
Xianyi
Here's what I get for peakflops(). I would have expected to see 40 GFlops with 2 processors, multiplying 2000x2000 matrices. @xianyi would this be some kind of affinity issue on mac?
OPENBLAS_NUM_THREADS=1 20 GFlops
OPENBLAS_NUM_THREADS=2 32 GFlops
OPENBLAS_NUM_THREADS=3 32 GFlops
OPENBLAS_NUM_THREADS=4 20 GFlops
@ViralBShah , what is your CPU on your Mac?
Processor Name: Intel Core i5 (openblas compiles sandybridge kernel) Processor Speed: 2.4 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 3 MB Memory: 8 GB
@ViralBShah, the peak performance is
2.4GHz * 2 pipelines * 4 width of SIMD * 2 cores = 38.4 GFLOPS
Thus, the efficiency of OpenBLAS DGEMM is about 83%. In our experiment, the efficiency is about 89% On the mainstream sandy bridge CPU.
Could you try arger matrices? For example, 5000 x 5000.
Xianyi
@xianyi I cannot get more than 32 GFLOPS, even with 5000x5000. Perhaps the laptop version has different characteristics (I am on a macbook pro)?
Ok so can we just remove the size cutoff and always call syrk and herk?
Yes
I tried to write som test code for linear algebra but it failed because of
ishermitian
andissymmetric
. For exampleI know that floating point representations are not exact, but I would expect
==(Float,Float)
to be aware of this. Right now, the solver\(Matrix,VecOrMat)
usesishermitian
, but I guess that it almost never returnstrue
. Should the requirement be more likeabs(Float-Float)<sqrt(eps())
?I then had a look at
At_mul_B
and was surprised by the the requirement ofsize(A,1)>500
before callingsyrk_wrapper
whenB=A
. Why is it so? The tests do not fail on matrices return fromsyrk
so that would to some extend solve the first problem, but it should also be faster.Finally, is much gained from checking for symmetri/hermitianity before calling the solver in
\
? It looks like the test consumes the benefit from the more efficient solver.