JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

issym returns false on A.'*A for matrices when syrk is not used #1516

Closed andreasnoack closed 11 years ago

andreasnoack commented 11 years ago

I tried to write some test code for linear algebra, but it failed because of ishermitian and issymmetric. For example:

julia> for i = 1:20
       mA=randn(5,5)
       println(issym(mA'mA))
       end
false
true
false
false
false
false
false
false
false
false
false
false
false
false
false
false
true
false
false
false

I know that floating-point representations are not exact, but I would expect ==(Float,Float) to be aware of this. Right now, the solver \(Matrix,VecOrMat) uses ishermitian, but I guess that it almost never returns true. Should the requirement be more like abs(Float-Float)<sqrt(eps())?
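
For illustration, an approximate test along those lines could be as simple as the sketch below. Note that issym_approx is a hypothetical name, and the absolute sqrt(eps()) tolerance is just the suggestion above, not anything that exists in Base.

function issym_approx(A::AbstractMatrix, tol)
    m, n = size(A)
    if m != n return false end
    for j = 1:n, i = 1:j-1
        # compare mirrored entries up to the given absolute tolerance
        if abs(A[i,j] - A[j,i]) > tol return false end
    end
    return true
end
issym_approx(A::AbstractMatrix) = issym_approx(A, sqrt(eps()))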

I then had a look at At_mul_B and was surprised by the requirement of size(A,1)>500 before calling syrk_wrapper when B=A. Why is it so? The tests do not fail on matrices returned from syrk, so that would to some extent solve the first problem, but it should also be faster.

Finally, is much gained from checking for symmetry/hermitianity before calling the solver in \? It looks like the test consumes the benefit of the more efficient solver.

pao commented 11 years ago

There is an approximate equality test in Julia (isapprox, hiding in https://github.com/JuliaLang/julia/blob/master/extras/nearequal.jl for legacy reasons). I doubt that's the right answer here, but I'll let y'all hash that one out.

staticfloat commented 11 years ago

Personally, I think it is the right answer here. Even something like:

julia> a = randn(); b = randn(); c = randn(); c*(a*b) - b*(a*c)
-1.1102230246251565e-16

Shows that the noise floor of even the simplest floating-point operations is around 1e-16. == certainly shouldn't do anything except test exact equality, but issymmetric and ishermitian probably should, as the only time the entries will be exactly equal is in a deliberately constructed case.

ViralBShah commented 11 years ago

Yes, these functions should check for exact symmetry. One can always have an approximate symmetry test, but that is often application-specific and should be a five-liner to implement in any application that needs it.

StefanKarpinski commented 11 years ago

So that's a "won't fix"?

dmbates commented 11 years ago

One thing to bear in mind is the possibility of defining types for symmetric/hermitian matrices, including ones that store just one triangle of the matrix, and for triangular matrices. In Eigen (eigen.tuxfamily.org) they provide symmetric/hermitian and triangular "views" of matrices, which may be a way to go. That way you don't need to keep performing the same tests for special structure when operating on the matrix.

staticfloat commented 11 years ago

One thing to bear in mind is the possibility of defining types for symmetric/hermitian matrices, including just storing one triangle of the matrix, and triangular matrices

I like that. (It would be especially useful for, e.g., optimization work, where Hermitian and especially positive-semidefinite/definite matrices can give serious bonuses.) It's a much more elegant solution (perhaps this is why MATLAB doesn't have an issymmetric()), but it seems like a lot of work. :)

ViralBShah commented 11 years ago

I thought of won't fix, but I also thought it was not broken in the first place...

@dmbates We could easily do the same, but then those symmetric/triangular matrices would have to interact with regular matrices for every operation, and would need their own indexing and such. If we can think of a design that works well with the rest of the system, whether we implement the full combinatorial explosion or a meaningful subset, it would definitely help a lot of code.

ViralBShah commented 11 years ago

@andreasnoackjensen You are right that the symmetry tests are likely to slow down various solves. The \ polyalgorithm has not been performance tested at all.

It is interesting that the threshold is what is killing the symmetry of A.'*A, whereas syrk does the right thing. Perhaps the issue here is to figure this out rather than change the behaviour of issymmetric. The threshold is in place based on some simple performance tests.

I am reopening this issue, and changing the title to reflect this.

andreasnoack commented 11 years ago

@dmbates Maybe an intermediate solution could be to store some Bools indicating symmetry, definiteness, etc. together with the matrix but keep the full storage. Then the usual methods would apply to the matrix, and you could take advantage of better solvers.
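
A minimal sketch of that flag-carrying idea, using hypothetical names (nothing like this exists in Base):

type FlaggedMatrix{T} <: AbstractMatrix{T}
    A::Matrix{T}        # full dense storage is kept
    symmetric::Bool     # known structure, recorded once at construction
    posdef::Bool
end

size(A::FlaggedMatrix)  = size(A.A)
issym(A::FlaggedMatrix) = A.symmetric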

@ViralBShah On my machine gemm_wrapper is only faster for very small problems, where the computation time is tiny. The result of gemm depends on the machine and the BLAS, which I guess you already know. If I use vecLib, the matrix becomes exactly symmetric. On my Ubuntu desktop, OpenBLAS gives exactly symmetric matrices but MKL does not. Finally, with OpenBLAS on my Mac, symmetry fails only when the number of columns is odd; e.g. I get this consistently:

julia> for i = 1:10;mA=randn(10,i);println(i, " ", issym(mA'mA));end
1 true
2 true
3 true
4 true
5 false
6 true
7 false
8 true
9 false
10 true

I think it would be a good thing if Julia used a symmetric solver for symmetric problems by default, but right now that never happens on my machine, which is what motivated this issue. For eig you also have the problem that the eigenvalues are ordered for the symmetric solver but not for the general solver.

Right now issym is used in only one place in arpack.jl, and ishermitian is only used in \, eig and ^(Matrix), all functions that very often operate on floating-point arrays, for which ishermitian is unpredictable. Maybe I am asking silly questions, but in which situations do you need to check a floating-point matrix for exact hermitianity?

dmbates commented 11 years ago

@andreasnoackjensen I think the representation of Arrays is so deeply embedded in the Julia language that adding a few Bools at that level would be extremely difficult. That is why I suggested having symmetric/hermitian/triangular views of matrices. In other words, create a composite type, say

type SymmetricMatrix{T} <: AbstractMatrix{T}
    A::Matrix{T}
    upper::Bool
end

SymmetricMatrix{T}(A::Matrix{T}) = SymmetricMatrix{T}(A, true)
convert{T}(::Type{Array{T,2}}, A::SymmetricMatrix{T}) = symmetrize!(copy(A.A), A.upper)

size(A::SymmetricMatrix)         = size(A.A)
size(A::SymmetricMatrix, d)      = size(A.A, d)
issym(A::SymmetricMatrix)        = true
eltype{T}(A::SymmetricMatrix{T}) = T

for (fname, elty) in ((:dsymv_,:Float64), (:ssymv_,:Float32),
                      (:zsymv_,:Complex128), (:csymv_,:Complex64))
   @eval begin
       #      SUBROUTINE DSYMV(UPLO,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
       #     .. Scalar Arguments ..
       #      DOUBLE PRECISION ALPHA,BETA
       #      INTEGER INCX,INCY,LDA,N
       #      CHARACTER UPLO
       #     .. Array Arguments ..
       #      DOUBLE PRECISION A(LDA,*),X(*),Y(*)
       function _jl_symv!(uplo, alpha::($elty), A::StridedMatrix{$elty}, X::StridedVector{$elty},
                      beta::($elty), Y::StridedVector{$elty})
           m, n = size(A)
           if m != n error("symv!: matrix A must be square") end
           ccall(dlsym(Base.libblas, $(string(fname))), Void,
                 (Ptr{Uint8}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32},
                 Ptr{$elty}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32}),
                 &uplo, &n, &alpha, A, &stride(A,2), X, &stride(X,1), &beta, Y, &stride(Y,1))
           Y
       end
       function _jl_symv(uplo, alpha::($elty), A::StridedMatrix{$elty}, X::StridedVector{$elty})
           m, n = size(A)
           if m != n error("symv: matrix A must be square") end
           Y = Array($elty, n)
           ccall(dlsym(Base.libblas, $(string(fname))), Void,
                 (Ptr{Uint8}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32},
                 Ptr{$elty}, Ptr{Int32}, Ptr{$elty}, Ptr{$elty}, Ptr{Int32}),
                 &uplo, &n, &alpha, A, &stride(A,2), X, &stride(X,1), &0., Y, &stride(Y,1))
           Y
       end
   end
end

function (*){T<:LapackType}(A::SymmetricMatrix{T}, B::StridedVecOrMat{T})
    if size(B,1) != size(A,2)
        error("Dimension mismatch: size(A,2) = $(size(A,2)) and size(B,1) = $(size(B,1))")
    end
    # symv handles the vector case; a symm wrapper would handle the matrix case
    if isa(B, Vector) return _jl_symv(A.upper ? 'U' : 'L', one(T), A.A, B) end
    _jl_symm(...)
end

The problem with this approach is that you must then define all the arithmetic operations, as is done in sparse.jl, which produces a combinatorial explosion in the number of methods needed. In some languages you can get around this with an automatic coercion operator, which in this case would convert the SymmetricMatrix{T} to an Array{T,2} and then invoke the method for that class.
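
One hedged way to approximate that coercion in Julia is a pair of generic fallbacks that densify before dispatching; this is only a sketch against the SymmetricMatrix type above, not a tested design:

# fall back to the dense Matrix methods instead of writing a method for
# every combination of operator and matrix type
(+)(A::SymmetricMatrix, B::AbstractMatrix) = convert(Array{eltype(A),2}, A) + B
(+)(A::AbstractMatrix, B::SymmetricMatrix) = A + convert(Array{eltype(B),2}, B)

A real version would still need to resolve the ambiguity when both arguments are SymmetricMatrix, which hints at why the combinatorial explosion is hard to escape entirely.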

It may be worthwhile playing with the idea and seeing how far it can be pushed. The underlying BLAS and Lapack functions already exist but would need to be added to blas.jl and lapack.jl. The worst that can happen for the user, if the methods are not defined, is to get an error message.

dmbates commented 11 years ago

@andreasnoackjensen Regarding issym(mA'mA), I thought that the product mA'mA would be evaluated with the Ac_mul_B function, which would detect that the A and B arguments are identical and end up calling Blas.syrk followed by symmetrize!. In other words, I would have thought that it was impossible to get an answer that failed issym. Can anyone explain why this isn't so?

dmbates commented 11 years ago

Okay, I went back and looked at the code, and I see what @andreasnoackjensen was referring to about syrk_wrapper versus gemm_wrapper. For some versions of the BLAS, such as OpenBLAS, it turns out that gemm is actually faster than syrk even though gemm calculates twice as many values, because the gemm operation is highly optimized. Nevertheless, I think it would be a good idea to symmetrize the result in At_mul_B when is(A, B). One could argue about exactly how to symmetrize: the simple version copies the upper triangle to the lower triangle. As a statistician I should probably advocate taking the average of the two values, but it seems best to go with the simple version and save time.
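
For reference, the simple copy-based version might look like the sketch below; the actual symmetrize! may differ in signature and in which triangle it trusts.

function symmetrize!(A)
    m, n = size(A)
    if m != n error("symmetrize!: matrix must be square") end
    for j = 1:n, i = j+1:n
        A[i,j] = A[j,i]   # copy the upper triangle onto the lower
    end
    A
end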

My timings on a four-core processor with OpenBLAS indicate that there is not much overhead in calling symmetrize! after Blas.gemm, and that combination seems faster than Blas.syrk followed by symmetrize!.

ViralBShah commented 11 years ago

@andreasnoackjensen Blas.syrk does not need to be followed by symmetrize!, since it does the copying from the upper to the lower part. We just need to do it in the case where we are not using syrk and are instead using gemm for performance.

http://www.netlib.org/blas/dsyrk.f

@xianyi Could you tell us why syrk is not as fast as gemm? Is this something that can be addressed quickly?

ViralBShah commented 11 years ago

@dmbates I do like the suggestion of trying to push SymmetricMatrix and see how far it can be taken.

ViralBShah commented 11 years ago

It would be easier to do things like SymmetricMatrix if the array operations code were more generic. There is no reason it can't be, but we do not want to do anything that affects performance.

Let's do a new branch to work on this, if you guys are keen to try it out. I am all for it, and I am sure others will pitch in as well.

xianyi commented 11 years ago

Hi @ViralBShah ,

We didn't modify the implementation of syrk from GotoBLAS. I just read Goto's paper (High-Performance Implementation of the Level-3 BLAS) and the code. I think this function is already optimized.

However, syrk is not as fast as gemm. I think the reason is that syrk only accesses the lower or upper triangle of matrix C. Besides calling the GEMM kernel, it also calls a kernel to deal with the boundary.

Thank you

Xianyi

dmbates commented 11 years ago

@ViralBShah I don't think you are correct about syrk symmetrizing the result. In fact, the docs state explicitly that if uplo is 'U' then the lower triangle of C is not referenced, and vice versa for 'L'. You can also try it.

Not sure if I mentioned it, but I am travelling for the next week and may not be able to participate actively in discussions.

andreasnoack commented 11 years ago

Let's try out SymmetricMatrix and see how it works. I can put some work into it.

I agree with @dmbates that syrk needs symmetrize! and also that an easy fix could be to call symmetrize! after gemm.

I did some benchmarking of the operation A'A for square matrices, and on my machine gemm is only faster for matrices smaller than 20 (and the difference is tiny), so maybe a cutoff at 500 is too large.

ViralBShah commented 11 years ago

@andreasnoackjensen Could you post your perf test code? Perhaps all of us can run it, and then we can just pick a new threshold that takes symmetrize! into account.

@xianyi Thanks for the note on syrk.

andreasnoack commented 11 years ago

Here is what I did:

simvals = 10:500
resmat = zeros(length(simvals), 3)
for i = 1:length(simvals)
    mA = randn(simvals[i], simvals[i])
    resmat[i,1] = simvals[i]
    resmat[i,2] = median([@elapsed Base.gemm_wrapper('T','N',mA,mA) for j = 1:100])
    resmat[i,3] = median([@elapsed Base.syrk_wrapper('T',mA) for j = 1:100])
end

The timings are very fast and heavily distorted when something else happens on the machine. Therefore I used the median instead of the mean.

StefanKarpinski commented 11 years ago

I usually go even further with benchmarks and use the min, although I've sometimes thought that perhaps that's a bit too generous.

ViralBShah commented 11 years ago

Yes, I agree - I tend to use min for this kind of a thing.

Just to be safe, I think one should do this before running the benchmark.

ENV["OPENBLAS_NUM_THREADS"]="1"

ViralBShah commented 11 years ago

Here's a slightly modified version I used. It turns out that gemm + symmetrize loses any real gain. I think we are better off just calling syrk.

for i = 1:length(simvals)
    mA = randn(simvals[i], simvals[i])
    resmat[i,1] = simvals[i]
    resmat[i,2] = min([@elapsed symmetrize!(Base.gemm_wrapper('T','N',mA,mA)) for j = 1:5])
    resmat[i,3] = min([@elapsed Base.syrk_wrapper('T',mA) for j = 1:5])
end

ViralBShah commented 11 years ago

@andreasnoackjensen If we make A'*A always call syrk or herk, we should be able to avoid the need to call symmetrize!. Could you verify that?

andreasnoack commented 11 years ago

@ViralBShah Blas.syrk and Blas.herk compute only a triangle, but the wrapper functions do a (conjugate) symmetrize! to store the full matrix. Just to avoid confusion:

julia> mA=randn(2,2)
2x2 Float64 Array:
 0.338197   -0.608599
 0.0702311   0.560439

julia> Blas.syrk('U','T',1.0,mA)
2x2 Float64 Array:
 0.119309      -0.166466
 2.13043e-314   0.684485

julia> Base.syrk_wrapper('T',mA)
2x2 Float64 Array:
  0.119309  -0.166466
 -0.166466   0.684485

Using min and running single threaded, I get that Blas.syrk is always faster than Blas.gemm, but Blas.syrk+symmetrize! is faster than Blas.gemm only when n > ~40. However, the difference is so small that I think Blas.syrk+symmetrize! is the better solution, because it ensures symmetry.

By the way, I am unable to make OpenBLAS run single threaded when I set ENV["OPENBLAS_NUM_THREADS"]="1". I need to export OPENBLAS_NUM_THREADS=1 in the terminal before launching Julia.

StefanKarpinski commented 11 years ago

By the way, I am unable to make OpenBLAS run single threaded when I set ENV["OPENBLAS_NUM_THREADS"]="1". I need to export OPENBLAS_NUM_THREADS=1 in the terminal before launching Julia.

Yeah, this seems to be checked at load-time by OpenBLAS, so setting it later has no effect. @xianyi?

andreasnoack commented 11 years ago

I have tried to continue the SymmetricMatrix type that @dmbates laid out earlier. It is on the symmat branch of https://github.com/andreasnoackjensen/julia. As an example:

julia> mA=randn(5,5)
5x5 Float64 Array:
 -0.607973   -1.65633   -0.954783   -0.212449   -0.0746644
  1.05804     0.825478  -0.0415001  -1.04721    -0.680162 
  0.0692246  -1.34573   -0.244366    0.0101776  -0.367226 
 -0.197326    1.33238    0.266142   -0.794161   -1.24781  
  0.112272    0.276713   1.36922    -0.184204    0.920406 

julia> mAs=mA'mA
5x5 Float64 SymmetricMatrix:
  1.54542    1.55539   0.620866  -0.8421    -0.350108
  1.55539    7.08763   2.60951   -1.63536   -1.35147 
  0.620866   2.60951   2.91865   -0.219761   1.1174  
 -0.8421    -1.63536  -0.219761   1.80652    1.54582 
 -0.350108  -1.35147   1.1174     1.54582    3.00722 

julia> eig(mAs)
([0.295804, 0.832826, 1.36709, 4.49762, 9.37209],
5x5 Float64 Array:
 -0.272328   0.581043   0.729841  -0.0367931  -0.23282 
  0.203881   0.263756  -0.402038   0.0628342  -0.85046 
 -0.437102  -0.572683   0.212512   0.565214   -0.341096
 -0.592851   0.464507  -0.490319   0.353324    0.259828
  0.584585   0.221561   0.141857   0.741885    0.196608)

julia> mAs[3,2]
2.609510896179169

julia> mAs[3]
0.6208655135640481

julia> mA=randn(51,51);mAs1=mA'mA;mAs2=Base.gemm_wrapper('T','N',mA,mA);

julia> issym(mAs2)
false

julia> min([@elapsed eig(mAs1) for i = 1:10])
0.0006330013275146484

julia> min([@elapsed eig(mAs2) for i = 1:10])
0.0016489028930664062

I wondered whether it is necessary to have the choice of storing the lower triangle. Many cases could be removed from the code if it weren't.

ViralBShah commented 11 years ago

@andreasnoackjensen Let's move the SymmetricMatrix discussion to a new issue. Alternatively, you could also make it a pull request, while it is being developed, so that people can comment/follow/etc.

xianyi commented 11 years ago

@StefanKarpinski, you are right. OpenBLAS only checks OPENBLAS_NUM_THREADS at load time.

If you want to control the number of threads at runtime, please try calling

    void openblas_set_num_threads(int);
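
From Julia, that could presumably be reached with a ccall following the dlsym pattern used earlier in this thread, assuming the loaded BLAS actually exports this symbol:

ccall(dlsym(Base.libblas, "openblas_set_num_threads"), Void, (Int32,), 1)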

Xianyi

ViralBShah commented 11 years ago

@xianyi I noticed that CPU usage is 370% on my Mac. Does that mean that OpenBLAS is using 4 threads on my dual-core CPU (hyperthreading, perhaps)? Does this affect performance?

staticfloat commented 11 years ago

@ViralBShah: if you're seeing 370%, it means you have at least 4 logical cores on your machine. Hyperthreading will indeed double your logical cores; you can use something like Activity Monitor to see how many threads are running.

StefanKarpinski commented 11 years ago

If a BLAS is doing its job, hyperthreading should not help, since a well-tuned BLAS will keep each CPU's pipeline full, with no bubbles for the hyperthreading multiplexer to exploit.

staticfloat commented 11 years ago

This is something we can test by exporting OPENBLAS_NUM_THREADS=X for X = 1, 2, 4 and checking peakflops (or something more useful as a measuring rod of performance), since I don't think tools like top or Activity Monitor will give useful results here. (I would, but my machines don't have HT.)

xianyi commented 11 years ago

Hi @ViralBShah ,

OpenBLAS can avoid using HT on Linux, which means it does not run threads on the logical cores. However, it does not support this feature on Mac, so you must set the number of threads yourself.

Xianyi

ViralBShah commented 11 years ago

Here's what I get for peakflops(). I would have expected to see 40 GFlops with 2 processors, multiplying 2000x2000 matrices. @xianyi would this be some kind of affinity issue on mac?

OPENBLAS_NUM_THREADS=1 20 GFlops

OPENBLAS_NUM_THREADS=2 32 GFlops

OPENBLAS_NUM_THREADS=3 32 GFlops

OPENBLAS_NUM_THREADS=4 20 GFlops

xianyi commented 11 years ago

@ViralBShah , what is your CPU on your Mac?

ViralBShah commented 11 years ago

Processor Name: Intel Core i5 (OpenBLAS compiles the Sandy Bridge kernel)
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 8 GB

xianyi commented 11 years ago

@ViralBShah, the peak performance is

2.4 GHz * 2 pipelines * 4-wide SIMD * 2 cores = 38.4 GFLOPS

Thus, the efficiency of OpenBLAS DGEMM here is about 83%. In our experiments, the efficiency is about 89% on mainstream Sandy Bridge CPUs.
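
As a quick sanity check of that arithmetic:

peak = 2.4 * 2 * 4 * 2   # GHz * pipelines * SIMD width * cores = 38.4 GFLOPS
32.0 / peak              # observed 32 GFLOPS gives efficiency of about 0.83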

Could you try larger matrices? For example, 5000 x 5000.

Xianyi

ViralBShah commented 11 years ago

@xianyi I cannot get more than 32 GFLOPS, even with 5000x5000. Perhaps the laptop version of the CPU has different characteristics (I am on a MacBook Pro)?

JeffBezanson commented 11 years ago

OK, so can we just remove the size cutoff and always call syrk and herk?

ViralBShah commented 11 years ago

Yes