Closed cefengxu closed 5 years ago
This is not a fair comparison.
I suspect using cblas_dgemm instead of dgemm (with matrix layout adapted to fortran calling conventions) carries an additional overhead (see https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html ). Also you may want to retry with a current snapshot of the "develop" branch which should have significantly reduced thread startup times. (From #1632 you were building 0.2.20 yesterday, if you built for generic ARMV8 rather than CORTEXA57 this will have used the unoptimized C kernels. Current 0.3.0 version uses most of the optimized assembly kernels in generic ARMV8 mode as well)
Thanks all, I will try it later and report the results here.
Referring to "I suspect using cblas_dgemm instead of dgemm": however, it seems no plain dgemm API can be found in the develop branch.
I rebuilt from the develop branch and tried again, but the result is still not good. Maybe the matrix should be larger.
You will probably need to call it as dgemm_ , but if your matrix size is really just 3x2 (and in particular in the example you gave above, with the compiler able to unroll the loop) the simple loop will still win. (You should see much less overhead for the thread creation with the develop branch, so the results should be better now than with 0.2.20 - or is that not the case ?)
The "test code" already lets the compiler "optimize" out the scalar constants:
* DGEMM performs one of the matrix-matrix operations
* C := alpha*op( A )*op( B ) + beta*C,
* where op( X ) is one of
* op( X ) = X or op( X ) = X',
* alpha and beta are scalars, and A, B and C are matrices, with op( A )
* an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.
Besides, the compiler knows the input is static and can optimize out the 100-fold loop entirely (icl will do it, maybe clang too), as well as inline and then vectorise the inner loops; neither is possible for the external library call on the other side of the comparison.
There are actually cases where the reference BLAS is faster for small inputs, because it does no parallelism setup whatsoever, while the optimized library pays thread-startup cost up front.
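You can measure that parallelism-setup cost directly by pinning OpenBLAS to a single thread before re-running the benchmark; `OPENBLAS_NUM_THREADS` is the standard OpenBLAS environment variable for this (the same effect is available at runtime via `openblas_set_num_threads()`).

```shell
# Force OpenBLAS to run single-threaded, removing thread-pool
# startup cost from the small-matrix comparison:
export OPENBLAS_NUM_THREADS=1
# then re-run the benchmark binary in this same shell
echo "$OPENBLAS_NUM_THREADS"
```

If the single-threaded numbers close most of the gap, the overhead really is thread startup rather than the GEMM kernels themselves.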
You could try https://github.com/giaf/blasfeo if your matrix sizes are always going to be very small. (Still not likely to beat the trivial optimizations possible to the compiler in your original example, as has been pointed out already.)
I am comparing the speed of a traditional (naive loop) method against OpenBLAS for matrix multiplication; however, the results I obtain seem somewhat confusing.
Testing code is below.
I ran the code on a PC and on an Android platform, and got the following results: