OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

OpenBLAS 'slower' than FOR LOOP for Matrix Multiplication ? #1636

Closed cefengxu closed 5 years ago

cefengxu commented 6 years ago

Comparing the speed of the traditional method (nested loops) and OpenBLAS for matrix multiplication; however, the result I obtained seems rather confusing.

Testing code below:

    #include <stdio.h>
    #include <time.h>
    #include <cblas.h>

    int main(void)
    {
        double start, finish;
        int NUM = 100;
        double a1[3*2] = { 1, 2,      /* CblasRowMajor */
                           3, 4,
                           5, 6 };
        double b[2*3]  = { 7, 8, 9,   /* CblasRowMajor */
                           1, 2, 3 };
        double c[3*3]  = {0,0,0,0,0,0,0,0,0};

        start = clock();
        for (int k = 0; k < NUM; k++)
        {
            for (int i = 0; i < 3; i++)
            {
                for (int j = 0; j < 3; j++)
                {
                    for (int p = 0; p < 2; p++)
                        c[i*3+j] = c[i*3+j] + a1[i*2+p]*b[p*3+j];
                }
            }
        }
        finish = clock();
        /* clock() ticks in CLOCKS_PER_SEC units, not milliseconds */
        printf("FOR LOOP : %f seconds\n", (finish - start) / CLOCKS_PER_SEC);

        start = clock();
        for (int k = 0; k < NUM; k++)
        {
            /* A is 3x2 and B is 2x3, so K must be 2 (the original post passed 5) */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        3, 3, 2, 1.0, a1, 2, b, 3, 0.0, c, 3);
        }
        finish = clock();
        printf("OpenBLAS : %f seconds\n", (finish - start) / CLOCKS_PER_SEC);
        return 0;
    }

I ran the code on a PC and on Android, and got the following results:

// run on PC ( build with gcc )

    FOR LOOP : 0.012000 seconds
    OpenBLAS : 0.186000 seconds

    FOR LOOP : 0.008000 seconds
    OpenBLAS : 0.130000 seconds

// run on ANDROID ( build with clang )

    FOR LOOP : 0.006000 sec
    OpenBLAS : 0.303000 sec

    FOR LOOP : 0.012000 sec
    OpenBLAS : 0.214000 sec

sandwichmaker commented 6 years ago

This is not a fair comparison.

  1. This is a trivially small matrix.
  2. OpenBLAS routines do a bunch of setup and error checking. That has overhead.
  3. Also, in your matrix-multiply loop you expose the size of the matrix to the compiler, which can then completely unroll it. OpenBLAS, on the other hand, is built without knowing the matrix size and optimizes for memory access instead.
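
To make point 1 concrete, here is a minimal sketch (not from the thread) of a fairer comparison: heap-allocated 512x512 matrices, where the O(N^3) arithmetic dwarfs any call overhead. The size and initialization values are arbitrary, and cblas.h is assumed available. Note that clock() sums CPU time across threads, so wall-clock timing would flatter a multithreaded BLAS even more.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    #define N 512

    int main(void)
    {
        double *a = malloc(sizeof(double) * N * N);
        double *b = malloc(sizeof(double) * N * N);
        double *c = calloc(N * N, sizeof(double));
        for (int i = 0; i < N * N; i++) { a[i] = i % 7; b[i] = i % 13; }

        clock_t start = clock();
        for (int i = 0; i < N; i++)          /* naive triple loop */
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int p = 0; p < N; p++)
                    sum += a[i*N + p] * b[p*N + j];
                c[i*N + j] = sum;
            }
        printf("FOR LOOP : %f seconds\n",
               (double)(clock() - start) / CLOCKS_PER_SEC);

        start = clock();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a, N, b, N, 0.0, c, N);
        printf("OpenBLAS : %f seconds\n",
               (double)(clock() - start) / CLOCKS_PER_SEC);

        free(a); free(b); free(c);
        return 0;
    }

At this size the O(N^3) work dominates, and the cache-blocked, vectorized BLAS kernel wins by a wide margin.
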
martin-frbg commented 6 years ago

I suspect using cblas_dgemm instead of dgemm (with the matrix layout adapted to Fortran calling conventions) carries an additional overhead (see https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html). Also, you may want to retry with a current snapshot of the "develop" branch, which should have significantly reduced thread startup times. (From #1632 you were building 0.2.20 yesterday; if you built for generic ARMV8 rather than CORTEXA57, this will have used the unoptimized C kernels. The current 0.3.0 version uses most of the optimized assembly kernels in generic ARMV8 mode as well.)

cefengxu commented 6 years ago

Thanks all, I will try it later and post the results~

cefengxu commented 6 years ago

Referring to "I suspect using cblas_dgemm instead of dgemm": it seems no dgemm API can be found in the develop branch.

I rebuilt from the develop branch and tried again, but the result is still no good. Maybe the matrix should be larger.

martin-frbg commented 6 years ago

You will probably need to call it as dgemm_ , but if your matrix size is really just 3x2 (and in particular in the example you gave above, with the compiler able to unroll the loop) the simple loop will still win. (You should see much less overhead for the thread creation with the develop branch, so the results should be better now than with 0.2.20 - or is that not the case ?)
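
For reference, a minimal sketch of calling the Fortran-convention dgemm_ directly from C with row-major data, using the transpose-swap trick from the article linked above. It assumes the default 32-bit integer interface and the usual trailing-underscore symbol name, both of which can vary by build:

    #include <stdio.h>

    /* Fortran BLAS convention: every argument is passed by reference */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void)
    {
        double a1[3*2] = {1, 2, 3, 4, 5, 6};  /* 3x2, row-major */
        double b[2*3]  = {7, 8, 9, 1, 2, 3};  /* 2x3, row-major */
        double c[3*3]  = {0};
        int m = 3, n = 3, k = 2;
        int lda = 2, ldb = 3, ldc = 3;
        double alpha = 1.0, beta = 0.0;

        /* dgemm_ is column-major; a row-major matrix is its own transpose
           in column-major terms, so compute C^T = B^T * A^T by swapping
           the roles of A and B (and of m and n). */
        dgemm_("N", "N", &n, &m, &k, &alpha, b, &ldb, a1, &lda, &beta, c, &ldc);

        for (int i = 0; i < 3; i++)
            printf("%g %g %g\n", c[i*3], c[i*3+1], c[i*3+2]);
        return 0;
    }

This prints the same row-major product as the original test (first row 9 12 15), while skipping the CBLAS wrapper layer.
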

brada4 commented 6 years ago

"test code" already "optimizes" out scalar constants

    *  DGEMM  performs one of the matrix-matrix operations
    *     C := alpha*op( A )*op( B ) + beta*C,
    *  where  op( X ) is one of
    *     op( X ) = X   or   op( X ) = X',
    *  alpha and beta are scalars, and A, B and C are matrices, with op( A )
    *  an m by k matrix,  op( B )  a  k by n matrix and  C an m by n matrix.
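
To make that concrete, here is a plain-C sketch (illustrative, not OpenBLAS source) of the full per-element work DGEMM performs for the row-major, no-transpose case; the posted loop hardwires alpha = 1 and drops the beta*C term entirely:

    /* C := alpha*A*B + beta*C, row-major, no transposes */
    void naive_dgemm(int m, int n, int k, double alpha,
                     const double *a, const double *b,
                     double beta, double *c)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++)
                    sum += a[i*k + p] * b[p*n + j];
                c[i*n + j] = alpha * sum + beta * c[i*n + j];
            }
    }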

Besides, the compiler knows the input is static and can optimize out the 100-fold loop entirely (icl will do it, maybe clang too), as well as inline and subsequently vectorize the inner loops; neither is possible when the call goes to an external library in another translation unit.

Actually, there are cases where the reference BLAS is faster for small inputs, because it does no parallelism setup whatsoever.
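
If you want the timed loop to survive the optimizer, one common countermeasure is sketched below (the volatile sink and the argc seeding are illustrative choices, not from this thread): derive the inputs from a runtime value and consume the output, so neither the 100-fold loop nor the multiply itself can be constant-folded away.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        (void)argv;
        double a[3*2], b[2*3], c[3*3];
        /* seed the inputs from argc so they are not compile-time
           constants the optimizer can fold */
        for (int i = 0; i < 6; i++) { a[i] = i + argc; b[i] = 2*i + argc; }

        volatile double sink = 0.0;   /* forces the result to be produced */
        clock_t start = clock();
        for (int k = 0; k < 100; k++) {
            memset(c, 0, sizeof c);   /* reset the accumulator each pass */
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    for (int p = 0; p < 2; p++)
                        c[i*3+j] += a[i*2+p] * b[p*3+j];
            sink += c[0];             /* consume the result */
        }
        printf("FOR LOOP : %f seconds (sink=%f)\n",
               (double)(clock() - start) / CLOCKS_PER_SEC, sink);
        return 0;
    }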

martin-frbg commented 5 years ago

You could try https://github.com/giaf/blasfeo if your matrix sizes are always going to be very small. (Still not likely to beat the trivial optimizations available to the compiler in your original example, as has been pointed out already.)