OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

GEMM slower than GEMV slower than AXPY equivalent on Intel i5 CPU #528

Open jiahao opened 9 years ago

jiahao commented 9 years ago

Consider the following Julia code (using Julia 0.4-dev):

let
    B = randn(1000, 1000)
    v = randn(1000)
    y = randn(1000)
    sB = [slice(B, :, j) for j = 1:size(B, 2)]

    @time for i = 1:1000; BLAS.gemm!('N', 'N', 1.0, B, v, 1.0, y); end
    @time for i = 1:1000; BLAS.gemv!('N', 1.0, B, v, 1.0, y); end
    @time for i = 1:1000
        for j = 1:size(B, 2)
            BLAS.axpy!(v[j], sB[j], y)
        end
    end
end

On @andreasnoack's machine, a MacBook Pro with an i7-4870HQ CPU, GEMM is about 4 times slower than GEMV:

elapsed time: 1.084909686 seconds (0 bytes allocated)
elapsed time: 0.2644927 seconds (0 bytes allocated)
elapsed time: 0.321705553 seconds (0 bytes allocated)

On my machine, a MacBook Pro with an i5-4258U, I see similar behavior, except that the AXPY equivalent is the fastest of the three computations:

elapsed time: 1.693223657 seconds (0 bytes allocated)
elapsed time: 0.818590556 seconds (0 bytes allocated)
elapsed time: 0.715702898 seconds (0 bytes allocated)

I find these relative performance results surprising.
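For reference, here is a minimal sketch (written against a modern Julia with the `LinearAlgebra` stdlib rather than the 0.4-era `Base.BLAS` used above, and with `view` in place of the deprecated `slice`) checking that the three calls compute the same matrix-vector product, so the timing gaps come purely from which BLAS kernel runs:

```julia
using LinearAlgebra  # assumption: modern Julia; the issue itself used Julia 0.4

n = 100
B = randn(n, n)
v = randn(n)

# GEMM path: treat v and y as n-by-1 matrices (reshape of an Array shares memory).
y1 = zeros(n)
BLAS.gemm!('N', 'N', 1.0, B, reshape(v, n, 1), 1.0, reshape(y1, n, 1))

# GEMV path: the natural matrix-vector primitive.
y2 = zeros(n)
BLAS.gemv!('N', 1.0, B, v, 1.0, y2)

# AXPY path: accumulate one scaled column of B per step.
y3 = zeros(n)
for j in 1:n
    BLAS.axpy!(v[j], view(B, :, j), y3)
end

@assert y1 ≈ y2 && y2 ≈ y3  # all three primitives agree numerically
```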

brada4 commented 8 years ago

Julia's JIT chose GEMV (I saw it in the profiler) for my pure-Julia GEMM equivalent with the two coefficients identical, and it came out at almost the same time as the direct gemv call. BLAS DGEMM is in no position to optimize this case inside GEMM, as it has no way of knowing that two of its arguments were derived from identical inputs. By calling BLAS primitives directly you simply bypass the power of Julia's JIT, which does make the correct choices and plays back your task using the best BLAS primitive for each case.
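As a small sketch of that point (assuming a modern Julia; the exact dispatch path in 0.4 may differ): writing the product as `B * v` lets Julia select the matrix-vector primitive itself, and the result matches a direct gemv! call:

```julia
using LinearAlgebra  # assumption: modern Julia, not the 0.4 used in this thread

n = 100
B = randn(n, n)
v = randn(n)

y_mul  = B * v  # dense Matrix * Vector dispatches to a BLAS gemv kernel internally
y_blas = BLAS.gemv!('N', 1.0, B, v, 0.0, zeros(n))

@assert y_mul ≈ y_blas  # same computation, with the primitive chosen automatically
```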

1.598799 seconds
0.589480 seconds
1.213341 seconds
0.614269 seconds (8.00 k allocations: 30.884 MB, 0.57% gc time)