OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

GEMM slower than GEMV slower than AXPY equivalent on Intel i5 CPU #528

Open jiahao opened 9 years ago

jiahao commented 9 years ago

Consider the following Julia code (using Julia 0.4-dev):

let
    B = randn(1000, 1000)
    v = randn(1000)
    y = randn(1000)
    sB = [slice(B, :, j) for j = 1:size(B, 2)]

    @time for i = 1:1000; BLAS.gemm!('N', 'N', 1.0, B, v, 1.0, y); end
    @time for i = 1:1000; BLAS.gemv!('N', 1.0, B, v, 1.0, y); end
    @time for i = 1:1000
        for j = 1:size(B, 2)
            BLAS.axpy!(v[j], sB[j], y)
        end
    end
end

On @andreasnoack's machine, a MacBook Pro with an i7-4870HQ CPU, GEMM is about 4 times slower than GEMV:

elapsed time: 1.084909686 seconds (0 bytes allocated)
elapsed time: 0.2644927 seconds (0 bytes allocated)
elapsed time: 0.321705553 seconds (0 bytes allocated)

On my machine, a MacBook Pro with an i5-4258U, I see similar behavior, except that the AXPY equivalent is the fastest of the three computations:

elapsed time: 1.693223657 seconds (0 bytes allocated)
elapsed time: 0.818590556 seconds (0 bytes allocated)
elapsed time: 0.715702898 seconds (0 bytes allocated)

I find these relative performance results surprising.
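For reference, here is a minimal sketch (written against a modern Julia with the `LinearAlgebra` stdlib rather than the 0.4-era `Base.BLAS` used above, and with `view` in place of the deprecated `slice`) checking that the three calls compute the same matrix-vector product, so the timing gaps come purely from which BLAS kernel runs:

```julia
using LinearAlgebra  # assumption: modern Julia; the issue itself used Julia 0.4

n = 100
B = randn(n, n)
v = randn(n)

# GEMM path: treat v and y as n-by-1 matrices (reshape of an Array shares memory).
y1 = zeros(n)
BLAS.gemm!('N', 'N', 1.0, B, reshape(v, n, 1), 1.0, reshape(y1, n, 1))

# GEMV path: the natural matrix-vector primitive.
y2 = zeros(n)
BLAS.gemv!('N', 1.0, B, v, 1.0, y2)

# AXPY path: accumulate one scaled column of B per step.
y3 = zeros(n)
for j in 1:n
    BLAS.axpy!(v[j], view(B, :, j), y3)
end

@assert y1 ≈ y2 && y2 ≈ y3  # all three primitives agree numerically
```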

brada4 commented 8 years ago

Julia's JIT chose GEMV (I saw it in the profiler) for my pure-Julia GEMM equivalent with the two coefficients identical, and it came out at almost the same time as the direct gemv call. BLAS DGEMM is in no position to optimize this case inside GEMM, as it has no way of knowing that two of its arguments were derived from identical inputs. By calling BLAS primitives directly you simply bypass the power of Julia's JIT, which does make the correct choices and plays back your task using the best BLAS primitive for each case.
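As a small sketch of that point (assuming a modern Julia; the exact dispatch path in 0.4 may differ): writing the product as `B * v` lets Julia select the matrix-vector primitive itself, and the result matches a direct gemv! call:

```julia
using LinearAlgebra  # assumption: modern Julia, not the 0.4 used in this thread

n = 100
B = randn(n, n)
v = randn(n)

y_mul  = B * v  # dense Matrix * Vector dispatches to a BLAS gemv kernel internally
y_blas = BLAS.gemv!('N', 1.0, B, v, 0.0, zeros(n))

@assert y_mul ≈ y_blas  # same computation, with the primitive chosen automatically
```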

1.598799 seconds
0.589480 seconds
1.213341 seconds
0.614269 seconds (8.00 k allocations: 30.884 MB, 0.57% gc time)