jiahao opened this issue 9 years ago (status: Open)
Julia's JIT chose GEMV (I saw it in the profiler) for my pure-Julia GEMM equivalent with two identical coefficients, and it ran in almost the same time as a direct gemv call. BLAS DGEMM is no longer in a position to optimize this case internally, since it has no way of knowing that the two values were derived from identical input strings. By calling BLAS primitives directly, you simply deny the power of Julia's JIT, which does make the correct choices and runs your task with the best BLAS primitive for each case.
1.598799 seconds
0.589480 seconds
1.213341 seconds
0.614269 seconds (8.00 k allocations: 30.884 MB, 0.57% gc time)
Consider the following Julia code (using Julia 0.4-dev):
On @andreasnoack's machine, a MacBook Pro with an i7-4870HQ CPU, GEMM is 4 times slower than GEMV:
On my machine, a MacBook Pro with an i5-4258U, I get similar behavior, but the AXPY equivalent is also the fastest of the three computations:
I find the relative performance behaviors surprising.
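
For anyone who wants to try a comparison of this kind, here is a minimal sketch. It is not the code from the original post: the problem size `n`, the variable names, and the choice of a matrix times an n×1 operand are all assumptions made for illustration. It is written against Julia 1.x, where the BLAS wrappers live in `LinearAlgebra`, rather than 0.4-dev.

```julia
using LinearAlgebra  # BLAS wrappers (gemm!, gemv!, axpy!) in Julia 1.x

n = 1000                  # hypothetical size, not taken from the original post
A = rand(n, n)
x = rand(n)
X = reshape(x, n, 1)      # the same data viewed as an n×1 matrix

# GEMM: C = A * X, treating the right-hand operand as a matrix
C = zeros(n, 1)
BLAS.gemm!('N', 'N', 1.0, A, X, 0.0, C)

# GEMV: y = A * x, treating the right-hand operand as a vector
y = zeros(n)
BLAS.gemv!('N', 1.0, A, x, 0.0, y)

# AXPY equivalent: accumulate z += x[j] * A[:, j] one column at a time
z = zeros(n)
for j in 1:n
    BLAS.axpy!(x[j], view(A, :, j), z)
end
```

All three compute the same product, so any timing difference comes from which BLAS routine does the work. To compare them, wrap each call in `@time` after one warm-up run so compilation is not included in the measurement.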