JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.

Matrix Multiplication benchmark analysis #356

Open certik opened 3 years ago

certik commented 3 years ago

See #355 for the background. I am now applying the same methodology to the matrix multiplication benchmark:

I modified the kernel to not zero C, to simplify things a bit:

julia> N = 64
64

julia> A = rand(N,N); B = rand(N,N); C = rand(N,N);

julia> function A_mul_B2!(C, A, B)
           @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
               for k ∈ indices((A,B), (2,1))
                   C[m,n] += A[m,k] * B[k,n]
               end
           end
       end
A_mul_B2! (generic function with 1 method)

julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.708 μs …  47.333 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.294 μs ± 717.909 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▂  ▁                        ▂  ▆   █  ▆  ▄  ▃  ▁       ▂
  ▃▁▁█▁▁█▁▁█▁▁▇▁▁▅▁▁▁▄▁▁▄▁▁▃▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁█▁▁█▁▁█▁▁█▁▁▆▁▁▄ █
  23.7 μs       Histogram: log(frequency) by time      24.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> cpu_freq = 3.2e9 # Ghz
3.2e9

julia> 23.708e-6 * cpu_freq / N^3
0.289404296875
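
For reference, this figure is the benchmark's minimum time converted to clock cycles and divided by the N^3 fused multiply-adds, i.e. clock cycles per fma. The same arithmetic as a small helper (the function name is mine, not from the thread):

cycles_per_fma(time_s, freq_hz, N) = time_s * freq_hz / N^3

cycles_per_fma(23.708e-6, 3.2e9, 64)   # ≈ 0.289 clock cycles per fma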

In C++ I tested two kernels:

void gemm_mnk(double* C, double* A, double* B, long M, long K, long N){
  for (long m = 0; m < M; m++){
    for (long n = 0; n < N; n++){
      for (long k = 0; k < K; k++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}

void gemm_nkm(double* C, double* A, double* B, long M, long K, long N){
  for (long n = 0; n < N; n++){
    for (long k = 0; k < K; k++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}

For N=64, I am getting (clock cycles per fma, computed the same way as above):

gemm_mnk: 1.78319
gemm_nkm: 0.21441

Aren't these two kernels equivalent? They seem to operate on the same matrix layout and return exactly the same answer; they differ only in loop ordering.

The inner loop from gemm_nkm as generated by Clang seems to be:

LBB7_5:                                 ;   Parent Loop BB7_3 Depth=1
                                        ;     Parent Loop BB7_4 Depth=2
                                        ; =>    This Inner Loop Header: Depth=3
    ldr d0, [x13, x14]
    ldp q1, q2, [x4, #-256]
    fmla.2d v10, v1, v0[0]
    fmla.2d v9, v2, v0[0]
    ldp q1, q2, [x4, #-224]
    fmla.2d v8, v1, v0[0]
    fmla.2d v31, v2, v0[0]
    ldp q1, q2, [x4, #-192]
    fmla.2d v30, v1, v0[0]
    fmla.2d v29, v2, v0[0]
    ldp q1, q2, [x4, #-160]
    fmla.2d v28, v1, v0[0]
    fmla.2d v27, v2, v0[0]
    ldp q1, q2, [x4, #-128]
    fmla.2d v26, v1, v0[0]
    fmla.2d v25, v2, v0[0]
    ldp q1, q2, [x4, #-96]
    fmla.2d v24, v1, v0[0]
    fmla.2d v23, v2, v0[0]
    ldp q1, q2, [x4, #-64]
    fmla.2d v22, v1, v0[0]
    fmla.2d v21, v2, v0[0]
    ldp q1, q2, [x4, #-32]
    fmla.2d v20, v1, v0[0]
    fmla.2d v19, v2, v0[0]
    ldp q1, q2, [x4]
    fmla.2d v18, v1, v0[0]
    fmla.2d v17, v2, v0[0]
    ldp q1, q2, [x4, #32]
    fmla.2d v16, v1, v0[0]
    fmla.2d v7, v2, v0[0]
    ldp q1, q2, [x4, #64]
    fmla.2d v6, v1, v0[0]
    fmla.2d v5, v2, v0[0]
    ldp q1, q2, [x4, #96]
    fmla.2d v4, v1, v0[0]
    ldp q1, q3, [sp, #96]               ; 32-byte Folded Reload
    fmla.2d v1, v2, v0[0]
    str q1, [sp, #96]                   ; 16-byte Folded Spill
    ldp q1, q2, [x4, #128]
    fmla.2d v3, v1, v0[0]
    ldr q1, [sp, #128]                  ; 16-byte Folded Reload
    fmla.2d v1, v2, v0[0]
    stp q3, q1, [sp, #112]              ; 32-byte Folded Spill
    ldp q1, q2, [x4, #160]
    fmla.2d v12, v1, v0[0]
    fmla.2d v11, v2, v0[0]
    ldp q1, q2, [x4, #192]
    fmla.2d v13, v1, v0[0]
    fmla.2d v15, v2, v0[0]
    ldp q1, q2, [x4, #224]
    fmla.2d v14, v1, v0[0]
    ldr q1, [sp, #144]                  ; 16-byte Folded Reload
    fmla.2d v1, v2, v0[0]
    str q1, [sp, #144]                  ; 16-byte Folded Spill
    add x14, x14, #8                    ; =8
    add x4, x4, #512                    ; =512
    cmp x14, #512                       ; =512
    b.ne    LBB7_5
certik commented 3 years ago

Theoretical Peak Performance

The Clang results hint that gemm_nkm is probably written in exactly the right loop order to be optimal and to vectorize over m:

  for (long n = 0; n < N; n++){
    for (long k = 0; k < K; k++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }

In this inner loop, vectorized over m, we have one memory write (C[m + n*M]), one memory read (A[m + k*M]), and one fma; everything else is constant over m. On the M1, a memory write costs 0.25 clock cycles per double, a memory read 0.1666 cycles per double, and an fma 0.125 cycles per double. So if we only had the inner loop, the bottleneck would be the memory writes. However, the write to C[m + n*M] does not depend on k, so it can be hoisted out of and reused across the k loop. The same goes for the memory reads: each array is indexed by only two of the three loops, while the fma is executed in all three.

Another way to look at it: the complexity is O(N^3), so we have N^3 fmas, but only N^2 memory writes (C) and 2*N^2 memory reads (A and B).

So the bottleneck is actually only fma for large enough N.

The theoretical peak performance for the loop body on M1 is thus 0.125 clock cycles per double.
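
To make the counting above concrete, here is a minimal sketch of the same arithmetic (using the M1 per-double throughput figures quoted above and assuming perfect reuse of A, B, and C):

N = 64
fma_cycles   = 0.125 * N^3       # one fma per (m, k, n) triple      -> 32768 cycles
write_cycles = 0.25  * N^2       # C written once per (m, n) element -> 1024 cycles
read_cycles  = (1/6) * 2 * N^2   # A and B each read once per element, ideally -> ~1365 cycles
# fma_cycles dwarfs write_cycles + read_cycles, so fma throughput is the bound.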

chriselrod commented 3 years ago

The theoretical peak performance for the loop body on M1 is thus 0.125 clock cycles per double.

I get fairly close to this on the M1:

julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.083 μs …  22.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.167 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.189 μs ± 306.488 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▁█                                     
  ▂▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▅▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  11.1 μs         Histogram: frequency by time         11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> cpu_freq = 3.2e9 # Ghz
3.2e9

julia> 11.083e-6 * cpu_freq / N^3
0.13529052734375

julia> versioninfo()
Julia Version 1.8.0-DEV.901
Commit 47255f9fe8* (2021-11-06 19:56 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1

Rosetta cannot use FMA instructions, because it only exposes the feature set of Westmere-era x86 CPUs, which do not have them, so it is limited to at best half the achievable performance.

Also, with LV, you can write the loops in any order you'd like, and it should interchange them as necessary.
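
For example (a sketch in the spirit of the kernel above, not taken verbatim from the thread), the same product written with m as the outer loop should be interchanged back to the same schedule:

using LoopVectorization   # provides @turbo and indices

function A_mul_B3!(C, A, B)
    # identical to A_mul_B2! except for the written loop order;
    # LV is expected to interchange/unroll to the same generated code
    @turbo for m ∈ indices((C,A), 1), n ∈ indices((C,B), 2)
        for k ∈ indices((A,B), (2,1))
            C[m,n] += A[m,k] * B[k,n]
        end
    end
end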

certik commented 3 years ago

The Clang-generated assembly looks pretty good. But it's still slow. 0.135 is decent; that is 92% of peak.

I thought I got 100% of the peak on Intel hardware about 5 years ago. However the matrices might have been larger. For smaller matrices (and 64 I think is too small) I don't get the peak.

Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.

chriselrod commented 3 years ago

Here is the inner loop LV generates for Neon:

L328:
    ld1r    { v10.2d }, [x20], #8
    ldp q0, q11, [x15]
    ldp q12, q13, [x15, #32]
    ldr d14, [x1, x7]
    add x15, x15, x6
    fmla    v29.2d, v10.2d, v0.2d
    fmla    v24.2d, v10.2d, v11.2d
    fmla    v17.2d, v10.2d, v12.2d
    fmla    v2.2d, v10.2d, v13.2d
    ldr d10, [x1, x9]
    fmla    v30.2d, v0.2d, v14.d[0]
    fmla    v25.2d, v11.2d, v14.d[0]
    fmla    v18.2d, v12.2d, v14.d[0]
    fmla    v4.2d, v13.2d, v14.d[0]
    ldr d14, [x1, x10]
    fmla    v31.2d, v0.2d, v10.d[0]
    fmla    v26.2d, v11.2d, v10.d[0]
    fmla    v19.2d, v12.2d, v10.d[0]
    fmla    v5.2d, v13.2d, v10.d[0]
    ldr d10, [x1, x12]
    fmla    v8.2d, v0.2d, v14.d[0]
    fmla    v27.2d, v11.2d, v14.d[0]
    fmla    v20.2d, v12.2d, v14.d[0]
    fmla    v6.2d, v13.2d, v14.d[0]
    ldr d14, [x1, x14]
    fmla    v9.2d, v0.2d, v10.d[0]
    fmla    v28.2d, v11.2d, v10.d[0]
    fmla    v21.2d, v12.2d, v10.d[0]
    fmla    v7.2d, v13.2d, v10.d[0]
    cmp x15, x11
    fmla    v23.2d, v0.2d, v14.d[0]
    fmla    v22.2d, v11.2d, v14.d[0]
    fmla    v16.2d, v12.2d, v14.d[0]
    fmla    v3.2d, v13.2d, v14.d[0]
    mov x1, x20
    b.ls    L328

It has 24 fmas, 10 loads, and 0 stores.

chriselrod commented 3 years ago

The Clang-generated assembly looks pretty good. But it's still slow. 0.135 is decent; that is 92% of peak.

I thought I got 100% of the peak on Intel hardware about 5 years ago. However the matrices might have been larger. For smaller matrices (and 64 I think is too small) I don't get the peak.

Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.

Getting close to the peak is very easy on Tiger Lake, e.g. using 72x72 matrices:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.70e+09  100.0%  #  4.2 cycles per ns
└ task-clock               1.11e+09  100.0%  #  1.1 s
┌ instructions             8.13e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.78e+08  100.0%  #  2.2% of instructions
└ branch-misses            2.50e+06  100.0%  #  1.4% of branch instructions
┌ L1-dcache-load-misses    5.48e+08  100.0%  # 21.0% of dcache loads
│ L1-dcache-loads          2.61e+09  100.0%
│ cache-misses             1.73e+03  100.0%  # 72.7% of cache references
└ cache-references         2.38e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 4.7e9 / (100_000 * 72^3)
0.12592163923182442

Or using C[m,n] += instead:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(AmulBplusC!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.74e+09  100.0%  #  4.2 cycles per ns
└ task-clock               1.12e+09  100.0%  #  1.1 s
┌ instructions             8.19e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.78e+08  100.0%  #  2.2% of instructions
└ branch-misses            2.50e+06  100.0%  #  1.4% of branch instructions
┌ L1-dcache-load-misses    5.51e+08  100.0%  # 20.7% of dcache loads
│ L1-dcache-loads          2.67e+09  100.0%
│ cache-misses             1.79e+03  100.0%  # 68.8% of cache references
└ cache-references         2.60e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 4.74e9 / (100_000 * 72^3)
0.12699331275720166

As before, Tiger Lake is very starved for FMA units, hence it's easy to get bottlenecked there and thus achieve nearly peak FMA-unit utilization. But this is about as close as I can get.

The assembly:

L480:
    vbroadcastsd    zmm29, qword ptr [r8]
    vmovupd zmm30, zmmword ptr [rdx]
    vmovupd zmm31, zmmword ptr [rdx + 64]
    vmovupd zmm28, zmmword ptr [rdx + 128]
    prefetcht0  byte ptr [rdx + rbx]
    prefetcht0  byte ptr [rdx + rbx + 64]
    prefetcht0  byte ptr [rdx + rbx + 128]
    vfmadd231pd zmm19, zmm30, zmm29     # zmm19 = (zmm30 * zmm29) + zmm19
    vfmadd231pd zmm10, zmm31, zmm29     # zmm10 = (zmm31 * zmm29) + zmm10
    vbroadcastsd    zmm0, qword ptr [r8 + r14]
    vfmadd231pd zmm1, zmm28, zmm29      # zmm1 = (zmm28 * zmm29) + zmm1
    vfmadd231pd zmm20, zmm30, zmm0      # zmm20 = (zmm30 * zmm0) + zmm20
    vfmadd231pd zmm11, zmm31, zmm0      # zmm11 = (zmm31 * zmm0) + zmm11
    vfmadd231pd zmm2, zmm28, zmm0       # zmm2 = (zmm28 * zmm0) + zmm2
    vbroadcastsd    zmm0, qword ptr [r8 + 2*r14]
    vfmadd231pd zmm21, zmm30, zmm0      # zmm21 = (zmm30 * zmm0) + zmm21
    vfmadd231pd zmm12, zmm31, zmm0      # zmm12 = (zmm31 * zmm0) + zmm12
    vfmadd231pd zmm3, zmm28, zmm0       # zmm3 = (zmm28 * zmm0) + zmm3
    vbroadcastsd    zmm0, qword ptr [r8 + r13]
    vfmadd231pd zmm22, zmm30, zmm0      # zmm22 = (zmm30 * zmm0) + zmm22
    vfmadd231pd zmm13, zmm31, zmm0      # zmm13 = (zmm31 * zmm0) + zmm13
    vfmadd231pd zmm4, zmm28, zmm0       # zmm4 = (zmm28 * zmm0) + zmm4
    vbroadcastsd    zmm0, qword ptr [r8 + 4*r14]
    vfmadd231pd zmm23, zmm30, zmm0      # zmm23 = (zmm30 * zmm0) + zmm23
    vfmadd231pd zmm14, zmm31, zmm0      # zmm14 = (zmm31 * zmm0) + zmm14
    vfmadd231pd zmm5, zmm28, zmm0       # zmm5 = (zmm28 * zmm0) + zmm5
    vbroadcastsd    zmm0, qword ptr [r8 + r15]
    vfmadd231pd zmm24, zmm30, zmm0      # zmm24 = (zmm30 * zmm0) + zmm24
    vfmadd231pd zmm15, zmm31, zmm0      # zmm15 = (zmm31 * zmm0) + zmm15
    vfmadd231pd zmm6, zmm28, zmm0       # zmm6 = (zmm28 * zmm0) + zmm6
    vbroadcastsd    zmm0, qword ptr [r8 + 2*r13]
    vfmadd231pd zmm25, zmm30, zmm0      # zmm25 = (zmm30 * zmm0) + zmm25
    vfmadd231pd zmm16, zmm31, zmm0      # zmm16 = (zmm31 * zmm0) + zmm16
    vbroadcastsd    zmm29, qword ptr [r8 + rax]
    vfmadd231pd zmm7, zmm28, zmm0       # zmm7 = (zmm28 * zmm0) + zmm7
    vfmadd231pd zmm26, zmm30, zmm29     # zmm26 = (zmm30 * zmm29) + zmm26
    vfmadd231pd zmm17, zmm31, zmm29     # zmm17 = (zmm31 * zmm29) + zmm17
    vfmadd231pd zmm8, zmm28, zmm29      # zmm8 = (zmm28 * zmm29) + zmm8
    vbroadcastsd    zmm0, qword ptr [r8 + 8*r14]
    vfmadd231pd zmm27, zmm0, zmm30      # zmm27 = (zmm0 * zmm30) + zmm27
    vfmadd231pd zmm18, zmm0, zmm31      # zmm18 = (zmm0 * zmm31) + zmm18
    vfmadd231pd zmm9, zmm28, zmm0       # zmm9 = (zmm28 * zmm0) + zmm9
    add rdx, r10
    add r8, 8
    cmp rdx, rsi
    jbe L480

This is 27 fmas, 12 loads (+3 prefetches), and 0 stores.

chriselrod commented 3 years ago

Doubling the FMA units on Cascade Lake (and sacrificing a lot of out-of-order capability, a bit of cache size, etc.):

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(AmulBplusC!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.51e+09   49.9%  #  3.6 cycles per ns
┌ instructions             8.24e+09   75.0%  #  3.3 insns per cycle
│ branch-instructions      1.88e+08   75.0%  #  2.3% of instructions
└ branch-misses            1.01e+06   75.0%  #  0.5% of branch instructions
┌ task-clock               7.01e+08  100.0%  # 701.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.90e+08   25.0%  # 25.7% of dcache loads
│ L1-dcache-loads          2.69e+09   25.0%
└ L1-icache-load-misses    2.43e+04   25.0%
┌ dTLB-load-misses         0.00e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.69e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 2.51e9 / (100_000 * 72^3)
0.06724751371742113

We're at about 93% of peak, instead of the 98%+ on Tiger Lake.
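
For reference on where these percentages come from, a back-of-the-envelope (assuming one 512-bit FMA port on this Tiger Lake part and two on Cascade Lake, each port retiring 8 double-precision FMAs per cycle):

# peak cycles per double = 1 / (number of 512-bit FMA ports * 8 doubles per fma)
tigerlake_peak   = 1 / (1 * 8)   # 0.125
cascadelake_peak = 1 / (2 * 8)   # 0.0625

tigerlake_peak / 0.12699331275720166     # ≈ 0.984, the "98%+" on Tiger Lake
cascadelake_peak / 0.06724751371742113   # ≈ 0.929, the "about 93%" here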

chriselrod commented 3 years ago

The Clang results hint that gemm_nkm is probably exactly written to be optimal and to vectorize over m:

The innermost loop in the two assembly examples I posted above was in fact the k loop, while both the m and n loops were unrolled. LV will "canonicalize" to produce this code regardless of the initial loop order.

The approach is to: 1) hoist the C loads and stores out of the k loop, 2) vectorize m and unroll it, and 3) unroll n.

For each factor by which n is unrolled, you can reuse a load from A[m,k]. For each factor by which m is unrolled, you can reuse a load from B[k,n]. Hence, unrolling both of these loops is a great means of maximizing load-reuse, and thus increasing the fma-to-load ratio.

In both assembly examples above, the m unrolling happens first (with all loads from A), and then the unrolled code "iteratively" loads/broadcasts scalars from B.

We are of course limited by the number of architecturally named vector registers (the number of physical vector registers is in the hundreds on each of these architectures) in how much unrolling we can actually do without spilling registers, which would of course defeat the point of reducing the number of loads and stores.
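
To illustrate the load-reuse argument, here is a scalar Julia sketch of a 2×2 register tile (my own illustration, not LV's generated code; it assumes M and N are even and ignores vectorization):

function tile2x2_mul!(C, A, B)
    M, K = size(A)
    N = size(B, 2)
    @inbounds for n in 1:2:N, m in 1:2:M
        # the four accumulators stay in registers; C is touched once per tile
        c11 = C[m, n];     c21 = C[m+1, n]
        c12 = C[m, n+1];   c22 = C[m+1, n+1]
        for k in 1:K
            a1 = A[m, k]; a2 = A[m+1, k]     # each A load reused across the n-unroll
            b1 = B[k, n]; b2 = B[k, n+1]     # each B load reused across the m-unroll
            c11 = muladd(a1, b1, c11); c21 = muladd(a2, b1, c21)
            c12 = muladd(a1, b2, c12); c22 = muladd(a2, b2, c22)
        end
        C[m, n]   = c11;   C[m+1, n]   = c21
        C[m, n+1] = c12;   C[m+1, n+1] = c22
    end
    return C
end

Per k iteration this does 4 fmas against 4 loads; growing the tile (roughly 4 vectors of m by 6 of n on NEON, 3 by 9 with zmm registers) is what yields the 24-fma/10-load and 27-fma/12-load ratios seen in the assembly above.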

chriselrod commented 3 years ago

I thought I got 100% of the peak on Intel hardware about 5 years ago. However the matrices might have been larger. For smaller matrices (and 64 I think is too small) I don't get the peak.

Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.

LV currently doesn't do cache-level tiling, which is necessary for good performance outside of the L2 cache.

The best I seem to be able to do (on Tiger Lake) is around the 144×144 mark:

julia> M = K = N = 144; A = rand(M,K); B = rand(K,N); C1 = A*B; C0 = similar(C1);

julia> C0 .= 0;

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(AmulBplusC!, 10_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.76e+09  100.0%  #  4.2 cycles per ns
└ task-clock               8.91e+08  100.0%  # 891.5 ms
┌ instructions             6.45e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.40e+08  100.0%  #  2.2% of instructions
└ branch-misses            9.73e+05  100.0%  #  0.7% of branch instructions
┌ L1-dcache-load-misses    4.92e+08  100.0%  # 23.4% of dcache loads
│ L1-dcache-loads          2.10e+09  100.0%
│ cache-misses             5.42e+03  100.0%  # 22.6% of cache references
└ cache-references         2.40e+04  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 3.76e9 / (10_000 * 144^3)
0.12592163923182442

julia> 0.125/ans
0.9926808510638298

While the amount of memory occupied by the matrices is O(N^2), the memory traffic required without cache tiling is O(N^3), so memory bandwidth becomes the dominating factor for performance at large sizes.
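
For a rough picture of what such cache-level tiling would add around the kernel from the top of the thread, here is a hedged sketch (the block sizes are placeholders, the dimensions are assumed to divide evenly, and this is not what LV currently emits):

# Reuses A_mul_B2! (the @turbo kernel defined at the top of the thread).
# Each (A block, B block) pair is sized to stay cache-resident while the
# micro-kernel accumulates into the matching block of C.
function tiled_mul!(C, A, B; mc = 96, kc = 96, nc = 96)
    M, K = size(A)
    N = size(B, 2)
    for nb in 1:nc:N, kb in 1:kc:K, mb in 1:mc:M
        A_mul_B2!(view(C, mb:mb+mc-1, nb:nb+nc-1),
                  view(A, mb:mb+mc-1, kb:kb+kc-1),
                  view(B, kb:kb+kc-1, nb:nb+nc-1))
    end
    return C
end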

certik commented 3 years ago

Thanks for the analysis. I mean, at the end of the day, over 90% of peak is excellent, and you are getting it everywhere.

For example, on the M1, Clang gets about 60% of peak if I help it (gemm_nkm), or 7% of peak without help (gemm_mnk). My own Julia+LV run gets 43% because it goes through Rosetta, no matter how I write the loops. And on your machine, compiled natively, it runs at 92% of peak.

I would say that if LV consistently delivers over 90% of peak, that's very good. It means that if somebody comes along and creates a library that runs at 100% of peak, it will only be about 1.1x faster (10%).

It'd be nice to visualize this, to make it clear. If you plot the clock counts, the peak would be a line at the bottom, with everything else above it, and you could see how far each benchmark is from it.