certik opened this issue 3 years ago
The Clang results hint that gemm_nkm is probably exactly written to be optimal and to vectorize over m:
for (long n = 0; n < N; n++) {
    for (long k = 0; k < K; k++) {
        for (long m = 0; m < M; m++) {
            C[m + n*M] += A[m + k*M] * B[k + n*K];
        }
    }
}
In this inner loop, vectorized over m, we have one memory write (C[m + n*M]), one memory read (A[m + k*M]), and one fma; everything else is a constant. On the M1: a memory write costs 0.25 cycles per double, a memory read 0.1666 cycles per double, and an fma 0.125 cycles per double. So if we only had the inner loop, the bottleneck would be the memory write. However, the writes to C[m + n*M] can be reused in the loop over k. The same goes for the memory reads: each array is touched by only two of the three loops, while the fma is executed in all three.
Another way to look at it: the complexity is O(N^3), so we have N^3 fmas, but only N^2 memory writes (C) and 2*N^2 memory reads (A and B).
So for large enough N the bottleneck is actually only the fma.
The theoretical peak performance for the loop body on M1 is thus 0.125 clock cycles per double.
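As a quick sanity check (assuming N = 64 and the 3.2 GHz clock used below), the fma-bound limit works out to roughly 10.2 μs for the whole multiplication, which is what the measurement below should be compared against:
julia> 0.125 * 64^3 / 3.2e9 # fma-bound lower limit, in seconds
1.024e-5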
I get fairly close to this on the M1:
julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 11.083 μs … 22.625 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 11.167 μs ┊ GC (median): 0.00%
Time (mean ± σ): 11.189 μs ± 306.488 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁█
▂▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▅▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▂ ▂
11.1 μs Histogram: frequency by time 11.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.2e9 # Hz (3.2 GHz)
3.2e9
julia> 11.083e-6 * cpu_freq / N^3
0.13529052734375
julia> versioninfo()
Julia Version 1.8.0-DEV.901
Commit 47255f9fe8* (2021-11-06 19:56 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.1.0)
CPU: Apple M1
Rosetta cannot use FMA instructions (it presents a Westmere-like x86 CPU, which lacks them), so it is limited to at best half of the achievable peak performance.
Also, with LV, you can write the loops in any order you'd like, and it should interchange them as necessary.
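For instance (a minimal sketch, not code from this thread; the function name and the deliberately awkward m-n-k source order are only for illustration), LV should still turn something like the following into the vectorized-over-m, unrolled kernel shown later:
using LoopVectorization

# Loops written in the m-n-k order that Clang handles poorly; @turbo is
# expected to interchange, unroll, and vectorize them as it sees fit.
# This accumulates into C (C += A*B), like the kernels discussed here.
function gemm_mnk_turbo!(C, A, B)
    @turbo for m in axes(C, 1), n in axes(C, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] += Cmn
    end
    return C
end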
The Clang-generated assembly looks pretty good. But it's still slow. 0.135 is decent; that is 92% of peak.
I thought I got 100% of the peak on Intel hardware about 5 years ago. However, the matrices might have been larger. For smaller matrices (and 64, I think, is too small) I don't get the peak.
Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.
Here is the inner loop LV generates for Neon:
L328:
ld1r { v10.2d }, [x20], #8
ldp q0, q11, [x15]
ldp q12, q13, [x15, #32]
ldr d14, [x1, x7]
add x15, x15, x6
fmla v29.2d, v10.2d, v0.2d
fmla v24.2d, v10.2d, v11.2d
fmla v17.2d, v10.2d, v12.2d
fmla v2.2d, v10.2d, v13.2d
ldr d10, [x1, x9]
fmla v30.2d, v0.2d, v14.d[0]
fmla v25.2d, v11.2d, v14.d[0]
fmla v18.2d, v12.2d, v14.d[0]
fmla v4.2d, v13.2d, v14.d[0]
ldr d14, [x1, x10]
fmla v31.2d, v0.2d, v10.d[0]
fmla v26.2d, v11.2d, v10.d[0]
fmla v19.2d, v12.2d, v10.d[0]
fmla v5.2d, v13.2d, v10.d[0]
ldr d10, [x1, x12]
fmla v8.2d, v0.2d, v14.d[0]
fmla v27.2d, v11.2d, v14.d[0]
fmla v20.2d, v12.2d, v14.d[0]
fmla v6.2d, v13.2d, v14.d[0]
ldr d14, [x1, x14]
fmla v9.2d, v0.2d, v10.d[0]
fmla v28.2d, v11.2d, v10.d[0]
fmla v21.2d, v12.2d, v10.d[0]
fmla v7.2d, v13.2d, v10.d[0]
cmp x15, x11
fmla v23.2d, v0.2d, v14.d[0]
fmla v22.2d, v11.2d, v14.d[0]
fmla v16.2d, v12.2d, v14.d[0]
fmla v3.2d, v13.2d, v14.d[0]
mov x1, x20
b.ls L328
It has 24 fmas, 10 loads, and 0 stores.
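A rough per-iteration bound follows from the M1 throughputs quoted earlier (0.125 cycles per double for fma and 0.1666 for loads, i.e. roughly four 2-wide fmas and three 128-bit loads per cycle); the fma pipes are clearly the binding resource:
julia> 24 / 4 # cycles the fma pipes need per iteration
6.0

julia> 10 / 3 # cycles the load pipes need per iteration
3.3333333333333335

julia> (24 / 4) / (24 * 2) # fma-bound cycles per updated C element
0.125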
> The Clang-generated assembly looks pretty good. But it's still slow. 0.135 is decent; that is 92% of peak.
> I thought I got 100% of the peak on Intel hardware about 5 years ago. However, the matrices might have been larger. For smaller matrices (and 64, I think, is too small) I don't get the peak.
> Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.
Getting close to the peak is very easy on Tiger Lake, e.g. using 72x72 matrices:
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(AmulB!, 100_000, C0, A, B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 4.70e+09 100.0% # 4.2 cycles per ns
└ task-clock 1.11e+09 100.0% # 1.1 s
┌ instructions 8.13e+09 100.0% # 1.7 insns per cycle
│ branch-instructions 1.78e+08 100.0% # 2.2% of instructions
└ branch-misses 2.50e+06 100.0% # 1.4% of branch instructions
┌ L1-dcache-load-misses 5.48e+08 100.0% # 21.0% of dcache loads
│ L1-dcache-loads 2.61e+09 100.0%
│ cache-misses 1.73e+03 100.0% # 72.7% of cache references
└ cache-references 2.38e+03 100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> 4.7e9 / (100_000 * 72^3)
0.12592163923182442
Or using C[m,n] += instead:
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(AmulBplusC!, 100_000, C0, A, B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 4.74e+09 100.0% # 4.2 cycles per ns
└ task-clock 1.12e+09 100.0% # 1.1 s
┌ instructions 8.19e+09 100.0% # 1.7 insns per cycle
│ branch-instructions 1.78e+08 100.0% # 2.2% of instructions
└ branch-misses 2.50e+06 100.0% # 1.4% of branch instructions
┌ L1-dcache-load-misses 5.51e+08 100.0% # 20.7% of dcache loads
│ L1-dcache-loads 2.67e+09 100.0%
│ cache-misses 1.79e+03 100.0% # 68.8% of cache references
└ cache-references 2.60e+03 100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> 4.74e9 / (100_000 * 72^3)
0.12699331275720166
As before, Tiger Lake is very starved for FMA units, hence it's easy to get bottlenecked there and thus achieve nearly peak FMA-unit utilization. But this is about as close as I can get.
The assembly:
L480:
vbroadcastsd zmm29, qword ptr [r8]
vmovupd zmm30, zmmword ptr [rdx]
vmovupd zmm31, zmmword ptr [rdx + 64]
vmovupd zmm28, zmmword ptr [rdx + 128]
prefetcht0 byte ptr [rdx + rbx]
prefetcht0 byte ptr [rdx + rbx + 64]
prefetcht0 byte ptr [rdx + rbx + 128]
vfmadd231pd zmm19, zmm30, zmm29 # zmm19 = (zmm30 * zmm29) + zmm19
vfmadd231pd zmm10, zmm31, zmm29 # zmm10 = (zmm31 * zmm29) + zmm10
vbroadcastsd zmm0, qword ptr [r8 + r14]
vfmadd231pd zmm1, zmm28, zmm29 # zmm1 = (zmm28 * zmm29) + zmm1
vfmadd231pd zmm20, zmm30, zmm0 # zmm20 = (zmm30 * zmm0) + zmm20
vfmadd231pd zmm11, zmm31, zmm0 # zmm11 = (zmm31 * zmm0) + zmm11
vfmadd231pd zmm2, zmm28, zmm0 # zmm2 = (zmm28 * zmm0) + zmm2
vbroadcastsd zmm0, qword ptr [r8 + 2*r14]
vfmadd231pd zmm21, zmm30, zmm0 # zmm21 = (zmm30 * zmm0) + zmm21
vfmadd231pd zmm12, zmm31, zmm0 # zmm12 = (zmm31 * zmm0) + zmm12
vfmadd231pd zmm3, zmm28, zmm0 # zmm3 = (zmm28 * zmm0) + zmm3
vbroadcastsd zmm0, qword ptr [r8 + r13]
vfmadd231pd zmm22, zmm30, zmm0 # zmm22 = (zmm30 * zmm0) + zmm22
vfmadd231pd zmm13, zmm31, zmm0 # zmm13 = (zmm31 * zmm0) + zmm13
vfmadd231pd zmm4, zmm28, zmm0 # zmm4 = (zmm28 * zmm0) + zmm4
vbroadcastsd zmm0, qword ptr [r8 + 4*r14]
vfmadd231pd zmm23, zmm30, zmm0 # zmm23 = (zmm30 * zmm0) + zmm23
vfmadd231pd zmm14, zmm31, zmm0 # zmm14 = (zmm31 * zmm0) + zmm14
vfmadd231pd zmm5, zmm28, zmm0 # zmm5 = (zmm28 * zmm0) + zmm5
vbroadcastsd zmm0, qword ptr [r8 + r15]
vfmadd231pd zmm24, zmm30, zmm0 # zmm24 = (zmm30 * zmm0) + zmm24
vfmadd231pd zmm15, zmm31, zmm0 # zmm15 = (zmm31 * zmm0) + zmm15
vfmadd231pd zmm6, zmm28, zmm0 # zmm6 = (zmm28 * zmm0) + zmm6
vbroadcastsd zmm0, qword ptr [r8 + 2*r13]
vfmadd231pd zmm25, zmm30, zmm0 # zmm25 = (zmm30 * zmm0) + zmm25
vfmadd231pd zmm16, zmm31, zmm0 # zmm16 = (zmm31 * zmm0) + zmm16
vbroadcastsd zmm29, qword ptr [r8 + rax]
vfmadd231pd zmm7, zmm28, zmm0 # zmm7 = (zmm28 * zmm0) + zmm7
vfmadd231pd zmm26, zmm30, zmm29 # zmm26 = (zmm30 * zmm29) + zmm26
vfmadd231pd zmm17, zmm31, zmm29 # zmm17 = (zmm31 * zmm29) + zmm17
vfmadd231pd zmm8, zmm28, zmm29 # zmm8 = (zmm28 * zmm29) + zmm8
vbroadcastsd zmm0, qword ptr [r8 + 8*r14]
vfmadd231pd zmm27, zmm0, zmm30 # zmm27 = (zmm0 * zmm30) + zmm27
vfmadd231pd zmm18, zmm0, zmm31 # zmm18 = (zmm0 * zmm31) + zmm18
vfmadd231pd zmm9, zmm28, zmm0 # zmm9 = (zmm28 * zmm0) + zmm9
add rdx, r10
add r8, 8
cmp rdx, rsi
jbe L480
This is 27 fmas, 12 loads (+3 prefetches), and 0 stores.
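The same back-of-the-envelope bound, assuming Tiger Lake executes one 512-bit fma and two loads per cycle (consistent with the 0.125 cycles-per-double peak used below), again puts the fma port firmly in charge:
julia> 27 / 1 # cycles the fma port needs per iteration
27.0

julia> 12 / 2 # cycles the load ports need per iteration
6.0

julia> 27 / (27 * 8) # fma-bound cycles per updated C element
0.125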
Doubling the FMA units on Cascade Lake (and sacrificing a lot of out-of-order capability, a bit of cache size, etc.):
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(AmulBplusC!, 100_000, C0, A, B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 2.51e+09 49.9% # 3.6 cycles per ns
┌ instructions 8.24e+09 75.0% # 3.3 insns per cycle
│ branch-instructions 1.88e+08 75.0% # 2.3% of instructions
└ branch-misses 1.01e+06 75.0% # 0.5% of branch instructions
┌ task-clock 7.01e+08 100.0% # 701.0 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 6.90e+08 25.0% # 25.7% of dcache loads
│ L1-dcache-loads 2.69e+09 25.0%
└ L1-icache-load-misses 2.43e+04 25.0%
┌ dTLB-load-misses 0.00e+00 25.0% # 0.0% of dTLB loads
└ dTLB-loads 2.69e+09 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> 2.51e9 / (100_000 * 72^3)
0.06724751371742113
We're at about 93% instead of the 98%+ on Tiger Lake: with twice the FMA throughput the peak becomes 0.0625 cycles per double, and 0.0625/0.0672 ≈ 0.93.
> The Clang results hint that gemm_nkm is probably exactly written to be optimal and to vectorize over m:
The innermost loop in the two assembly examples I posted above was in fact the k loop, while both the m and n loops were unrolled.
LV will "canonicalize" to produce this code regardless of the initial loop order.
The approach is to:
1) Hoist the C loads and stores out of the loop.
2) Unroll m and also vectorize it.
3) Unroll n and also vectorize it.
For each factor by which n is unrolled, you can reuse a load from A[m,k].
For each factor by which m is unrolled, you can reuse a load from B[k,n].
Hence, unrolling both of these loops is a great means of maximizing load-reuse, and thus increasing the fma-to-load ratio.
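As a schematic illustration (scalar Julia with a hypothetical 2x2 tile; LV actually vectorizes m and uses much larger unroll factors), the reuse pattern looks like this:
# Hypothetical 2x2 register-tile micro-kernel, purely for illustration:
# each load from A is reused by 2 fmas (across n and n+1) and each load
# from B by 2 fmas (across m and m+1), so larger tiles raise the
# fma-to-load ratio.
function tile2x2(A, B, m, n, K, c11, c21, c12, c22)
    @inbounds for k in 1:K
        a1 = A[m, k];  a2 = A[m + 1, k]
        b1 = B[k, n];  b2 = B[k, n + 1]
        c11 = muladd(a1, b1, c11);  c21 = muladd(a2, b1, c21)
        c12 = muladd(a1, b2, c12);  c22 = muladd(a2, b2, c22)
    end
    return c11, c21, c12, c22
end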
In both assembly examples above, the m unrolling happens first (with all the loads from A), and then the unrolled code "iteratively" loads/broadcasts scalars from B.
We are of course limited by the number of architecturally named vector registers (the number of physical vector registers is in the hundreds on each of these architectures) in how much unrolling we can actually do without spilling registers, which would of course defeat the point of reducing the number of loads and stores.
> I thought I got 100% of the peak on Intel hardware about 5 years ago. However, the matrices might have been larger. For smaller matrices (and 64, I think, is too small) I don't get the peak.
> Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.
LV currently doesn't do cache-level tiling, which is necessary for good performance outside of the L2 cache.
The best I seem to be able to do (Tiger Lake) is around the 144-square mark:
julia> M = K = N = 144; A = rand(M,K); B = rand(K,N); C1 = A*B; C0 = similar(C1);
julia> C0 .= 0;
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(AmulBplusC!, 10_000, C0, A, B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 3.76e+09 100.0% # 4.2 cycles per ns
└ task-clock 8.91e+08 100.0% # 891.5 ms
┌ instructions 6.45e+09 100.0% # 1.7 insns per cycle
│ branch-instructions 1.40e+08 100.0% # 2.2% of instructions
└ branch-misses 9.73e+05 100.0% # 0.7% of branch instructions
┌ L1-dcache-load-misses 4.92e+08 100.0% # 23.4% of dcache loads
│ L1-dcache-loads 2.10e+09 100.0%
│ cache-misses 5.42e+03 100.0% # 22.6% of cache references
└ cache-references 2.40e+04 100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> 3.76e9 / (10_000 * 144^3)
0.12592163923182442
julia> 0.125/ans
0.9926808510638298
While the amount of memory occupied by the matrices is O(N^2), the amount of memory traffic needed without cache-level tiling is O(N^3), so memory bandwidth becomes the dominating factor for performance at large sizes.
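For what it's worth, one level of cache blocking wrapped around a @turbo micro-kernel might look roughly like the sketch below (block sizes are arbitrary and untuned, a real implementation would also pack panels of A and B, and this is not what LV does internally):
using LoopVectorization

# One level of cache blocking; accumulates into C like AmulBplusC!.
function blocked_AmulBplusC!(C, A, B; mc = 96, kc = 96, nc = 96)
    M, N = size(C); K = size(A, 2)
    for nb in 1:nc:N, kb in 1:kc:K, mb in 1:mc:M
        ms = mb:min(mb + mc - 1, M)
        ks = kb:min(kb + kc - 1, K)
        ns = nb:min(nb + nc - 1, N)
        @turbo for n in ns, m in ms
            Cmn = zero(eltype(C))
            for k in ks
                Cmn += A[m, k] * B[k, n]
            end
            C[m, n] += Cmn
        end
    end
    return C
end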
Thanks for the analysis. I mean, at the end of the day, over 90% peak is excellent and you are getting it everywhere.
For example on the M1, Clang gets about 60% of peak if I help it (gemm_nkm), or about 7% of peak without help (gemm_mnk). My own Julia+LV code, because it runs via Rosetta, gets 43%, but it does so no matter how the loops are written. And on your machine, when compiled properly, it runs at 92% of peak.
I would say that if LV consistently delivers over 90% of peak, that's very good. It means that if somebody comes along and creates a library that runs at 100% of peak, it will only be about 1.1x (10%) faster.
It'd be nice to visualize this to make it clear. If you plot the clock counts, the peak is a line at the bottom and everything else sits above it, so you can see how far off each benchmark is.
See #355 for the background. I am now applying the same methodology to the matrix multiplication benchmark:
I modified the kernel to not zero C, to simplify things a bit. In C++ I tested two kernels:
For N=64, I am getting:
Aren't these two kernels equivalent? They seem to operate on the same matrix layout and return exactly the same answer; they just differ in loop ordering.
The inner loop from gemm_nkm as generated by Clang seems to be: