Closed: ViralBShah closed this issue 1 year ago
cc @philipturner
Is it possible to perform some of the computations in the Cholesky algorithm as GEMM? Also, to investigate performance, please run:
sudo powermetrics --samplers cpu_power
while testing both.

From the stats so far:
Library | GFLOPS/k | GFLOPS (k=1/3) |
---|---|---|
OpenBLAS | 659 | 215 |
Accelerate | 656 | 220 |
Are you using double precision?
Most of the computation in cholesky should be GEMM. My tests above were double precision.
Could you show the raw execution times for some matrices smaller than 12,000? That size exceeds the system-level cache. To debug this, I want to calculate GFLOPS directly from the raw data. 1024, 2048, and 4096 would suffice.
OpenBLAS Float64:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(n,n); ata = a'a; @time cholesky(ata); end
0.011238 seconds (3 allocations: 8.000 MiB)
0.019787 seconds (3 allocations: 32.000 MiB)
0.151810 seconds (3 allocations: 128.000 MiB, 0.36% gc time)
0.827275 seconds (3 allocations: 512.000 MiB, 0.01% gc time)
Accelerate Float64:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(n,n); ata = a'a; @time cholesky(ata); end
0.005091 seconds (3 allocations: 8.000 MiB)
0.025742 seconds (3 allocations: 32.000 MiB)
0.131629 seconds (3 allocations: 128.000 MiB, 0.45% gc time)
0.900975 seconds (3 allocations: 512.000 MiB, 0.01% gc time)
OpenBLAS Float32:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n)+n*I; ata = a'a; @time cholesky(ata); end
0.003528 seconds (3 allocations: 4.000 MiB)
0.012166 seconds (3 allocations: 16.000 MiB)
0.072055 seconds (3 allocations: 64.000 MiB, 0.20% gc time)
0.415257 seconds (3 allocations: 256.000 MiB, 0.03% gc time)
AppleAccelerate Float32:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n)+n*I; ata = a'a; @time cholesky(ata); end
0.003183 seconds (3 allocations: 4.000 MiB)
0.012119 seconds (3 allocations: 16.000 MiB)
0.070706 seconds (3 allocations: 64.000 MiB, 0.20% gc time)
0.366411 seconds (3 allocations: 256.000 MiB, 0.04% gc time)
Number of computations = kn^3
Time | O64 | A64 | O32 | A32 |
---|---|---|---|---|
1024 | 0.011238 | 0.005091 | 0.003528 | 0.003183 |
2048 | 0.019787 | 0.025742 | 0.012166 | 0.012119 |
4096 | 0.151810 | 0.131629 | 0.072055 | 0.070706 |
8192 | 0.827275 | 0.900975 | 0.415257 | 0.366411 |
GFLOPS/k | O64 | A64 | O32 | A32 |
---|---|---|---|---|
1024 | 96 | 211 | 304 | 337 |
2048 | 434 | 334 | 706 | 709 |
4096 | 453 | 522 | 954 | 972 |
8192 | 665 | 610 | 1324 | 1500 |
GFLOPS (k=1/3) | O64 | A64 | O32 | A32 |
---|---|---|---|---|
1024 | 32 | 70 | 101 | 112 |
2048 | 145 | 111 | 235 | 236 |
4096 | 151 | 174 | 318 | 324 |
8192 | 222 | 203 | 441 | 500 |
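As a sanity check, the tables above can be reproduced from the raw timings with a one-liner. The `gflops` helper here is my own, not from any package; it just assumes the kn^3 flop model:

```julia
# Hypothetical helper: convert a raw timing for an n x n Cholesky
# into GFLOPS, assuming the factorization costs k*n^3 flops.
gflops(n, t; k = 1/3) = k * n^3 / t / 1e9

# e.g. the OpenBLAS Float64 run at n = 8192 above:
gflops(8192, 0.827275)          # ≈ 222, the "GFLOPS (k=1/3)" column
gflops(8192, 0.827275, k = 1)   # ≈ 665, the "GFLOPS/k" column
```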
500 GFLOPS at Float32 shows that Accelerate, at least, isn't using AMX vecF32. Next, I'll need those benchmarks repeated while forcing all libraries to use one CPU core.
I think my value for k is wrong. Which one is Julia using? Perhaps Cholesky decomposition algorithms that utilize GEMM perform more computations.
Also, are you running the Cholesky decompositions once, or running several times and reporting the best performing iteration? The latter is necessary for accurate benchmarking.
The several iterations also need to happen in quick succession (ideally 0 milliseconds apart) to fill the caches. That is how to find the maximum GFLOPS.
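That is exactly what BenchmarkTools.jl automates. A minimal sketch, assuming the same setup as the runs above (the `samples`/`seconds` settings are illustrative, not tuned):

```julia
using BenchmarkTools, LinearAlgebra

n = 8192
a = rand(n, n)
ata = a'a
# @benchmark re-runs the call back-to-back, keeping the caches warm,
# and records the full timing distribution.
b = @benchmark cholesky($ata) samples=10 seconds=30
best = minimum(b.times) / 1e9   # best-iteration time, in seconds
```

The minimum over quick-succession runs is the number to convert to GFLOPS, since it reflects peak throughput with warm caches.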
I have not used BenchmarkTools etc. I believe GEMM uses the same number of flops as the regular unblocked algorithm.
We should be able to examine the OpenBLAS code base and find the number of computations inside the inner loop. Then we would have a reliable estimate of k, although k could be different for Accelerate.
I'm quite certain they are both the same algorithm. But I believe Accelerate calls the stock LAPACK dpotrf, while OpenBLAS overlays the LAPACK version with its own multi-threaded one.
If you are doing stock LAPACK + BLAS GEMM, you have a sequential Cholesky with parallel threaded GEMM. However, if you multi-thread the Cholesky itself, you only use small single-threaded GEMM kernels inside, and IIUC there is more parallelism that way. The flop count is the same.
We can compare execution speed of the two algorithms.
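To make the distinction concrete, here is a minimal right-looking blocked Cholesky sketch. This is my own illustration, not either library's actual code, and the block size is an arbitrary choice. The trailing update is a GEMM/SYRK and carries almost all the flops; stock LAPACK only parallelizes inside that BLAS call, whereas OpenBLAS threads across the whole factorization:

```julia
using LinearAlgebra

# Illustrative right-looking blocked Cholesky (lower-triangular factor).
function blocked_chol!(A::Matrix{Float64}; nb::Int = 256)
    n = size(A, 1)
    for j in 1:nb:n
        jb = min(nb, n - j + 1)
        d, r = j:j+jb-1, j+jb:n
        # POTRF on the small diagonal block (sequential in stock LAPACK).
        A[d, d] = cholesky(Symmetric(A[d, d], :L)).L
        if !isempty(r)
            L = LowerTriangular(A[d, d])
            A[r, d] = A[r, d] / L'              # TRSM on the panel
            A[r, r] -= A[r, d] * A[r, d]'       # GEMM/SYRK trailing update
        end
    end
    return tril!(A)
end
```

In the stock-LAPACK scheme, only the trailing-update line runs multi-threaded; a fully threaded Cholesky instead overlaps panel factorizations with the updates.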
Cholesky (M2 Max, 3.??? GHz multi-core, 3.660 GHz single-core) - k=???
GFLOPS/k | O64 | A64 | O32 | A32 |
---|---|---|---|---|
1024 | 96 | 211 | 304 | 337 |
2048 | 434 | 334 | 706 | 709 |
4096 | 453 | 522 | 954 | 972 |
8192 | 665 | 610 | 1324 | 1500 |
GEMM (M1 Max, 3.064 GHz multi-core, 3.228 GHz single-core) - k=2
GFLOPS/k | O64 | A64 | O32 | A32 |
---|---|---|---|---|
~1000 | 176 | 337 | 327 | 1327 |
Comparing against OpenBLAS, which only uses the NEON units, k might be 0.560-0.600. I was not expecting 1/2 to be a possible value. I'll need to finish my AMX benchmark script to get a definitive answer; faster Cholesky execution times would decrease the estimate of k.
A great test of AMX usage would be to run Cholesky on complex numbers. If OpenBLAS comes out much faster, we know Accelerate is using the AMX. The converse is not true, though.
OpenBLAS ComplexF32 multi-threaded:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
0.010979 seconds (3 allocations: 8.000 MiB)
0.037187 seconds (3 allocations: 32.000 MiB)
0.240612 seconds (3 allocations: 128.000 MiB, 0.72% gc time)
1.502579 seconds (3 allocations: 512.000 MiB, 0.50% gc time)
Accelerate ComplexF32 multi-threaded:
julia> using AppleAccelerate
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
0.025775 seconds (3 allocations: 8.000 MiB)
0.108255 seconds (3 allocations: 32.000 MiB)
0.609296 seconds (3 allocations: 128.000 MiB, 0.29% gc time)
3.702357 seconds (3 allocations: 512.000 MiB, 0.17% gc time)
OpenBLAS ComplexF32 1 thread:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
0.024118 seconds (3 allocations: 8.000 MiB)
0.159501 seconds (3 allocations: 32.000 MiB, 1.11% gc time)
1.209626 seconds (3 allocations: 128.000 MiB, 0.17% gc time)
9.236193 seconds (3 allocations: 512.000 MiB, 0.09% gc time)
Accelerate ComplexF32 1 thread:
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
0.027845 seconds (3 allocations: 8.000 MiB)
0.116399 seconds (3 allocations: 32.000 MiB, 6.29% gc time)
0.592916 seconds (3 allocations: 128.000 MiB, 0.48% gc time)
4.168198 seconds (3 allocations: 512.000 MiB, 0.16% gc time)
cc @chriselrod (just fyi)
Could I have the source code you used to restrict the number of threads in Accelerate? It does not seem like MT and ST have any performance difference.
Also, can you prepare some code that loops over the n=8192 case several times? I want us to measure power consumption soon.
I just used the VECLIB_MAX_THREADS you had linked to in one of the comments.
ENV["VECLIB_MAX_THREADS"]=1
You can use BenchmarkTools.jl to benchmark, which will do what you want. Just put `@btime` in front of the `cholesky` call.
> You can use BenchmarkTools.jl to benchmark, which will do what you want. Just put `@btime` in front of the `cholesky` call.

Thanks for the directions. I haven't actually installed Julia though; I'm just commenting on performance.
> I just used the VECLIB_MAX_THREADS you had linked to in one of the comments.

There's a slight misspelling: you need to set `VECLIB_MAXIMUM_THREADS` for it to work properly. Currently it has `MAX` instead of `MAXIMUM`.
> Thanks for the directions. I haven't actually installed Julia though; I'm just commenting on performance.

Maybe this prompts you to get a bit into Julia and learn a few tricks :-) I suspected as much from your comments, but it is really straightforward to take what I pasted above and rerun things. I'm happy to put a script together as well.
> There's a slight misspelling: you need to set `VECLIB_MAXIMUM_THREADS` for it to work properly.

The perils of doing GitHub on the phone. I checked my session history and indeed did use `MAXIMUM`.
> Maybe this prompts you to get a bit into Julia and learn a few tricks :-)

I'm all for compiled languages that make code easier to maintain without sacrificing performance. I think it's ultimately my responsibility to finish my AMX benchmark script; that would provide better data anyway. POTRF is one of the functions needed for quantum chemistry.
Generally speaking, you only implement an optimization if there's a detectable and consistent improvement everywhere. Your stats for single-threaded complex F32 show it's definitely using AMX (vector? matrix?) instructions. However, Accelerate is slower than OpenBLAS for multithreaded complex. I would not recommend optimizing a perceived "more-common" real-valued use case, only to silently harm a "less-common" complex-valued use case.
My amx-benchmarks repo showed that with GEMM, complex-valued may be faster on Accelerate than on OpenBLAS. I didn't realize this, but the interleaved complex format harms utilization of the NEON units; it just harms the AMX to a greater extent. Compound that with the AMX already having fewer vector FLOPS, and the perf delta for things like Cholesky gets quite severe.
The only thing that was consistently faster: FP32 real-valued Cholesky. That is a very narrow use case, by a single-digit percentage margin. The software maintenance effort may not be worth it.
> I'm happy to put a script together as well.

Would you be willing to do the same thing you've already done with Cholesky, but with all the other functions described in amx-benchmarks? TRSM might be faster with AMX as well. I would think that task isn't too difficult on your end, and I'm fine if you stick with your one-shot benchmark estimate.
Also, I found the outer-product version of Cholesky (which would use the AMX), and the flop count is kn^3 = (1/3)n^3. But each subsequent inner-loop iteration does a smaller product, which could be the cause of the AMX underutilization.
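For the record, the 1/3 coefficient falls out of summing the shrinking trailing updates: step $j$ of the outer-product algorithm updates an $(n-j)\times(n-j)$ trailing block at a cost of roughly $(n-j)^2$ flops (exploiting symmetry), so

```latex
\sum_{j=1}^{n} (n-j)^2 \;\approx\; \int_0^n (n-x)^2\,dx \;=\; \frac{n^3}{3}
```

which also shows why utilization drops: the updates shrink toward zero-sized products as $j \to n$.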
I'm closing this since there isn't much we can do here. We'll get faster solvers as Apple updates Accelerate.
I'm using an M2 Max. On matmul, Accelerate gives good performance compared to OpenBLAS:
On Cholesky decomposition, Accelerate has roughly the same performance as OpenBLAS, perhaps because OpenBLAS has a multi-threaded Cholesky while Apple is probably only getting multi-threading from L3 BLAS through the stock LAPACK Cholesky.