JuliaLinearAlgebra / AppleAccelerate.jl

Julia interface to the macOS Accelerate framework

Cholesky performance could (perhaps) be better #59

Closed ViralBShah closed 1 year ago

ViralBShah commented 1 year ago

I'm using an M2 Max. On matmul, Accelerate gives good performance compared to openblas:

julia> peakflops(4000)
2.423671700050223e11

julia> using AppleAccelerate

julia> peakflops(4000)
4.325152941800755e11

On Cholesky decomposition, Accelerate has roughly the same performance as OpenBLAS, perhaps because OpenBLAS has a multi-threaded Cholesky while Accelerate probably only gets multi-threading from the level-3 BLAS calls made by the stock LAPACK Cholesky.

julia> a = rand(12000,12000); ata = a'a;

julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> @time cholesky(ata);
  2.618964 seconds (3 allocations: 1.073 GiB, 0.53% gc time)

julia> using AppleAccelerate

julia> @time cholesky(ata);
  2.674928 seconds (3 allocations: 1.073 GiB, 0.11% gc time)
ViralBShah commented 1 year ago

cc @philipturner

philipturner commented 1 year ago

Is it possible to perform some of the computations in the Cholesky algorithm as GEMM? Also, to investigate performance, please:

philipturner commented 1 year ago

From the stats so far:

| Library | GFLOPS/k | GFLOPS (k = 1/3) |
| --- | --- | --- |
| OpenBLAS | 659 | 215 |
| Accelerate | 656 | 220 |

Are you using double precision?

ViralBShah commented 1 year ago

Most of the computation in cholesky should be GEMM. My tests above were double precision.

philipturner commented 1 year ago

Could you show the raw execution times for some matrices smaller than 12,000? That size exceeds the system-level cache. To debug this, I want to calculate GFLOPS directly from the raw data. 1024, 2048, and 4096 would suffice.

ViralBShah commented 1 year ago

OpenBLAS Float64:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(n,n); ata = a'a; @time cholesky(ata); end
  0.011238 seconds (3 allocations: 8.000 MiB)
  0.019787 seconds (3 allocations: 32.000 MiB)
  0.151810 seconds (3 allocations: 128.000 MiB, 0.36% gc time)
  0.827275 seconds (3 allocations: 512.000 MiB, 0.01% gc time)

Accelerate Float64:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(n,n); ata = a'a; @time cholesky(ata); end
  0.005091 seconds (3 allocations: 8.000 MiB)
  0.025742 seconds (3 allocations: 32.000 MiB)
  0.131629 seconds (3 allocations: 128.000 MiB, 0.45% gc time)
  0.900975 seconds (3 allocations: 512.000 MiB, 0.01% gc time)

OpenBLAS Float32:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n)+n*I; ata = a'a; @time cholesky(ata); end
  0.003528 seconds (3 allocations: 4.000 MiB)
  0.012166 seconds (3 allocations: 16.000 MiB)
  0.072055 seconds (3 allocations: 64.000 MiB, 0.20% gc time)
  0.415257 seconds (3 allocations: 256.000 MiB, 0.03% gc time)

AppleAccelerate Float32:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n)+n*I; ata = a'a; @time cholesky(ata); end
  0.003183 seconds (3 allocations: 4.000 MiB)
  0.012119 seconds (3 allocations: 16.000 MiB)
  0.070706 seconds (3 allocations: 64.000 MiB, 0.20% gc time)
  0.366411 seconds (3 allocations: 256.000 MiB, 0.04% gc time)
philipturner commented 1 year ago

Number of computations = kn^3

**Time (s)**

| n | O64 | A64 | O32 | A32 |
| --- | --- | --- | --- | --- |
| 1024 | 0.011238 | 0.005091 | 0.003528 | 0.003183 |
| 2048 | 0.019787 | 0.025742 | 0.012166 | 0.012119 |
| 4096 | 0.151810 | 0.131629 | 0.072055 | 0.070706 |
| 8192 | 0.827275 | 0.900975 | 0.415257 | 0.366411 |

**GFLOPS/k**

| n | O64 | A64 | O32 | A32 |
| --- | --- | --- | --- | --- |
| 1024 | 96 | 211 | 304 | 337 |
| 2048 | 434 | 334 | 706 | 709 |
| 4096 | 453 | 522 | 954 | 972 |
| 8192 | 665 | 610 | 1324 | 1500 |

**GFLOPS (k = 1/3)**

| n | O64 | A64 | O32 | A32 |
| --- | --- | --- | --- | --- |
| 1024 | 32 | 70 | 101 | 112 |
| 2048 | 145 | 111 | 235 | 236 |
| 4096 | 151 | 174 | 318 | 324 |
| 8192 | 222 | 203 | 441 | 500 |
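
These figures follow directly from the raw times under the kn^3 flop model; a minimal helper to reproduce them (the function names are made up for illustration):

```julia
# GFLOPS implied by a raw Cholesky time, assuming the factorization performs k*n^3 flops.
gflops_per_k(n, seconds) = n^3 / seconds / 1e9              # the "GFLOPS/k" column
gflops(n, seconds; k = 1/3) = k * gflops_per_k(n, seconds)  # the "GFLOPS (k = 1/3)" column

gflops_per_k(8192, 0.366411)   # ≈ 1500  (Accelerate Float32, n = 8192)
gflops(8192, 0.366411)         # ≈ 500
```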


500 GFLOPS F32 shows Accelerate at least isn't using AMX vecF32. Next, I'll need those benchmarks repeated while forcing all libraries to use one CPU core.

I think my value for k is wrong. Which value does Julia's algorithm correspond to? Perhaps Cholesky decomposition algorithms built on GEMM perform more computations.

Also, are you running the Cholesky decompositions once, or running them several times and reporting the best-performing iteration? The latter is necessary for accurate benchmarking.

philipturner commented 1 year ago

The several iterations also need to happen in quick succession (ideally 0 milliseconds apart) to fill the caches. That is how to find the maximum GFLOPS.

ViralBShah commented 1 year ago

I have not used BenchmarkTools etc. I believe the blocked (GEMM-based) algorithm uses the same number of flops as the regular unblocked algorithm.

philipturner commented 1 year ago

We should be able to examine the OpenBLAS code base and find the number of computations inside the inner loop. Then we would have a reliable estimate of k, although k could be different for Accelerate.

ViralBShah commented 1 year ago

I'm quite certain they will both be the same algorithm. But I believe Accelerate calls the stock LAPACK dpotrf, whereas OpenBLAS overlays the LAPACK version with its own multi-threaded Cholesky.

If you are doing stock LAPACK + BLAS GEMM, you have a sequential Cholesky with parallel, threaded GEMM. However, if you multi-thread the Cholesky itself, you only use small single-threaded GEMM kernels inside, and IIUC there is more parallelism that way. The flop count is the same either way.
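
As a rough illustration of the stock-LAPACK flavor, here is a minimal right-looking blocked Cholesky sketch (not the LAPACK or OpenBLAS code; the block size `b` is arbitrary). The panel factorization stays sequential, and essentially all of the flops land in the trailing-matrix update, which is where a threaded BLAS gets its parallelism:

```julia
using LinearAlgebra

# Illustrative right-looking blocked Cholesky (lower factor), not a library implementation.
function blocked_cholesky!(A::Matrix{Float64}; b::Int = 256)
    n = size(A, 1)
    for k in 1:b:n
        kb = min(b, n - k + 1)
        d = k:k+kb-1                  # current diagonal block
        r = k+kb:n                    # trailing rows/columns
        cholesky!(Hermitian(view(A, d, d), :L))                  # unblocked POTRF on the panel
        if !isempty(r)
            rdiv!(view(A, r, d), LowerTriangular(view(A, d, d))')             # TRSM: L21 = A21 / L11'
            mul!(view(A, r, r), view(A, r, d), view(A, r, d)', -1.0, 1.0)     # trailing update (GEMM/SYRK)
        end
    end
    return LowerTriangular(A)
end

# quick correctness check against the library routine
B = rand(1000, 1000); S = B'B + 1000I
@assert blocked_cholesky!(copy(S)) ≈ cholesky(S).L
```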

philipturner commented 1 year ago

We can compare execution speed of the two algorithms.

Cholesky (M2 Max, 3.??? GHz multi-core, 3.660 GHz single-core) - k=???

**GFLOPS/k**

| n | O64 | A64 | O32 | A32 |
| --- | --- | --- | --- | --- |
| 1024 | 96 | 211 | 304 | 337 |
| 2048 | 434 | 334 | 706 | 709 |
| 4096 | 453 | 522 | 954 | 972 |
| 8192 | 665 | 610 | 1324 | 1500 |

GEMM (M1 Max, 3.064 GHz multi-core, 3.228 GHz single-core) - k=2

**GFLOPS/k**

| n | O64 | A64 | O32 | A32 |
| --- | --- | --- | --- | --- |
| ~1000 | 176 | 337 | 327 | 1327 |

Comparing against OpenBLAS, which only uses the NEON units, k might be 0.560–0.600. I was not expecting 1/2 to be a possible value. I'll need to finish my AMX benchmark script to get a definitive answer. Faster Cholesky execution times would decrease the estimate of k.

A great test of AMX usage would be to run Cholesky on complex numbers. If OpenBLAS comes out much faster, we know Accelerate is using the AMX. The converse is not true, though.

ViralBShah commented 1 year ago

OpenBLAS ComplexF32 multi-threaded:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
  0.010979 seconds (3 allocations: 8.000 MiB)
  0.037187 seconds (3 allocations: 32.000 MiB)
  0.240612 seconds (3 allocations: 128.000 MiB, 0.72% gc time)
  1.502579 seconds (3 allocations: 512.000 MiB, 0.50% gc time)

Accelerate ComplexF32 multi-threaded:

julia> using AppleAccelerate
julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
  0.025775 seconds (3 allocations: 8.000 MiB)
  0.108255 seconds (3 allocations: 32.000 MiB)
  0.609296 seconds (3 allocations: 128.000 MiB, 0.29% gc time)
  3.702357 seconds (3 allocations: 512.000 MiB, 0.17% gc time)

OpenBLAS ComplexF32 1 thread:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
  0.024118 seconds (3 allocations: 8.000 MiB)
  0.159501 seconds (3 allocations: 32.000 MiB, 1.11% gc time)
  1.209626 seconds (3 allocations: 128.000 MiB, 0.17% gc time)
  9.236193 seconds (3 allocations: 512.000 MiB, 0.09% gc time)

Accelerate ComplexF32 1 thread:

julia> all_n = (1024,2048,4096,8192); for n in all_n; a = rand(Float32,n,n) +I*n; ata = Hermitian(a'a + im*a'a); @time cholesky(ata); end
  0.027845 seconds (3 allocations: 8.000 MiB)
  0.116399 seconds (3 allocations: 32.000 MiB, 6.29% gc time)
  0.592916 seconds (3 allocations: 128.000 MiB, 0.48% gc time)
  4.168198 seconds (3 allocations: 512.000 MiB, 0.16% gc time)
ViralBShah commented 1 year ago

cc @chriselrod (just fyi)

philipturner commented 1 year ago

Can I have the source code you used to restrict the number of threads in Accelerate? It does not seem like MT and ST have any performance difference.

Also, can you prepare some code that loops over the n = 8192 version several times? I want us to measure power consumption soon.
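
Something of this shape would do it (a rough sketch; the repeat count is arbitrary, just long enough to hold a steady load while reading power with a tool like `powermetrics`):

```julia
using LinearAlgebra

# Keep an 8192x8192 Cholesky running continuously so steady-state power can be read.
n = 8192
a = rand(n, n); ata = a'a
for _ in 1:20          # arbitrary repeat count; extend for a longer measurement window
    cholesky(ata)
end
```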

ViralBShah commented 1 year ago

I just used the VECLIB_MAX_THREADS you had linked to in one of the comments.

ENV["VECLIB_MAX_THREADS"]=1

You can use BenchmarkTools.jl to benchmark, which will do what you want. Just put `@btime` in front of the `cholesky` call.
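
Concretely, something along these lines (size copied from the runs above):

```julia
using LinearAlgebra, BenchmarkTools

n = 8192
a = rand(n, n); ata = a'a
@btime cholesky($ata);        # reports the minimum time over repeated samples
# @benchmark cholesky($ata)   # full statistics (min/median/mean), if preferred
```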

philipturner commented 1 year ago

> You can use BenchmarkTools.jl to benchmark, which will do what you want. Just put `@btime` in front of the `cholesky` call.

Thanks for the directions. I haven't actually installed Julia, though; I'm just commenting on the performance.

> I just used the VECLIB_MAX_THREADS you had linked to in one of the comments.

There's a slight misspelling. You need to set VECLIB_MAXIMUM_THREADS for it to work properly. Currently it has MAX instead of MAXIMUM.
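
That is, something like this, set as early as possible (whether Accelerate picks up a change made after it has started threading isn't established in this thread, so exporting it before launching Julia is the safer route):

```julia
# Correct spelling of the variable; set it before Accelerate does any threaded work.
ENV["VECLIB_MAXIMUM_THREADS"] = "1"   # note MAXIMUM, not MAX
# or from the shell, before starting Julia:  VECLIB_MAXIMUM_THREADS=1 julia
```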

ViralBShah commented 1 year ago

> You can use BenchmarkTools.jl to benchmark, which will do what you want. Just put `@btime` in front of the `cholesky` call.

> Thanks for the directions. I haven't actually installed Julia, though; I'm just commenting on the performance.

Maybe this prompts you to get a bit into Julia and learn a few tricks :-) I suspected as much looking at your comments. But it is really straightforward to take what I pasted above and rerun things. I'm happy to put a script together as well.

> I just used the VECLIB_MAX_THREADS you had linked to in one of the comments.

> There's a slight misspelling. You need to set VECLIB_MAXIMUM_THREADS for it to work properly. Currently it has MAX instead of MAXIMUM.

The perils of doing GitHub on the phone. I checked my session history and I did indeed use MAXIMUM.

philipturner commented 1 year ago

> Maybe this prompts you to get a bit into Julia and learn a few tricks :-) I suspected as much looking at your comments. But it is really straightforward to take what I pasted above and rerun things. I'm happy to put a script together as well.

I'm all for compiled languages that make code easier to maintain without sacrificing performance. I think it's ultimately my responsibility to finish my AMX benchmarks script. That would provide better data anyway. POTRF is one of the functions needed for quantum chemistry.

philipturner commented 1 year ago

Generally speaking, you only implement an optimization if there's a detectable and consistent improvement everywhere. Your stats for single-threaded complex F32 show it's definitely using AMX (vector? matrix?) instructions. However, Accelerate is slower than OpenBLAS for multithreaded complex. I would not recommend optimizing a perceived "more-common" real-valued use case, only to silently harm a "less-common" complex-valued use case.

My amx-benchmarks repo showed that with GEMM, complex-valued arithmetic may actually be faster on Accelerate than on OpenBLAS. I hadn't realized this, but the interleaved complex format harms utilization of the NEON units; it just harms the AMX to a greater extent. Compound that with the AMX already having fewer vector FLOPS, and the performance delta for things like Cholesky becomes quite severe.

The only thing that was consistently faster was FP32 real-valued Cholesky, and only by a single-digit percentage margin. That is a very narrow use case, and the software maintenance effort may not be worth it.

philipturner commented 1 year ago

> I'm happy to put a script together as well.

Would you be willing to do the same thing you’ve already done with Cholesky, but using all the other functions described on amx-benchmarks? TRSM might be faster with AMX as well. I would think that task isn’t too difficult on your end. I’m fine if you stick with your one-shot benchmark estimate.
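
For TRSM specifically, a one-shot timing loop in the same style as the Cholesky runs might look like this (the well-conditioned triangular matrix is just an illustrative choice):

```julia
using LinearAlgebra

for n in (1024, 2048, 4096, 8192)
    L = LowerTriangular(rand(n, n) + n * I)   # well-conditioned lower-triangular matrix
    b = rand(n, n)
    @time L \ b                               # triangular solve (trsm-style)
end
```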

Also, I found the outer-product version of Cholesky (which would use the AMX), and the coefficient is kn^3 = (1/3)n^3 FLOPs. But each subsequent inner-loop iteration does a smaller product, which could be the cause of the AMX underutilization.
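
For concreteness, a minimal sketch of that outer-product variant (illustrative Julia, not any library's actual code):

```julia
using LinearAlgebra

# Unblocked outer-product Cholesky (lower factor). Step j does roughly (n-j)^2/2
# multiply-adds in the rank-1 trailing update, so the whole factorization costs
# about n^3/6 multiply-adds, i.e. (1/3)n^3 flops -- and the update shrinks every
# iteration, which is what makes a fixed-size matrix unit hard to keep busy.
function outer_product_cholesky!(A::Matrix{Float64})
    n = size(A, 1)
    for j in 1:n
        A[j, j] = sqrt(A[j, j])
        A[j+1:n, j] ./= A[j, j]
        for k in j+1:n                       # rank-1 update, lower triangle only
            A[k:n, k] .-= A[k:n, j] .* A[k, j]
        end
    end
    return LowerTriangular(A)
end

# quick check against LinearAlgebra.cholesky
B = rand(500, 500); S = B'B + 500I
@assert outer_product_cholesky!(copy(S)) ≈ cholesky(S).L
```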

ViralBShah commented 1 year ago

I'm closing this since there isn't much we can do here. We'll get faster solvers as Apple updates Accelerate.