
macOS Accelerate Optimizations and Julia versus Python Benchmarks on Apple Silicon #50806

Open · essandess opened this issue 1 year ago

essandess commented 1 year ago

[Solution to this issue: Use AppleAccelerate.jl.]
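For reference, a minimal sketch of that fix, assuming Julia ≥ 1.7 (AppleAccelerate.jl forwards BLAS/LAPACK calls to Accelerate through libblastrampoline, so no rebuild is needed):

```julia
using LinearAlgebra
using AppleAccelerate  # swaps the default OpenBLAS backend for Accelerate

BLAS.get_config()      # should now list Accelerate among the loaded libraries
```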

I'm a maintainer for the MacPorts julia port and am trying to make sure we have the best build formula for Apple Silicon.

I see a ≈2.5–3× difference in CPU performance between Julia and Python on basic dense matrix operations, which suggests that we may not be using the Accelerate framework appropriately. (Metal.jl benchmarks are within a few TFLOPS of PyTorch+MPS, so at least that part looks okay.)

How does one ensure that Julia is compiled to use macOS Accelerate optimizations? We build with the instructions provided by Julia itself, so this performance issue may originate in Julia's own build configuration rather than in the port:

https://github.com/macports/macports-ports/blob/0f6d1c42dfc3bda20673e34529c51ab34a4f3da4/lang/julia/Portfile#L57-L58
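A quick way to verify which BLAS backend a given binary actually dispatches to, shown here as an illustrative diagnostic (not part of the Portfile), is `BLAS.get_config()` from the LinearAlgebra standard library:

```julia
using LinearAlgebra

# Stock Julia builds report libopenblas64_ here; a session with
# AppleAccelerate.jl loaded lists Accelerate instead.
BLAS.get_config()
```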

On a Mac Studio M2 Ultra, I observe that Numpy with Accelerate achieves about 2.5–3 TFLOPS for dense ops, but Julia achieves 1–1.4 TFLOPS, using both MacPorts and julialang.org binaries.

Here's some basic benchmarking code and results:

Benchmarks on Mac Studio M2 Ultra

##### Julia Benchmark Code

```julia
# using AppleAccelerate  # uncomment to route BLAS through Accelerate
using BenchmarkTools
using Metal
using Printf

j_type = Float32
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    b = randn(j_type, sz, sz)
    a_mtl = MtlArray(a)
    b_mtl = MtlArray(b)
    # CPU matmul; indexing [1, 1] forces the GPU result to synchronize
    ts = @belapsed $a * $b
    ts_mtl = @belapsed ($a_mtl * $b_mtl)[1, 1]
    # GFLOPS from the standard matmul flop count n^2 * (2n - 1)
    @printf("| %d\t| %.1f\t| %.1f\t|\n", sz,
            sz^2 * (2 * sz - 1) / ts / 1e9,
            sz^2 * (2 * sz - 1) / ts_mtl / 1e9)
end
```
**Julia (MacPorts) Matrix Multiplication (GFLOPS)**

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 1068.3 | 11071.3 |
| 4096 | 1168.5 | 16652.6 |
| 8192 | 1350.6 | 18281.8 |
| 16384 | 1353.1 | 17988.2 |
**Julia (julialang.org) Matrix Multiplication (GFLOPS)**

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 962.1 | 10760.9 |
| 4096 | 1162.4 | 16134.1 |
| 8192 | 1348.3 | 17379.8 |
| 16384 | 1322.1 | 17831.5 |
**Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)**

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |
##### Python Benchmark Code

```python
# Run under IPython: the %timeit magic below is not plain Python.
import numpy as np
import torch

mpsDevice = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
rg = np.random.default_rng(1)
np_type = np.float32
torch_type = torch.float32

print("Python Matrix Multiplication (GFLOPS)\n")
print("| Size\t| Numpy+Accelerate \t| PyTorch+MPS |")
print("| -----:\t| -----:\t| -----: |")
for size in (2048, 4096, 8192, 16384):
    a_np = rg.random((size, size), dtype=np_type)
    b_np = rg.random((size, size), dtype=np_type)
    a_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    b_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    # Indexing [0, 0].cpu() forces the MPS result back to the host
    ts_np = %timeit -n1 -r5 -q -o a_np @ b_np
    ts_torch = %timeit -n1 -r5 -q -o (a_torch @ b_torch)[0, 0].cpu()
    print("| {:d}\t| {:.1f}\t| {:.1f} |".format(
        size,
        size**2 * (2 * size - 1) / np.median(ts_np.all_runs) / 1e9,
        size**2 * (2 * size - 1) / np.median(ts_torch.all_runs) / 1e9))
```
**Python Matrix Multiplication (GFLOPS)**

| Size | Numpy+Accelerate | PyTorch+MPS |
| -----: | -----: | -----: |
| 2048 | 2134.5 | 10679.2 |
| 4096 | 2626.8 | 20309.6 |
| 8192 | 2845.0 | 20988.9 |
| 16384 | 3015.1 | 19577.4 |
```julia
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.5.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores
```
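One caveat when comparing these CPU numbers (an aside, not from the original report): `Threads: 1` above counts Julia threads, while OpenBLAS manages its own thread pool, which is what actually matters for BLAS throughput. It can be inspected and adjusted separately:

```julia
using LinearAlgebra

BLAS.get_num_threads()    # OpenBLAS worker threads, independent of `Threads: 1`
BLAS.set_num_threads(16)  # e.g. pin to the M2 Ultra's 16 performance cores
```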
brenhinkeller commented 1 year ago

Interesting! How does the Apple Silicon binary from julialang.org/downloads compare?

essandess commented 1 year ago

I now include julialang.org results above; they are comparable to the MacPorts binaries. I also updated the Julia benchmarks to use more accurate BenchmarkTools.jl timings via @belapsed.

giordano commented 1 year ago

Did you try AppleAccelerate.jl, which doesn't require rebuilding Julia?

essandess commented 1 year ago

Thank you! That's what I was looking for:

**Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)**

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |
essandess commented 1 year ago

I’m going to reopen this issue as a feature request for compiled Julia on Apple Silicon to use the Accelerate framework by default. I cannot imagine a scenario where one wouldn’t want this: it’s a 3–4× performance gain.
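Until something like that lands, one workaround (my own suggestion, not discussed elsewhere in this thread) is to load the package from `~/.julia/config/startup.jl`, so every session picks up Accelerate by default:

```julia
# ~/.julia/config/startup.jl
# Route BLAS/LAPACK through Accelerate in every session; guarded so
# startup still succeeds in environments without the package installed.
try
    using AppleAccelerate
catch err
    @warn "AppleAccelerate.jl unavailable; staying on OpenBLAS" err
end
```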

ctkelley commented 1 year ago

How large were these benchmarks? Were matrix factorizations part of it? The discussion here seemed to conclude that AA was indeed much better for matrix-matrix multiply, but often worse on LU. Nor was it easy to see the expected O(N^3) scaling as the dimension increased.

I'd stick with OpenBLAS.
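For anyone who wants to check the LU point concretely, a loop in the style of the benchmarks above should do it (my sketch, not from the linked discussion; it uses the standard (2/3)·n³ flop count for a partial-pivoted LU):

```julia
using BenchmarkTools, LinearAlgebra, Printf

for sz in [2048, 4096, 8192, 16384]
    a = randn(Float32, sz, sz)
    ts = @belapsed lu($a)  # partial-pivoted LU, the factorization at issue
    @printf("| %d\t| %.1f\t|\n", sz, 2/3 * sz^3 / ts / 1e9)
end
```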

essandess commented 1 year ago

I haven't seen comprehensive benchmarks published, but it's easy to run a few cases. I observe a 3–4× speedup for BLAS and comparable performance on standard matrix decompositions, with the exception of large, dense SVDs.

Based on my own workloads and observations, my preference is still an Accelerate framework default.

##### Julia Benchmark Code

```julia
using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate  # uncomment for the AA columns

j_type = Float32
# QR: Householder flop count 2n^2(n - n/3) ≈ (4/3)n^3
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed qr($a)
    @printf("| %d\t| %.1f\t|\n", sz, 2 * sz^2 * (sz - sz/3) / ts / 1e9)
end
# SVD: (2 + 11)n^3, a standard operation-count estimate for a dense SVD
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed svd($a)
    @printf("| %d\t| %.1f\t|\n", sz, sz^3 * (2 + 11) / ts / 1e9)
end
```
**Julia Matrix Decompositions (GFLOPS)**

| Size | QR (OpenBLAS) | QR (AA) | SVD (OpenBLAS) | SVD (AA) |
| -----: | -----: | -----: | -----: | -----: |
| 2048 | 167.3 | 122.9 | 122.3 | 163.0 |
| 4096 | 226.2 | 189.3 | 134.7 | 98.7 |
| 8192 | 300.5 | 316.9 | 198.1 | 80.9 |
| 16384 | 359.2 | 370.8 | 255.8 | 81.6 |