
macOS Accelerate Optimizations and Julia versus Python Benchmarks on Apple Silicon #50806

Open · essandess opened this issue 1 year ago

essandess commented 1 year ago

[Solution to this issue: Use AppleAccelerate.jl.]
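For reference, a minimal sketch of that fix, assuming Julia ≥ 1.7 (AppleAccelerate.jl forwards BLAS/LAPACK calls to Accelerate through libblastrampoline, so no rebuild is needed):

```julia
using LinearAlgebra
using AppleAccelerate  # swaps the default OpenBLAS backend for Accelerate

BLAS.get_config()      # should now list Accelerate among the loaded libraries
```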

I'm a maintainer for the MacPorts julia port and am trying to make sure we have the best build formula for Apple Silicon.

I see a ≈2.5–3× difference in CPU performance between Julia and Python on basic dense matrix operations, which suggests that we may not be using the Accelerate framework appropriately. (Metal.jl benchmarks are within a few TFLOPS of PyTorch+MPS, so at least that part looks okay.)

How does one ensure that Julia is compiled to use macOS Accelerate optimizations? We build with the instructions provided by Julia itself, so this performance issue may originate in Julia's own build configuration rather than in the port:

https://github.com/macports/macports-ports/blob/0f6d1c42dfc3bda20673e34529c51ab34a4f3da4/lang/julia/Portfile#L57-L58
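A quick way to verify which BLAS backend a given binary actually dispatches to, shown here as an illustrative diagnostic (not part of the Portfile), is `BLAS.get_config()` from the LinearAlgebra standard library:

```julia
using LinearAlgebra

# Stock Julia builds report libopenblas64_ here; a session with
# AppleAccelerate.jl loaded lists Accelerate instead.
BLAS.get_config()
```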

On a Mac Studio M2 Ultra, I observe that Numpy with Accelerate achieves about 2.5–3 TFLOPS for dense ops, but Julia achieves 1–1.4 TFLOPS, using both MacPorts and julialang.org binaries.

Here's some basic benchmarking code and results:

Benchmarks on Mac Studio M2 Ultra

##### Julia Benchmark Code

```julia
# using AppleAccelerate  # uncomment to route BLAS through Accelerate
using BenchmarkTools
using Metal
using Printf

j_type = Float32
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    b = randn(j_type, sz, sz)
    a_mtl = MtlArray(a)
    b_mtl = MtlArray(b)
    # CPU matmul; indexing [1, 1] forces the GPU result to synchronize
    ts = @belapsed $a * $b
    ts_mtl = @belapsed ($a_mtl * $b_mtl)[1, 1]
    # GFLOPS from the standard matmul flop count n^2 * (2n - 1)
    @printf("| %d\t| %.1f\t| %.1f\t|\n", sz,
            sz^2 * (2 * sz - 1) / ts / 1e9,
            sz^2 * (2 * sz - 1) / ts_mtl / 1e9)
end
```
**Julia (MacPorts) Matrix Multiplication (GFLOPS)**

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 1068.3 | 11071.3 |
| 4096 | 1168.5 | 16652.6 |
| 8192 | 1350.6 | 18281.8 |
| 16384 | 1353.1 | 17988.2 |
**Julia (julialang.org) Matrix Multiplication (GFLOPS)**

| Size | Julia | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 962.1 | 10760.9 |
| 4096 | 1162.4 | 16134.1 |
| 8192 | 1348.3 | 17379.8 |
| 16384 | 1322.1 | 17831.5 |
**Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)**

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |
##### Python Benchmark Code

```python
# Run under IPython: the %timeit magic below is not plain Python.
import numpy as np
import torch

mpsDevice = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
rg = np.random.default_rng(1)
np_type = np.float32
torch_type = torch.float32

print("Python Matrix Multiplication (GFLOPS)\n")
print("| Size\t| Numpy+Accelerate \t| PyTorch+MPS |")
print("| -----:\t| -----:\t| -----: |")
for size in (2048, 4096, 8192, 16384):
    a_np = rg.random((size, size), dtype=np_type)
    b_np = rg.random((size, size), dtype=np_type)
    a_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    b_torch = torch.randn((size, size), dtype=torch_type, device=mpsDevice)
    # Indexing [0, 0].cpu() forces the MPS result back to the host
    ts_np = %timeit -n1 -r5 -q -o a_np @ b_np
    ts_torch = %timeit -n1 -r5 -q -o (a_torch @ b_torch)[0, 0].cpu()
    print("| {:d}\t| {:.1f}\t| {:.1f} |".format(
        size,
        size**2 * (2 * size - 1) / np.median(ts_np.all_runs) / 1e9,
        size**2 * (2 * size - 1) / np.median(ts_torch.all_runs) / 1e9))
```
**Python Matrix Multiplication (GFLOPS)**

| Size | Numpy+Accelerate | PyTorch+MPS |
| -----: | -----: | -----: |
| 2048 | 2134.5 | 10679.2 |
| 4096 | 2626.8 | 20309.6 |
| 8192 | 2845.0 | 20988.9 |
| 16384 | 3015.1 | 19577.4 |
```julia
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.5.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 16 virtual cores
```
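One caveat when comparing these CPU numbers (an aside, not from the original report): `Threads: 1` above counts Julia threads, while OpenBLAS manages its own thread pool, which is what actually matters for BLAS throughput. It can be inspected and adjusted separately:

```julia
using LinearAlgebra

BLAS.get_num_threads()    # OpenBLAS worker threads, independent of `Threads: 1`
BLAS.set_num_threads(16)  # e.g. pin to the M2 Ultra's 16 performance cores
```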
brenhinkeller commented 1 year ago

Interesting! How does the Apple Silicon binary from julialang.org/downloads compare?

essandess commented 1 year ago

I now include julialang.org results above; they are comparable to the MacPorts binaries. I also updated the Julia benchmarks to use more accurate BenchmarkTools.jl timings via @belapsed.

giordano commented 1 year ago

Did you try AppleAccelerate.jl, which doesn't require rebuilding Julia?

essandess commented 1 year ago

Thank you! That's what I was looking for:

**Julia (AppleAccelerate.jl) Matrix Multiplication (GFLOPS)**

| Size | AppleAccelerate.jl | Metal.jl |
| -----: | -----: | -----: |
| 2048 | 3301.1 | 10474.0 |
| 4096 | 3588.8 | 16004.8 |
| 8192 | 4018.3 | 17385.5 |
| 16384 | 4187.6 | 17944.2 |
essandess commented 1 year ago

I’m going to reopen this issue as a feature request for compiled Julia on Apple Silicon to use the Accelerate framework by default. I cannot imagine a scenario where one wouldn’t want this: it’s a 3–4× performance gain.
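Until something like that lands, one workaround (my own suggestion, not discussed elsewhere in this thread) is to load the package from `~/.julia/config/startup.jl`, so every session picks up Accelerate by default:

```julia
# ~/.julia/config/startup.jl
# Route BLAS/LAPACK through Accelerate in every session; guarded so
# startup still succeeds in environments without the package installed.
try
    using AppleAccelerate
catch err
    @warn "AppleAccelerate.jl unavailable; staying on OpenBLAS" err
end
```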

ctkelley commented 1 year ago

How large were these benchmarks? Were matrix factorizations part of it? The discussion here seemed to conclude that AA was indeed much better for matrix-matrix multiply, but often worse on LU. Nor was it easy to see the expected O(N^3) scaling as the dimension increased.

I'd stick with OpenBLAS.
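For anyone who wants to check the LU point concretely, a loop in the style of the benchmarks above should do it (my sketch, not from the linked discussion; it uses the standard (2/3)·n³ flop count for a partial-pivoted LU):

```julia
using BenchmarkTools, LinearAlgebra, Printf

for sz in [2048, 4096, 8192, 16384]
    a = randn(Float32, sz, sz)
    ts = @belapsed lu($a)  # partial-pivoted LU, the factorization at issue
    @printf("| %d\t| %.1f\t|\n", sz, 2/3 * sz^3 / ts / 1e9)
end
```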

essandess commented 1 year ago

I haven't seen comprehensive benchmarks published, but it's easy to run a few cases. I observe a 3–4× speedup for BLAS and comparable performance on standard matrix decompositions, with the exception of large, dense SVDs.

Based on my own workloads and observations, my preference is still an Accelerate framework default.

##### Julia Benchmark Code

```julia
using BenchmarkTools
using LinearAlgebra
using Printf
# using AppleAccelerate  # uncomment for the AA columns

j_type = Float32
# QR: Householder flop count 2n^2(n - n/3) ≈ (4/3)n^3
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed qr($a)
    @printf("| %d\t| %.1f\t|\n", sz, 2 * sz^2 * (sz - sz/3) / ts / 1e9)
end
# SVD: (2 + 11)n^3, a standard operation-count estimate for a dense SVD
for sz in [2048, 4096, 8192, 16384]
    a = randn(j_type, sz, sz)
    ts = @belapsed svd($a)
    @printf("| %d\t| %.1f\t|\n", sz, sz^3 * (2 + 11) / ts / 1e9)
end
```
**Julia Matrix Decompositions (GFLOPS)**

| Size | QR (OpenBLAS) | QR (AA) | SVD (OpenBLAS) | SVD (AA) |
| -----: | -----: | -----: | -----: | -----: |
| 2048 | 167.3 | 122.9 | 122.3 | 163.0 |
| 4096 | 226.2 | 189.3 | 134.7 | 98.7 |
| 8192 | 300.5 | 316.9 | 198.1 | 80.9 |
| 16384 | 359.2 | 370.8 | 255.8 | 81.6 |