AshtonSBradley opened this issue 11 months ago
To get high performance out of FFTW, you need to create a plan first and then re-use it, ideally with a pre-allocated array; otherwise you incur a lot of overhead by re-creating the plan every time. Note also that FFTW shares threads with Julia, so you generally need to start Julia with enough threads (e.g. julia -t 8) if you want to run FFTW multi-threaded.
(Unfortunately, the current FFTW_jll build is missing the cycle counter on Apple silicon, which disables everything but the default FFTW.ESTIMATE
plan-creation mode; that should be fixed in the next release.)
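For concreteness, the plan-and-reuse pattern might look like this (a minimal sketch; the array size and the ESTIMATE flag are just illustrative, matching the benchmarks later in this thread):

```julia
using FFTW

# Assumes Julia was started with enough threads, e.g. `julia -t 8`.
FFTW.set_num_threads(8)

a = randn(ComplexF64, 512, 512)

# Create the plan once (this is the expensive part)...
p = plan_fft(a; flags=FFTW.ESTIMATE)

# ...then reuse it for every subsequent transform of same-shaped arrays.
y = p * a
```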
Thanks for this
Threads.nthreads()  # 8
FFTW.set_num_threads(8)
F = plan_fft(a, flags=FFTW.ESTIMATE);
@btime F*a;
  715.000 μs (122 allocations: 4.01 MiB)
A vast improvement. A compiled FFTW (https://github.com/andrej5elin/howto_fftw_apple_silicon) seems to manage 350 μs without OpenMP on 4 threads with PATIENT, with more gains from OpenMP (for single precision, 210 μs drops to 160 μs on 4 threads with PATIENT).
Any scope for building with openmp using Apple's Clang?
Looking forward to the release!
> Any scope for building with openmp using Apple's Clang?
Apple Clang doesn't come with OpenMP; the only thing one could do is link an external OpenMP runtime, like LLVM's.
Yes.
Is there a way to inject that into pkg> build FFTW, or can one compile FFTW separately and have Julia find it?
The build recipe of fftw is at https://github.com/JuliaPackaging/Yggdrasil/blob/42d73ea1c9e39c6f63bdfe065caad498257d0c6a/F/FFTW/build_tarballs.jl. At the moment OpenMP isn't used anywhere as far as I understand, I guess that's a question for @stevengj.
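As an aside, one way to point Julia at a separately compiled FFTW is Pkg's artifact-override mechanism: an Overrides.toml in ~/.julia/artifacts/ can redirect a JLL's artifact to a local directory. This is an untested sketch; the UUID placeholder must be replaced with FFTW_jll's actual UUID (findable in the General registry or via Pkg.dependencies()), and the library layout must match what FFTW_jll expects:

```toml
# ~/.julia/artifacts/Overrides.toml
# Replace the key below with FFTW_jll's real package UUID.
["<FFTW_jll UUID>"]
FFTW = "/path/to/your/custom/fftw/prefix"
```

After adding this, re-precompile (or restart Julia) so FFTW_jll picks up the override.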
Apologies: I realise now that my earlier benchmarks must have been run in low-power mode on the laptop.
After a charge the times are more comparable, but I notice that even in-place planning gains almost nothing on the M1, while it brings significant gains on Intel even without MKL. The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away, but the gap has closed: 446.69 μs on 4 threads (without OpenMP) vs FFTW.jl below running at 699 μs on 8 threads.
using FFTW
FFTW.set_num_threads(8)
a = randn(ComplexF64,512,512);
F = plan_fft!(a,flags=FFTW.ESTIMATE)
using BenchmarkTools
2021 M1 Max
@btime fft(a);
699.666 μs (126 allocations: 4.01 MiB)
@btime F*a setup = (a = randn(ComplexF64,512,512));
626.917 μs (120 allocations: 9.44 KiB)
2019 Intel 8-core i9 (no MKL)
@btime fft(a);
1.110 ms (126 allocations: 4.01 MiB)
@btime F*a setup = (a = randn(ComplexF64,512,512));
373.056 μs (120 allocations: 8.44 KiB)
2019 Intel 8-core i9 (with MKL)
@btime fft(a);
528.822 μs (6 allocations: 4.00 MiB)
@btime F*a setup = (a = randn(ComplexF64,512,512));
261.137 μs (0 allocations: 0 bytes)
> The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away
In that post they are using FFTW's test/bench utility, which (a) defaults to FFTW.MEASURE, (b) precomputes the plans, and (c) pre-allocates the arrays. (b) can be accomplished using p = plan_fft(...), and (c) can be accomplished using mul!(output, p, input). However, (a) requires a new build of FFTW that enables a cycle counter on ARM (otherwise FFTW.MEASURE will be equivalent to FFTW.ESTIMATE).
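Putting (b) and (c) together, a closer apples-to-apples benchmark might look like this (a sketch; the size, flags, and thread count follow the examples earlier in the thread, and ESTIMATE stands in for MEASURE until the cycle-counter fix lands):

```julia
using FFTW, BenchmarkTools, LinearAlgebra  # LinearAlgebra provides mul!

FFTW.set_num_threads(8)
a   = randn(ComplexF64, 512, 512)
out = similar(a)                       # (c) pre-allocated output array

p = plan_fft(a; flags=FFTW.ESTIMATE)   # (b) precomputed plan

@btime mul!($out, $p, $a);             # transform with no per-call allocation
```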
Not related to this issue, but just as an fyi - Apple Silicon has been added to the CI now.
I could have sworn this used to be much faster:
Compare with FFTW installed via Python (SciPy) here: https://github.com/andrej5elin/howto_fftw_apple_silicon, where 4 threads take about 500 μs for double precision on slightly weaker hardware. Rosetta with MKL is also significantly (>10x) faster than FFTW.jl according to those benchmarks. Am I missing something?