AshtonSBradley opened this issue 11 months ago
To get high performance out of FFTW, you need to create a plan first and then re-use it, ideally with a pre-allocated array; otherwise you incur a lot of overhead by re-creating the plan every time. Note also that FFTW shares threads with Julia, so you generally need to start Julia with enough threads (e.g. julia -t 8) if you want to run FFTW multi-threaded.
(Unfortunately, the current FFTW_jll build is missing the cycle counter on Apple silicon, which disables everything but the default FFTW.ESTIMATE
plan-creation mode; that should be fixed in the next release.)
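For concreteness, the plan-and-reuse pattern might look like this (a minimal sketch; the array size and the ESTIMATE flag are just illustrative, matching the benchmarks later in this thread):

```julia
using FFTW

# Assumes Julia was started with enough threads, e.g. `julia -t 8`.
FFTW.set_num_threads(8)

a = randn(ComplexF64, 512, 512)

# Create the plan once (this is the expensive part)...
p = plan_fft(a; flags=FFTW.ESTIMATE)

# ...then reuse it for every subsequent transform of same-shaped arrays.
y = p * a
```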
Thanks for this
Threads.nthreads()  # 8
FFTW.set_num_threads(8)
F = plan_fft(a, flags=FFTW.ESTIMATE);
@btime F*a;
  715.000 μs (122 allocations: 4.01 MiB)
A vast improvement. A compiled FFTW (https://github.com/andrej5elin/howto_fftw_apple_silicon) seems to manage 350 μs without OpenMP on 4 threads with PATIENT, with more gains from OpenMP (for single precision, 210 μs drops to 160 μs on 4 threads with PATIENT).
Any scope for building with openmp using Apple's Clang?
Looking forward to the release!
> Any scope for building with openmp using Apple's Clang?
Apple Clang doesn't come with OpenMP; the only thing one could do is link an external OpenMP runtime, like LLVM's.
Yes.
Is there a way to inject that into pkg> build FFTW, or can one compile FFTW separately and have Julia find it?
The build recipe of fftw is at https://github.com/JuliaPackaging/Yggdrasil/blob/42d73ea1c9e39c6f63bdfe065caad498257d0c6a/F/FFTW/build_tarballs.jl. At the moment OpenMP isn't used anywhere as far as I understand, I guess that's a question for @stevengj.
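As an aside, one way to point Julia at a separately compiled FFTW is Pkg's artifact-override mechanism: an Overrides.toml in ~/.julia/artifacts/ can redirect a JLL's artifact to a local directory. This is an untested sketch; the UUID placeholder must be replaced with FFTW_jll's actual UUID (findable in the General registry or via Pkg.dependencies()), and the library layout must match what FFTW_jll expects:

```toml
# ~/.julia/artifacts/Overrides.toml
# Replace the key below with FFTW_jll's real package UUID.
["<FFTW_jll UUID>"]
FFTW = "/path/to/your/custom/fftw/prefix"
```

After adding this, re-precompile (or restart Julia) so FFTW_jll picks up the override.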
Apologies: I realise now that my earlier benchmarks must have been run in low-power mode on the laptop.
After a charge the times are more comparable, but I notice that even in-place planning gains almost nothing on the M1, while it brings significant gains on Intel even without MKL. The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away, but the gap has closed: 446.69 μs on 4 threads (without OpenMP) vs FFTW.jl below running at 699 μs on 8 threads.
using FFTW
FFTW.set_num_threads(8)
a = randn(ComplexF64,512,512);
F = plan_fft!(a,flags=FFTW.ESTIMATE)
using BenchmarkTools
2021 M1 Max
@btime fft(a);
699.666 μs (126 allocations: 4.01 MiB)
@btime F*a setup = (a = randn(ComplexF64,512,512));
626.917 μs (120 allocations: 9.44 KiB)
2019 Intel 8-core i9 (no MKL)
@btime fft(a);
1.110 ms (126 allocations: 4.01 MiB)
@btime F*a setup = (a = randn(ComplexF64,512,512));
373.056 μs (120 allocations: 8.44 KiB)
2019 Intel 8-core i9 (with MKL)
@btime fft(a);
528.822 μs (6 allocations: 4.00 MiB)
@btime F*a setup = (a = randn(ComplexF64,512,512));
261.137 μs (0 allocations: 0 bytes)
> The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away
In that post they are using FFTW's test/bench utility, which (a) defaults to FFTW.MEASURE, (b) precomputes the plans, and (c) pre-allocates the arrays. (b) can be accomplished using p = plan_fft(...), and (c) can be accomplished using mul!(output, p, input). However, (a) requires a new build of FFTW that enables a cycle counter on ARM (otherwise FFTW.MEASURE will be equivalent to FFTW.ESTIMATE).
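Putting (b) and (c) together, a closer apples-to-apples benchmark might look like this (a sketch; the size, flags, and thread count follow the examples earlier in the thread, and ESTIMATE stands in for MEASURE until the cycle-counter fix lands):

```julia
using FFTW, BenchmarkTools, LinearAlgebra  # LinearAlgebra provides mul!

FFTW.set_num_threads(8)
a   = randn(ComplexF64, 512, 512)
out = similar(a)                       # (c) pre-allocated output array

p = plan_fft(a; flags=FFTW.ESTIMATE)   # (b) precomputed plan

@btime mul!($out, $p, $a);             # transform with no per-call allocation
```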
Not related to this issue, but just as an fyi - Apple Silicon has been added to the CI now.
I could have sworn this used to be much faster:
Compare with FFTW installed via Python (SciPy) here: https://github.com/andrej5elin/howto_fftw_apple_silicon, where 4 threads take about 500 μs for double precision on slightly weaker hardware. Rosetta with MKL is also significantly (>10x) faster than FFTW.jl according to those benchmarks. Am I missing something?