Open RemiLacroix-IDRIS opened 3 years ago
Any ideas? It really feels like there must be a bug with fftw_plan_dft_r2c_3d (and maybe other fftw_plan_* functions) when using AVX512.
Can you provide your code and detailed information about your platform? I've done some work on porting FFTW to RISC SIMD instructions, and I also see low performance there when using multiple threads. It seems that the cost-calculation function sometimes does not work correctly when using SIMD instructions. But false sharing of cache lines may also cause the problem.
Can you provide your code and detailed information about your platform?
The code is attached to this issue. The platform has 2 Intel Cascade Lake 6248 processors (so 2x20 cores @ 2.5 GHz) per node.
It seems that the cost-calculation function sometimes does not work correctly when using SIMD instructions. But false sharing of cache lines may also cause the problem.
The problem is mostly with fftw_plan_dft_r2c_3d and only happens with threads and AVX512 (AVX2 works fine).
OK, I will see what I can do.
Hello,
We are seeing some performance issues with AVX512 on our Intel Cascade Lake-based machine.
I am attaching a simplified test case to this issue: testfft.zip.
Compile with:
ifort -O3 -qopenmp -I/.../fftw/include -L.../fftw/lib -lfftw3 -lfftw3_omp -o testfft testfft.f90
The full logs are available for FFTW_ESTIMATE (res_estimate.txt) and FFTW_MEASURE (res_measure.txt).
There are two different issues:
1. fftw_execute_dft_r2c is slower with the AVX512-enabled build, especially when using multiple threads.
2. fftw_plan_dft_r2c_3d is orders of magnitude slower when using the AVX512-enabled build than when using the AVX2-enabled build (for example, for FFTW_ESTIMATE and 10 threads: 0.0168 s vs 16.6067 s). The more threads you add, the slower it gets (FFTW_ESTIMATE and 20 threads: 0.0208 s vs 39.67 s).
The first issue might just be because of the hardware, but the second one really feels like a bug.
Best regards, Rémi
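To put the reported planning times in perspective, here is a minimal sketch that computes the slowdown factor of the AVX512-enabled build relative to the AVX2-enabled build. It uses only the four FFTW_ESTIMATE timings quoted in this post. Reading the smaller number of each pair as the AVX2 time is an inference from the wording, and the names `reported` and `slowdown` are hypothetical helpers, not part of testfft.f90.

```python
import math

# Planning times in seconds for fftw_plan_dft_r2c_3d with FFTW_ESTIMATE,
# as quoted in the post: (threads, AVX2 build, AVX512 build).
# Assumption: the smaller value of each pair is the AVX2 build.
reported = [
    (10, 0.0168, 16.6067),
    (20, 0.0208, 39.67),
]


def slowdown(avx2_s, avx512_s):
    """Return how many times slower the AVX512 build is than the AVX2 build."""
    return avx512_s / avx2_s


# Map each thread count to its slowdown factor.
ratios = {threads: slowdown(a2, a512) for threads, a2, a512 in reported}

for threads, ratio in ratios.items():
    # log10 of the ratio gives the number of orders of magnitude.
    print(f"{threads} threads: {ratio:.0f}x slower "
          f"(about {math.log10(ratio):.1f} orders of magnitude)")
```

With these numbers the plan creation is roughly 1,000 times slower at 10 threads and roughly 1,900 times slower at 20 threads, so the slowdown itself nearly doubles when the thread count doubles.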