Performance issues with AVX512 on Intel Cascade Lake

RemiLacroix-IDRIS commented 3 years ago

Hello,

We are seeing some performance issues with AVX512 on our Intel Cascade Lake-based machine.

I am attaching a simplified test case to this issue: testfft.zip.

Compile with: ifort -O3 -qopenmp -I/.../fftw/include -L.../fftw/lib -lfftw3 -lfftw3_omp -o testfft testfft.f90

The full logs are available for FFTW_ESTIMATE (res_estimate.txt) and FFTW_MEASURE (res_measure.txt).

There are two different issues:

- The benefit of using AVX512 is unclear for fftw_execute_dft_r2c, especially when using multiple threads.
- When using multiple threads, fftw_plan_dft_r2c_3d is orders of magnitude slower with the AVX512-enabled build than with the AVX2-enabled build (for example, with FFTW_ESTIMATE and 10 threads: 0.0168 s for AVX2 vs 16.6067 s for AVX512). The more threads you add, the slower it gets (FFTW_ESTIMATE and 20 threads: 0.0208 s vs 39.67 s).
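To put numbers on the second issue, here is a small sketch that computes the slowdown from the planning times quoted above. The timings are copied from this report; the names `plan_times` and `slowdown` are only illustrative and are not part of testfft.f90 or FFTW.

```python
import math

# Planning times in seconds for fftw_plan_dft_r2c_3d with FFTW_ESTIMATE,
# copied from the report above. Keys are thread counts; values are
# (AVX2 build, AVX512 build). The variable names are illustrative only.
plan_times = {
    10: (0.0168, 16.6067),
    20: (0.0208, 39.67),
}


def slowdown(avx2_s, avx512_s):
    """Return how many times slower the AVX512 build is than the AVX2 build."""
    return avx512_s / avx2_s


for threads, (avx2_s, avx512_s) in plan_times.items():
    ratio = slowdown(avx2_s, avx512_s)
    # log10 of the ratio gives the number of orders of magnitude.
    print(f"{threads} threads: {ratio:.0f}x slower "
          f"(about {math.log10(ratio):.1f} orders of magnitude)")
```

With these timings the AVX512 build plans roughly 1000 times slower at 10 threads and roughly 1900 times slower at 20 threads, so the gap about doubles when the thread count doubles.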

The first issue might just be down to the hardware, but the second one really looks like a bug.

Best regards, Rémi

RemiLacroix-IDRIS commented 3 years ago

Any ideas? It really feels like there must be a bug with fftw_plan_dft_r2c_3d (and maybe other fftw_plan_* functions) when using AVX512.

Lqlsoftware commented 3 years ago

Can you provide your code and detailed information about your platform? I have done some work porting FFTW to RISC SIMD instructions, and I also see low performance when using multiple threads. It seems that the cost-calculation function sometimes does not work correctly when using SIMD instructions, but false sharing of cache lines may also cause the problem.

RemiLacroix-IDRIS commented 3 years ago

> Can you provide your code and detailed information about your platform?

The code is attached to this issue. The platform has two Intel Cascade Lake 6248 processors (2 x 20 cores @ 2.5 GHz) per node.

> It seems that the cost-calculation function sometimes does not work correctly when using SIMD instructions, but false sharing of cache lines may also cause the problem.

The problem is mostly with fftw_plan_dft_r2c_3d and only happens with threads and AVX512 (AVX2 works fine).

Lqlsoftware commented 3 years ago

OK, I will see what I can do.

FFTW / fftw3

Performance issues with AVX512 on Intel Cascade Lake #220