FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)
GNU General Public License v2.0

Cross compiling FFTW for ARM Neon #280

Open damaBeugXam opened 2 years ago

damaBeugXam commented 2 years ago

I am trying to compile FFTW3 to run on ARM NEON (more precisely, on a Cortex-A53). The build environment is x86_64-pokysdk-linux and the host environment is aarch64-poky-linux. I am using the aarch64-poky-linux-gcc compiler.

At first I used the following command: [image: command not preserved]. The compiler did not support -mfloat-abi=softfp or -mfpu=neon. It also did not let me define the path to the sysroot this way. I then used the following command: [image: command not preserved]. This command succeeded with this config log and this config.h. After that I ran make, then make install.

I then copied the shared library file into my host environment, switched my code base from fftw_ to fftwf_, and recompiled the program. I ran a test and compared the times for both algorithms using <sys/resource.h>. I also called fftw[f]_forget_wisdom() for both algorithms so that the comparison is fair.

However, I am not getting a speedup. I expected that using a SIMD architecture (NEON in our case) would accelerate the FFTW library. I would really appreciate it if anyone could point out what I am doing wrong, so that I can try a fix and see whether I get the performance boost I am looking for.
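The timing code used for this comparison is not shown in the thread. As a rough sketch only, a measurement based on `getrusage()` from `<sys/resource.h>` might look like the following. The helper names `user_cpu_seconds`, `time_per_call` and the stand-in workload `busy_work` are hypothetical and are not taken from the poster's program.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* User-mode CPU time consumed by this process so far, in seconds.
   Returns -1.0 if getrusage() fails. */
double user_cpu_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec * 1e-6;
}

/* Hypothetical helper: average user CPU seconds per call of fn(arg),
   measured over reps calls so that timer granularity matters less. */
double time_per_call(void (*fn)(void *), void *arg, int reps)
{
    double t0 = user_cpu_seconds();
    for (int i = 0; i < reps; i++)
        fn(arg);
    double t1 = user_cpu_seconds();
    return (t1 - t0) / reps;
}

/* Stand-in workload (an assumption for illustration): in a real test this
   would wrap a single call that executes an already created FFT plan. */
void busy_work(void *arg)
{
    int n = *(int *)arg;
    volatile float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += (float)i * 0.5f;
}
```

In a comparison like the one described, only the transform execution would normally sit inside the timed loop. Plan creation is better kept outside it, since planning can take far longer than a single transform.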

rdolbeau commented 2 years ago

(a) efficient use of SIMD & scalar requires planning with 'measure' or higher and not 'estimate' (although 'estimate' will likely favor SIMD over scalar);

(b) you can use fftw_print_plan() to see the plan and check NEON is being used; the format is not very readable, but codelets appear as 'n1_16' and similar with SIMD as suffix, such as 'n1_16_neon'; if NEON is not used, try with both 'estimate' and 'measure' as the library will ignore SIMD in 'measure' mode if it's not faster than scalar;

(c) the A53 may not have any significant performance advantage running NEON code over regular scalar code; it was designed for mobile efficiency rather than FP performance (and it's nearly a decade old by now). I don't have numbers for the A53, but for instance the A7 took 4x as long to do 4x (Q-form) FP32 SIMD as to do scalar FP32, so NEON isn't very useful on the A7 except for the extra register space.