FFTW / fftw3

GNU General Public License v2.0

Cross compiling FFTW for ARM Neon #280

Open damaBeugXam opened 2 years ago

damaBeugXam commented 2 years ago

I am trying to compile FFTW3 to run on ARM NEON (more precisely, on a Cortex-A53). The build environment is x86_64-pokysdk-linux and the host environment is aarch64-poky-linux. I am using the aarch64-poky-linux-gcc compiler.

I first used the following command: [image] The compiler did not support -mfloat-abi=softfp or -mfpu=neon, and it also did not let me define the path to the sysroot this way. I then used the following command: [image] This command succeeded, with this config log and this config.h. I then ran make followed by make install.

I copied the shared library file into my host environment and used fftwf_ instead of fftw_ in my code base. The final step was to recompile the program. I ran a test and compared the times of both versions using <sys/resource.h>. I also called fftw[f]_forget_wisdom() in both versions so that the comparison is fair. However, I am not getting a speedup. I expected that using a SIMD architecture (NEON in our case) would accelerate the FFTW library. I would really appreciate it if anyone could point out what I am doing wrong, so that I can try a fix and see whether I can get the performance boost I am looking for.

rdolbeau commented 2 years ago

(a) Efficient use of SIMD and scalar code requires planning with 'measure' or higher and not 'estimate' (although 'estimate' will likely favor SIMD over scalar).

(b) You can use fftw_print_plan() to see the plan and check whether NEON is being used. The format is not very readable, but codelets appear as 'n1_16' and similar, with the SIMD type as a suffix, such as 'n1_16_neon'. If NEON is not used, try both 'estimate' and 'measure', as the library will ignore SIMD in 'measure' mode if it is not faster than scalar.

(c) The A53 may not have any significant performance advantage running NEON code over regular scalar code; it was designed for mobile efficiency rather than FP performance (and it is nearly a decade old by now). I don't have numbers for the A53, but for instance the A7 took 4x as long to do a 4-wide (Q-form) FP32 SIMD operation as to do a scalar FP32 operation, so NEON isn't very useful on the A7 except for the extra register space.