FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)
GNU General Public License v2.0

version with neon acceleration is slower than normal ones #129

Open waterball opened 6 years ago

waterball commented 6 years ago

I compiled FFTW for Android. It turns out that the build with NEON acceleration is slower than the build without NEON.

The compile commands are as follows:

    NDK_DIR="/home/meishe01/cx/kit/android-ndk-r12b"
    INSTALL_DIR="`pwd`/build-android/fftw3"

    export PATH="$NDK_DIR/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin/:$PATH"
    export SYS_ROOT="$NDK_DIR/platforms/android-16/arch-arm/"
    export CC="arm-linux-androideabi-gcc --sysroot=$SYS_ROOT -march=armv7-a -mfloat-abi=softfp"
    export LD="arm-linux-androideabi-ld"
    export AR="arm-linux-androideabi-ar"
    export RANLIB="arm-linux-androideabi-ranlib"
    export STRIP="arm-linux-androideabi-strip"
    export CFLAGS="-mfpu=neon -mfloat-abi=softfp"

    mkdir -p $INSTALL_DIR

    # Build with NEON:
    ./configure --with-slow-timer --host=arm-linux-gnueabi --prefix=$INSTALL_DIR LIBS="-lc -lgcc" --enable-neon --enable-float

    # Build without NEON:
    ./configure --with-slow-timer --host=arm-linux-gnueabi --prefix=$INSTALL_DIR LIBS="-lc -lgcc" --enable-float

    make -j4
    make install

I built two versions, one with NEON and one without; the only difference is the configure command.

I tried both versions on the same phones, a Meizu MX4 Pro and a PLK-AL10, and timed only the fftwf_execute calls (R2C and C2R).

Any suggestions?

waterball commented 6 years ago

I've written a speed test sample in Android Studio for R2C operations. I have also stepped through the simd_neon.h files in a debugger, so I'm fairly sure that FFTW is using NEON for the FFT. Sample code is here; the relevant snippet is:

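    const int size = 32;
    // Real input: size x size floats. R2C output: size x (size / 2 + 1) complex values.
    float *in = (float *)fftwf_malloc(size * size * sizeof(float));
    fftwf_complex *out = (fftwf_complex *)fftwf_malloc(size * (size / 2 + 1) * sizeof(fftwf_complex));
    fftwf_plan p = fftwf_plan_dft_r2c_2d(size, size, in, out, FFTW_ESTIMATE);

    // Fill the input so the transform does not read uninitialized memory.
    for (int i = 0; i < size * size; ++i)
        in[i] = (float)(i % 7);

    // Time 1000 executions and report the average per call in milliseconds.
    timeval begin, end;
    double elapse;
    gettimeofday(&begin, 0);
    for (int i = 0; i < 1000; ++i)
        fftwf_execute(p);
    gettimeofday(&end, 0);
    elapse = 1000.0 * (end.tv_sec - begin.tv_sec) + (end.tv_usec - begin.tv_usec) / 1000.0;
    elapse = elapse / 1000.0;

    char elapse_s[100];
    snprintf(elapse_s, sizeof(elapse_s), "Elapse: %f ms\n", elapse);
    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);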
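
A single batch timed with gettimeofday can be skewed by first-call costs and by wall-clock adjustments, so small differences like the ones in this thread are hard to trust. As a rough sketch only (best_ms_per_call is a hypothetical helper, not part of FFTW), a best-of-N harness with a warm-up call and a monotonic clock might look like this:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (not part of FFTW): returns the best per-call time in
// milliseconds over `repeats` batches of `iters` calls each.
template <typename F>
double best_ms_per_call(F&& work, int iters, int repeats) {
    assert(iters > 0 && repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic, unlike gettimeofday
    work();  // warm-up call, excluded from the measurement
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto begin = clock::now();
        for (int i = 0; i < iters; ++i) work();
        const std::chrono::duration<double, std::milli> batch = clock::now() - begin;
        best = std::min(best, batch.count() / iters);  // keep the fastest batch
    }
    return best;
}
```

With the plan from the snippet above, a call could be written as `best_ms_per_call([&] { fftwf_execute(p); }, 1000, 5)`. Taking the minimum over batches is one common choice for reducing scheduler noise; it is an assumption here, not something the thread prescribes.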

I tested the 2D R2C transform on a Meizu MX4 and got a weird result:

    size           32x32      64x64     80x80     128x128
    with neon      0.053 ms   0.38 ms   1.19 ms   2.8 ms
    without neon   0.041 ms   0.46 ms   0.62 ms   2.66 ms

In three of the four sizes (all but 64x64), FFTW without NEON is faster than FFTW with NEON. If I am using FFTW incorrectly, please correct me. Thanks!
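
To put numbers on the comparison, the timings reported above can be turned into per-size ratios. This is only an illustrative sketch: neon_slowdown and the two constant arrays are hypothetical names, and the values are copied from the table in this comment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ratio > 1 means the NEON build took longer than the non-NEON build.
std::vector<double> neon_slowdown(const std::vector<double>& with_neon_ms,
                                  const std::vector<double>& without_neon_ms) {
    assert(with_neon_ms.size() == without_neon_ms.size());
    std::vector<double> ratio;
    ratio.reserve(with_neon_ms.size());
    for (std::size_t i = 0; i < with_neon_ms.size(); ++i)
        ratio.push_back(with_neon_ms[i] / without_neon_ms[i]);
    return ratio;
}

// Timings in ms for 32x32, 64x64, 80x80 and 128x128, as reported above.
const std::vector<double> kWithNeon = {0.053, 0.38, 1.19, 2.8};
const std::vector<double> kWithoutNeon = {0.041, 0.46, 0.62, 2.66};
```

Calling `neon_slowdown(kWithNeon, kWithoutNeon)` gives roughly 1.29, 0.83, 1.92 and 1.05, so the 80x80 case stands out (NEON almost twice as slow), while 128x128 is within about 5%.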

ast commented 5 years ago

I also find the neon version slower...

NPellet commented 4 years ago

I ended up here because I noticed the same behaviour.

Actually, I'm a bit confused about the -mfpu option. In configure.ac, the following appears:

        case "${host_cpu}" in
            aarch64)
                ;;
            *)
                if test "$have_neon" = "yes" -a "x$NEON_CFLAGS" = x; then
                    AX_CHECK_COMPILER_FLAGS(-mfpu=neon, [NEON_CFLAGS="-mfpu=neon"],
                [AC_MSG_ERROR([Need a version of gcc with -mfpu=neon])])
                fi
                ;;
        esac

But the armclang reference states that the -mfpu flag is not accepted for AArch64 ("-mfpu=list is rejected when targeting AArch64."; see https://developer.arm.com/documentation/100067/0608/armclang-Command-line-Options/-mfpu?lang=en).

The documentation goes on to state that the -mcpu option is the relevant one for AArch64.
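
For reference, the branch logic of the configure.ac fragment quoted above can be restated as a small function. This is a hypothetical mirror for reading purposes only (neon_cflags is not an FFTW function), and it leaves out the AX_CHECK_COMPILER_FLAGS probe, which aborts configure if the compiler does not accept -mfpu=neon.

```cpp
#include <string>

// Hypothetical mirror of the quoted configure.ac case statement.
// Returns the NEON_CFLAGS value the script would end up with, assuming the
// compiler accepts -mfpu=neon (the real script checks this and may abort).
std::string neon_cflags(const std::string& host_cpu, bool have_neon,
                        const std::string& user_neon_cflags) {
    // aarch64 branch is empty: NEON_CFLAGS is left exactly as the user set it.
    if (host_cpu == "aarch64") return user_neon_cflags;
    // Other CPUs: default to -mfpu=neon only if NEON is on and no flags were given.
    if (have_neon && user_neon_cflags.empty()) return "-mfpu=neon";
    return user_neon_cflags;
}
```

Read this way, the quoted fragment only adds -mfpu=neon for hosts other than aarch64, for example `neon_cflags("arm", true, "")`, and never for `neon_cflags("aarch64", true, "")`.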