Suggestions for improving performance?

robclouth commented 9 years ago

Hi, I've been using this FFT code: http://www.apo33.org/pub/puredata/APO/librairies_PD/recup/paraloeuil_v1_pd/src/d_mayer_fft.c Its performance wasn't what I expected, so I switched to FFTS. However, I'm finding that FFTS performs slightly worse than that code, which is surprising.

I'm using it for real-time audio processing in a VST, and ideally I would like up to 100 simultaneous 1024-point FFTs running. At the moment I can only get 30 before the CPU usage reaches 40%, which is too much for an audio plugin.

I'm developing on Windows, and I pre-built the lib with CMake. I'm allocating the aligned memory like this:

    static void* fft_alloc(int size) {
        float FFTS_ALIGN(32) *data = (float*)_aligned_malloc(size, 32);
        memset(data, 0, size);
        return data;
    }
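
For comparison, here is a minimal sketch of the same allocation with a null check and a matching free. The names `fft_alloc_zeroed`, `fft_free_aligned`, `fft_is_aligned` and `FFT_ALIGNMENT` are hypothetical helpers, not part of FFTS, and the non-Windows branch assumes a POSIX system that provides `posix_memalign`. As in the snippet above, the size is in bytes, not in floats.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <malloc.h> /* _aligned_malloc, _aligned_free */
#endif

/* Hypothetical constant: the 32-byte alignment used in the post above. */
#define FFT_ALIGNMENT 32

/* Allocate `bytes` bytes aligned to FFT_ALIGNMENT and zero them.
   Returns NULL on failure instead of passing a null pointer to memset. */
static void *fft_alloc_zeroed(size_t bytes)
{
    void *p = NULL;
#ifdef _WIN32
    p = _aligned_malloc(bytes, FFT_ALIGNMENT);
#else
    /* Assumption: a POSIX platform (Linux, OS X). */
    if (posix_memalign(&p, FFT_ALIGNMENT, bytes) != 0)
        p = NULL;
#endif
    if (p != NULL)
        memset(p, 0, bytes);
    return p;
}

/* Memory from _aligned_malloc must be released with _aligned_free;
   memory from posix_memalign is released with plain free. */
static void fft_free_aligned(void *p)
{
#ifdef _WIN32
    _aligned_free(p);
#else
    free(p);
#endif
}

/* Debug helper: non-zero if p is aligned to FFT_ALIGNMENT. */
static int fft_is_aligned(const void *p)
{
    return ((uintptr_t)p % FFT_ALIGNMENT) == 0;
}
```

A buffer for 1024 floats would be requested as `fft_alloc_zeroed(1024 * sizeof(float))` and released with `fft_free_aligned`.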

And this is the method that executes it:

    void FFT::forward_real(int size, const void* in, void* out){
        init(size);
        ffts_execute(real_fftPlan, in, out);
    }

init(size) only does anything if the size has changed, so it's not initializing the FFT on every execution. Any idea what's slowing things down?

Thanks

lmdsp commented 9 years ago

On my cheap mobile Core i3, a 1024-point FFT takes 4 µs, so 100 of them would take 0.4 ms. Assuming a 44.1 kHz sample rate and a 2x overlap ratio between frames, you'd have to do that approx. 90 times per second, so you'd use up to 90 * 0.4 = 36 ms per second, or about 3.6% of the CPU. That should leave plenty of time for other tasks.
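
The estimate above can be written out as a small helper. This is only a sketch of the arithmetic: `fft_cpu_fraction` and its parameter names are made up for illustration, and it assumes every frame triggers the same number of equally sized transforms.

```c
/* Fraction of one CPU core spent in FFTs (0.0 to 1.0).
   fft_time_us    : measured time of one transform, in microseconds
   ffts_per_frame : simultaneous transforms per frame (100 in this thread)
   sample_rate_hz : audio sample rate
   fft_size       : transform length in samples
   overlap        : overlap ratio between frames (2 means 50% overlap) */
static double fft_cpu_fraction(double fft_time_us, int ffts_per_frame,
                               double sample_rate_hz, int fft_size,
                               int overlap)
{
    /* Frames per second: one new frame every fft_size / overlap samples. */
    double frames_per_second = sample_rate_hz / fft_size * overlap;

    /* Microseconds of FFT work needed per second of audio. */
    double busy_us = frames_per_second * ffts_per_frame * fft_time_us;

    return busy_us / 1e6;
}
```

With the figures from the comment, `fft_cpu_fraction(4.0, 100, 44100.0, 1024, 2)` gives roughly 0.034: the frame rate is about 86 per second rather than the rounded 90, so the load comes out slightly under the 36 ms estimate.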

I suggest you profile your code and check that all compiler optimizations are turned on (SSE2, etc.).

robclouth commented 9 years ago

Hi, yeah, it did seem ridiculously slow. Sorry, I'm a n00b when it comes to this kind of stuff. I've turned on SSE2 in the project settings. Is there anything else? Profiling the code reveals that V4SF_K_N, V4SF_IMUL and V4SF_IMULJ are collectively eating most of the CPU time.

lmdsp commented 9 years ago

Have you defined the HAVE_SSE macro?

robclouth commented 9 years ago

Yeah, I have already. With release build optimizations it's acceptable for now. Thanks for the help.


linkotec commented 9 years ago

I am also using and developing FFTS for real-time audio processing, so a few questions: Which version/fork are you using? Your target is Windows, but which compiler? Which OS? Did you notice that there is only partial support for 32-bit builds? If you are seeing V4SF_K_N taking most of the CPU time, that suggests you are building the 32-bit version. Can you build the 64-bit version? Check macros.h and verify that macros-sse.h is actually selected.

To give some numbers: on my Core 2 with MSVC 2005, a 32-bit debug build gives ~500 Mflops and a release build ~3000 Mflops, but a 64-bit MSVC 2005 build gives ~13300 Mflops. To compare my numbers with your setup, you can use https://github.com/linkotec/benchFFTS (batch files included). Also, you should be able to reuse the same FFTS plan even if you are multi-threading.
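
One way to check these build settings is to compile a few probe functions with the same compiler flags and definitions as the library. This is a sketch, not FFTS code: the function names are hypothetical, `HAVE_SSE` is the macro named earlier in this thread, and `__SSE2__`, `_M_X64` and `_M_IX86_FP` are the usual GCC/Clang and MSVC predefined macros. It only reports what the translation unit it is compiled in sees, so it has to be built with the FFTS project's settings to say anything about the library.

```c
/* 1 if the HAVE_SSE macro from this thread is defined for this build. */
static int built_with_have_sse(void)
{
#ifdef HAVE_SSE
    return 1;
#else
    return 0;
#endif
}

/* 1 if the compiler itself is generating code for SSE2 or better.
   GCC/Clang define __SSE2__; MSVC defines _M_X64 for 64-bit targets
   and _M_IX86_FP >= 2 for 32-bit targets built with /arch:SSE2. */
static int compiler_targets_sse2(void)
{
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 1;
#else
    return 0;
#endif
}

/* 32 for a 32-bit build, 64 for a 64-bit build. */
static int pointer_bits(void)
{
    return (int)(sizeof(void *) * 8);
}
```

Printing the three values once at plugin start-up is enough to tell a 32-bit build or a build without the SSE macro apart from the intended configuration.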

linkotec commented 9 years ago

I have just added support for building with MinGW, which should help with your 32-bit build. mingw-w64/i686-4.9.2-posix-dwarf-rt_v3-rev1 gives ~10500 Mflops, which shows just how much better GCC's optimizer is.
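
To relate Mflops figures to time per transform, a conversion like the following may help. It assumes the numbers follow the benchFFT convention, where a complex transform of size N is counted as 5 N log2(N) floating-point operations and a real transform as half of that; whether the figures quoted in this thread were measured that way, and for which size, is an assumption. The function names are hypothetical.

```c
/* log2 of a power-of-two size, computed without libm. */
static int ilog2_pow2(unsigned n)
{
    int k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Nominal operation count under the benchFFT convention:
   5 N log2(N) for complex input, 2.5 N log2(N) for real input. */
static double fft_nominal_flops(unsigned n, int is_real)
{
    return (is_real ? 2.5 : 5.0) * n * ilog2_pow2(n);
}

/* Mflops for one transform of size n taking time_us microseconds.
   One Mflop/s equals one operation per microsecond. */
static double fft_mflops(double time_us, unsigned n, int is_real)
{
    return fft_nominal_flops(n, is_real) / time_us;
}

/* Time in microseconds for one transform at a given Mflops rate. */
static double fft_time_us(double mflops, unsigned n, int is_real)
{
    return fft_nominal_flops(n, is_real) / mflops;
}
```

Under that convention, a 1024-point complex transform that takes 4 µs corresponds to 12800 Mflops, and the same transform at ~500 Mflops would take about 102 µs.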

robclouth commented 8 years ago

Hey, so I'm back to this again. I've been using PFFFT, but thought I'd try FFTS again. I still can't get it to be as fast as PFFFT. I'm building for 64-bit on OS X El Capitan with Xcode. I used CMake to generate the Xcode project for FFTS. The USE_SSE macro is defined. Should I enable any other options in CMake? ENABLE_VFP?

anthonix / ffts

Suggestions for improving performance? #35