BatchDrake / sigutils

Small signal processing utility library
https://batchdrake.github.io/sigutils
GNU General Public License v3.0
71 stars 29 forks source link

Optimize FFT calculations #76

Closed sultanqasim closed 10 months ago

sultanqasim commented 10 months ago

Use multiple threads for large FFT sizes, and volk-ify hot loops. Needs fftwf_init_threads() to be called before sigutils is used (perhaps from suscan_sigutils_init()).

sultanqasim commented 10 months ago

Time to calculate apply window function, perform FFT, then calculate real value. Benchmarked on a MacBook Air (M1) on a debug build at 2 Msps, 60 fps, with various FFT sizes. Overlapping buffers for FFT size 65536 and up.

Volk + FFT threads:
Size        Time (us)
16384       170
32768       350
65536       500
131072      1000
262144      2300

FFT threads only:
Size        Time (us)
16384       250 (single thread)
32768       450 (multi-thread, 550 single thread)
65536       1000
131072      1600
262144      3600

Original:
Size        Time (us)
16384       200
32768       400
65536       1450
131072      2200
262144      4900

I'm not quite sure why just linking libfftw3f_threads made the FFTs a bit slower, even with no other code changes. However, I did test to see that multiple threads improved performance relative to a single thread for FFT sizes of 32768 and higher.

BatchDrake commented 10 months ago

Perhaps old FFT wisdom? Remove the wisdom file from ~/.suscan, open SigDigger, capture some samples, close SigDigger (gracefully), open again and profile.

sultanqasim commented 10 months ago

I tried clearing out wisdom multiple times when I was testing but still had the issue. Interestingly, when I tried changing the planner mode to FFTW_PATIENT or FFTW_EXHAUSTVE, FFTs became a lot slower instead of faster (I'm not talking about planning time but actual FFT execution time).

sultanqasim commented 10 months ago

I accidentally benchmarked on a debug build earlier, so I did on a release build now. Actually, it seems the volk-ization was unnecessary, at least on my machine (ARM Mac). Seems the compiler was already SIMD-ifying these loops with -O3. Maybe Volk does better on x86, you can check. Othewise, the Volk change can be replaced with just that memcpy simplification I did for the overlapping buffers case.

Volk + FFT threads:
Size        Time (us)
8192        53
16384       115
32768       250
65536       570
131072      820
262144      2100

FFT threads only:
Size        Time (us)
8192        50
16384       110
32768       250
65536       550
131072      800
262144      2100

Original:
Size        Time (us)
8192        50
16384       105
32768       220
65536       750
131072      1200
262144      3100

Note: while FFTW3F runs clearly faster with multiple threads for size 65536, and slightly faster with multiple threads for size 32768, on my computer in my experiments with libfftw3f_threads linked, the linking of libfftw3f_threads has the strange side-effect of adding a lot of jitter to the computation time of FFTs. For example, the 8192 size FFTs sometimes take 50ms, and other times 150+ ms. I wonder if libfftw3f_threads is causing the FFT to be done in a separate worker thread even if the planner is asked to only use one thread.

sultanqasim commented 10 months ago

While it seems using Volk actually slightly worsens performance on release builds in my M1 MacBook Air, this one subset of the volk-ization commit did improve performance as expected: https://github.com/sultanqasim/sigutils/commit/8756fd87b72ad8c115b1b62cde5dfbcc445d90c2

If Volk doesn't help on your machine either, then perhaps we should not use Volk for these calculations, but still do that memcpy optimization to avoid branching inside the loop.

sultanqasim commented 10 months ago

I added proper locking to protect the non-thread safe fftw planner. This seems to fix the jitter issues I was having with single threaded work still happening in another thread. I now migrated all the FFT operations to use this multi-threaded and lock protected planner. I use two threads instead of four for 32k FFT size, as the performance benefit of four threads is negligible on my machine, but four threads does result in a lot more jitter and inefficiency at 32k size.

sultanqasim commented 10 months ago

The last commit of this series (the volkization) is completely useless on my M1 MacBook Air, actually it slightly worsens performance (on release builds). I don't have an x86 machine handy to test if there's any benefit there, but assuming not, that last volk-ization commit should be discarded.

sultanqasim commented 10 months ago

Got rid of the volk commit for now, and switched to FFTW's built in thread safety mechanism. Working great for me.

github-actions[bot] commented 10 months ago

Documentation coverage report

Element Value
Defines 0.3% (1/352)
Enum Values 0.0% (0/38)
Enums 0.0% (0/12)
Files 0.0% (0/91)
Functions 0.0% (0/691)
Pages 100.0% (3/3)
Structs 2.1% (1/47)
Typedefs 0.0% (0/40)
Variables 2.5% (12/485)
Total 1.0% (17/1759)
BatchDrake commented 10 months ago

Thanks! Testing it now