Closed sultanqasim closed 10 months ago
Time to calculate apply window function, perform FFT, then calculate real value. Benchmarked on a MacBook Air (M1) on a debug build at 2 Msps, 60 fps, with various FFT sizes. Overlapping buffers for FFT size 65536 and up.
Volk + FFT threads:
Size Time (us)
16384 170
32768 350
65536 500
131072 1000
262144 2300
FFT threads only:
Size Time (us)
16384 250 (single thread)
32768 450 (multi-thread, 550 single thread)
65536 1000
131072 1600
262144 3600
Original:
Size Time (us)
16384 200
32768 400
65536 1450
131072 2200
262144 4900
I'm not quite sure why just linking libfftw3f_threads made the FFTs a bit slower, even with no other code changes. However, I did test to see that multiple threads improved performance relative to a single thread for FFT sizes of 32768 and higher.
Perhaps old FFT wisdom? Remove the wisdom file from ~/.suscan, open SigDigger, capture some samples, close SigDigger (gracefully), open again and profile.
I tried clearing out wisdom multiple times when I was testing but still had the issue. Interestingly, when I tried changing the planner mode to FFTW_PATIENT
or FFTW_EXHAUSTVE
, FFTs became a lot slower instead of faster (I'm not talking about planning time but actual FFT execution time).
I accidentally benchmarked on a debug build earlier, so I did on a release build now. Actually, it seems the volk-ization was unnecessary, at least on my machine (ARM Mac). Seems the compiler was already SIMD-ifying these loops with -O3. Maybe Volk does better on x86, you can check. Othewise, the Volk change can be replaced with just that memcpy simplification I did for the overlapping buffers case.
Volk + FFT threads:
Size Time (us)
8192 53
16384 115
32768 250
65536 570
131072 820
262144 2100
FFT threads only:
Size Time (us)
8192 50
16384 110
32768 250
65536 550
131072 800
262144 2100
Original:
Size Time (us)
8192 50
16384 105
32768 220
65536 750
131072 1200
262144 3100
Note: while FFTW3F runs clearly faster with multiple threads for size 65536, and slightly faster with multiple threads for size 32768, on my computer in my experiments with libfftw3f_threads linked, the linking of libfftw3f_threads has the strange side-effect of adding a lot of jitter to the computation time of FFTs. For example, the 8192 size FFTs sometimes take 50ms, and other times 150+ ms. I wonder if libfftw3f_threads is causing the FFT to be done in a separate worker thread even if the planner is asked to only use one thread.
While it seems using Volk actually slightly worsens performance on release builds in my M1 MacBook Air, this one subset of the volk-ization commit did improve performance as expected: https://github.com/sultanqasim/sigutils/commit/8756fd87b72ad8c115b1b62cde5dfbcc445d90c2
If Volk doesn't help on your machine either, then perhaps we should not use Volk for these calculations, but still do that memcpy optimization to avoid branching inside the loop.
I added proper locking to protect the non-thread safe fftw planner. This seems to fix the jitter issues I was having with single threaded work still happening in another thread. I now migrated all the FFT operations to use this multi-threaded and lock protected planner. I use two threads instead of four for 32k FFT size, as the performance benefit of four threads is negligible on my machine, but four threads does result in a lot more jitter and inefficiency at 32k size.
The last commit of this series (the volkization) is completely useless on my M1 MacBook Air, actually it slightly worsens performance (on release builds). I don't have an x86 machine handy to test if there's any benefit there, but assuming not, that last volk-ization commit should be discarded.
Got rid of the volk commit for now, and switched to FFTW's built in thread safety mechanism. Working great for me.
Documentation coverage report
Element | Value |
---|---|
Defines | 0.3% (1/352) |
Enum Values | 0.0% (0/38) |
Enums | 0.0% (0/12) |
Files | 0.0% (0/91) |
Functions | 0.0% (0/691) |
Pages | 100.0% (3/3) |
Structs | 2.1% (1/47) |
Typedefs | 0.0% (0/40) |
Variables | 2.5% (12/485) |
Total | 1.0% (17/1759) |
Thanks! Testing it now
Use multiple threads for large FFT sizes, and volk-ify hot loops. Needs
fftwf_init_threads()
to be called before sigutils is used (perhaps fromsuscan_sigutils_init()
).