ejmahler / RustFFT

RustFFT is a high-performance FFT library written in pure Rust.
Apache License 2.0
678 stars 47 forks source link

Relax Radix4's restriction that the base FFT must be a power of 2 #131

Closed ejmahler closed 7 months ago

ejmahler commented 7 months ago

This PR relaxes the restriction for all 4 Radix4 implementations that the base FFT length must be a power of 2. For scalar, there's no longer any restriction, and for SIMD, it's restricted to a multiple of COMPLEX_PER_VECTOR * 2 to account for SIMD radix4's unrolling.

So far, we only take advantage of this relaxation in the planner when planning Bluestein's Algorithm. Before, we would attempt to plan a Mixed Radix step with a butterfly3 on one side, and a radix4 on the other side. This carried a lot of overhead, so it wasn't too much faster. Now, we use size-12 and size-24 butterflies instead of radix4's usual size-16 and size-32, so the factor of 3 gets absorbed into the base.

I benchmarked scalar and all of the SIMD implementations' performance of these strategies:

scalar_factor3

I won't bother sharing the SIMD benchmarks because they all look identical. There's two major takeaways from this chart. The first is that the new strategy ("factor 3 base") is strictly faster than the old mixed radix strategy. The second is that element-for-element, the factor 3 base FFTs are just as fast, or maybe even faster, than the strictly power of two FFTs.

This matches what I found when doing AVX benchmarking - size 3, 6, and 12 FFTs seems to be very, very competitive. On intel processors at least, my theory has been that size 6 and 12 hit a sweet spot of register utilization that 4, 8, and 16 can't quite reach.

This suggests to me that we could benefit from trying more RadixN algorithms than just 4. I bet a Radix6 algorithm would be very competitive, for example. I intend to tackle that soon.