RadixN: Like Radix4, but supports size 2,3,4,5,6, and 7 cross-FFTs all in the same instance

This PR implements a new FFT algorithm called RadixN, and integrates it into the planner. It efficiently handles FFT sizes containing small prime factors. So far, it's scalar-only, but I intend to port it over to the SIMD implementations next.

The key insight here is that even though we use the same cross-FFT size at every "level" of Radix4, there's no intrinsic limitation that forces us to do so - that's just what we need for powers of four, so that's what radix4 does.

RadixN removes this restriction, allowing every cross-FFT to have a different butterfly size. I picked 2,3,4,5,6, and 7 as the initial butterflies, but if it improves performance to add 9 or 10 or 11 for example, I'm open to adding it. If we add a lot of composite sizes, I suspect there will be a combinatoric explosion of complexity of which overlapping sets of factors to choose from (Ie, if we have 2 3 5, we could collapse that to 3 10, or 6 5, or 2 * 15, which is fastest? it gets messy really fast, so if we start adding more composites than we have now i'd like it to come with an architecture for managing that complexity.

RadixN can handle factors larger than its largest butterfly length, by using those larger factors as its base FFT length. For example, RadixN can gracefully handle something like 4 9 5 7 101, by processing the 4 9 5 * 7 internally, and using 101 as the base FFT size.

Benchmarking shows that RadixN is marginally worse than Radix4 at power of two sizes, as you might expect given that it has more branches, etc. So the planner still uses Radix4 when computing 4k base, and still tries to maneuver the base to make 4^k happen.

Where the performance of RadixN really shines is in collections of small factors - For 44100 (2^2 3^2 5^2 * 7^2), we get a 60% improvement:

test bench_radixn_44100_planned                      ... bench:     741,795 ns/iter (+/- 38,680)
test bench_radixn_44100_radixn                       ... bench:     453,662 ns/iter (+/- 12,686)

For 48000 (2^7 3 5^3) we get a 45% improvement:

test bench_radixn_48000_planned                      ... bench:     685,025 ns/iter (+/- 28,818)
test bench_radixn_48000_radixn                       ... bench:     469,620 ns/iter (+/- 11,005)

These improvements come from eliminating the overhead associated with the multiple transposes that MixedRadix does. I was inspired by a comment from #65 - "With this shuffler, there is no need to use mixer radix for long ffts, since radix4 is now the fastest also there." I remembered this quote, and wondered - if it's true for powers of two, why wouldn't it also be true for non powers of two?

I plan to do one more pass of cleanup and clarity before submitting, and then like I said I'll port it over to the SIMD implementations.

ejmahler / RustFFT

RadixN: Like Radix4, but supports size 2,3,4,5,6, and 7 cross-FFTs all in the same instance #132