Improve performance for powers of two

calebzulawski commented 4 years ago

The mixed-radix FFT algorithm's performance isn't too bad (at the time of writing, powers of three are something like 60% slower than FFTW on my machine).

However, the mixed-radix algorithm (at least as it exists right now) isn't the best choice for powers of two (approximately 3x slower than FFTW on my machine). I have to imagine the discrepancy here is that FFTW is probably using a modified split-radix algorithm. I'd probably be happy with a standard split-radix implementation for simplicity.

ejmahler commented 4 years ago

One thing worth looking into with all of this: When comparing benchmarks to rustFFT, is the difference in performance between power2 sizes down to algirthms, or is it down to the SIMD?

If it's just SIMD, I don't think you'll get much from split radix. I implemented it for rustFFT and ended up going back to radix4 because it was actually slower, not faster. I didn't implement the modified one, so maybe it's faster, but the impression i get is that it shaves off fractions of percents of performanc,e not 60%.

It's also possible that fourier's code structure is just better for code generation than rustfft, or maybe split radix is particularly well-suited to SIMD. So it's worth investigating.

calebzulawski commented 4 years ago

I do believe that the performance difference between Fourier and RustFFT comes down to SIMD (for powers of two, at least). It might be worth looking into the arithmetic operation count between mixed-radix and split-radix, just to get an idea of what to expect in the optimal case, you might be right that it's only a small improvement. If split-radix isn't the answer, I have to wonder what incantations FFTW, Intel, etc use for powers of two for such a big difference in performance.

ejmahler commented 4 years ago

I've done some exploration into split radix with AVX, and I've got promising results. According to the benchmarks, my hardcoded sizes are actually faster than FFTW, Larger split radix sizes are slower than FFTW, but faster than fourier.

Here's a link to the implementation

The main event is the SplitRadixAvx struct, and MixedRadixAvx4x2, 4x4, etc structs.

To assemble a power of two algorithm for benchmarks, i've been using this function:

fn get_splitradix_avx(len: usize) -> Arc<FFT<f32>> {
    match len {
        8 => Arc::new(MixedRadixAvx4x2::new(false)),
        16 => Arc::new(MixedRadixAvx4x4::new(false)),
        32 => Arc::new(MixedRadixAvx4x8::new(false)),
        64 => Arc::new(MixedRadixAvx8x8::new(false)),
        _ => {
             let mut radishes = Vec::new();
            radishes.push(Arc::new(MixedRadixAvx4x8::new(false)) as Arc<FFT<f32>>);
            radishes.push(Arc::new(MixedRadixAvx8x8::new(false)) as Arc<FFT<f32>>);

            while radishes.last().unwrap().len() < len {
                let quarter = Arc::clone(&radishes[radishes.len() - 2]);
                let half = Arc::clone(&radishes[radishes.len() - 1]);
                radishes.push(Arc::new(SplitRadixAvx::new(half, quarter)) as Arc<FFT<f32>>);
            }
            Arc::clone(&radishes.last().unwrap())
        }
    }
}

The hardcoded sizes are all pretty darn fast. I imagine there are improvements to be had, but they're gonna be small. I think there are still a lot of major improvements to be made to the split radix impl.

I also haven't put any work into f64 inputs at all, I imagine that's going to take an entirely different set of techniques.

I'd love to incorporate this into rustfft, but I need specialization to land before I can release a stable version, and there's a pretty brutal specialization bug that is preventing me from even incorporating it into the planner. But your library doesn't depend on specialization, so I bet you could use this stuff right away. Let me know what you think!

ejmahler commented 4 years ago

I implemented a few different hardcoded sizes using both mixed radix and split radix. For example I wrote both a split radix size 16, and a mixed radix 4x4. And I wrote a split radix size 32 implementation and a mixed radix 4x8 implementation. In both cases, the mixed radix was both dramatically simpler and dramatically faster.

Split radix definitely reduces multiplication counts and requires fewer twiddle factors, which is a big deal at large sizes, but for small sizes, all the reshuffling of data appears to just get in the way.

calebzulawski commented 4 years ago

Whoops, writing this on mobile and hit the wrong button :( This is very interesting and a great improvement, hopefully this weekend I'll get a chance to look into it. I wonder if there is a reformulation that is similar to Stockham autosort that eliminates the shuffling?

calebzulawski / fourier

Improve performance for powers of two #2