ejmahler / RustFFT

RustFFT is a high-performance FFT library written in pure Rust.
Apache License 2.0

[WIP] Partially deinterleave SSE radix4 data #133

Open ejmahler opened 7 months ago

ejmahler commented 7 months ago

This change implements partial deinterleaving of FFT data for SSE radix4.

I call it partial deinterleaving because we only deinterleave out to the size of a single SSE vector, i.e. 2 complex elements for f64 and 4 complex elements for f32. The deinterleaved vectors are still stored inline, in an alternating format.

I.e., for f32, the data layout of our SSE vectors (and scratch memory) is transformed from [r0, i0, r1, i1], [r2, i2, r3, i3], [r4, i4, r5, i5], [r6, i6, r7, i7] ... to [r0, r1, r2, r3], [i0, i1, i2, i3], [r4, r5, r6, r7], [i4, i5, i6, i7] ...

Once this transformation is made, we can load the reals into one vector and the imaginaries into another, and we thus sidestep all the swizzling that rotations and complex multiplies would otherwise need. Obviously we have to transform the data back by the time the algorithm is finished, but the longer we keep it in deinterleaved format, the greater the performance gains.
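For reference, here's a minimal sketch of what the f32 layout transform looks like with SSE shuffles. This is illustrative only, not the actual code in this PR, and the function names are made up:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128, _mm_shuffle_ps, _mm_unpackhi_ps, _mm_unpacklo_ps};

// Turn two interleaved vectors [r0, i0, r1, i1], [r2, i2, r3, i3]
// into [r0, r1, r2, r3] and [i0, i1, i2, i3].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn deinterleave_f32(a: __m128, b: __m128) -> (__m128, __m128) {
    let reals = _mm_shuffle_ps::<0b10_00_10_00>(a, b); // even lanes of a, then of b
    let imags = _mm_shuffle_ps::<0b11_01_11_01>(a, b); // odd lanes of a, then of b
    (reals, imags)
}

// The inverse: [r0, r1, r2, r3] + [i0, i1, i2, i3]
// back to [r0, i0, r1, i1] and [r2, i2, r3, i3].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn interleave_f32(reals: __m128, imags: __m128) -> (__m128, __m128) {
    (_mm_unpacklo_ps(reals, imags), _mm_unpackhi_ps(reals, imags))
}
```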

Initially, I tried always storing the data interleaved and deinterleaving/reinterleaving it on each pass, but that was slightly worse than the existing code - the biggest gain came from deinterleaving only once and re-interleaving only once, with the middle passes just loading/storing the data in its already-deinterleaved format. This is implemented in a very messy way right now, with 4 different butterfly_4 functions depending on whether this is the first pass, the last pass, a middle pass, or the only pass. Ideally we'll find a smoother way to write this. Maybe some const generics for whether or not to reorder the input/output?
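To make that const-generics idea concrete, here's one hypothetical shape it could take (the name and signature are made up for illustration; it reuses the deinterleave/interleave helpers sketched above):

```rust
// Hypothetical: one butterfly body, with the boundary reordering chosen at
// compile time. Monomorphization removes the branches entirely, so the
// first/middle/last/only-pass variants all come from one function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn butterfly_4_pass<const DEINTERLEAVE_IN: bool, const INTERLEAVE_OUT: bool>(
    chunk: &mut [__m128; 2],
) {
    if DEINTERLEAVE_IN {
        let (re, im) = deinterleave_f32(chunk[0], chunk[1]);
        chunk[0] = re;
        chunk[1] = im;
    }

    // ... the actual butterfly arithmetic on deinterleaved data goes here ...

    if INTERLEAVE_OUT {
        let (a, b) = interleave_f32(chunk[0], chunk[1]);
        chunk[0] = a;
        chunk[1] = b;
    }
}

// first pass:  butterfly_4_pass::<true, false>(...)
// middle pass: butterfly_4_pass::<false, false>(...)
// last pass:   butterfly_4_pass::<false, true>(...)
// only pass:   butterfly_4_pass::<true, true>(...)
```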

One thing I'm very curious to try is seeing whether the base FFT would benefit from being deinterleaved too. If so, we could do the deinterleave pass during the transpose, then pass deinterleaved data to the base FFT. There's a big complexity drawback here: we'd either need a separate trait for deinterleaved data, or accept the footgun of passing data through the FFT trait in a completely different format than it expects.
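For the "separate trait" option, I'm imagining something like this (purely hypothetical, names made up):

```rust
// Hypothetical: a parallel trait whose contract is that the buffers hold
// deinterleaved data, so it can never be confused with the interleaved
// Fft trait. T would be f32 or f64.
pub trait DeinterleavedFft<T>: Sync + Send {
    /// Computes an FFT in place, where `reals[i]` and `imags[i]` together
    /// form the i'th complex element. Both slices must have length `len()`.
    fn process_deinterleaved(&self, reals: &mut [T], imags: &mut [T]);

    /// The FFT size this instance computes.
    fn len(&self) -> usize;
}
```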

I probably won't merge this for a while - I want some time to think about exactly how the architecture should work. I suspect wasm simd would benefit the most from this change, but I did it on SSE first because it's the easiest to test. @HEnquist @pr1metine, if you have time, I'd love your thoughts on where this should go from here.

Benchmarks:
[chart: sse_f32]
[chart: sse_f64]

ejmahler commented 7 months ago

One more thought: If we have a trait for computing an FFT on already-deinterleaved data, we could make variants of Bluestein's and Rader's algorithms that do the deinterleaving during their setup/finalization phases - which would save even more time, because they each compute their inner FFT twice, with a complex multiplication pass in between. For Bluestein's it would be even better: the inner FFT is at least twice the length of the outer FFT, but we only have to actually do the deinterleaving for half of the inner FFT data. Two inner FFTs, each at least double the length, paying the deinterleave cost only once, cuts the relative overhead by at least 4x - so from the inner FFT's perspective, deinterleaving costs less than a quarter of what it does in this radix4 implementation.

ejmahler commented 6 months ago

As I was developing an SSE transpose so I could build the deinterleave into the transpose, I realized that the transpose itself would probably affect performance all on its own. So, to be scientific about it, I decided to measure the new transpose separately. The new "SSE transpose" line in these charts includes both the deinterleaving changes above and the new transpose. It's a pretty solid improvement all on its own. The f64 version doesn't actually include any SSE transpose instructions, so I suspect its improvement comes just from unrolling the loop.
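For anyone curious, the core of an SSE transpose for f32 complex data is just a 2x2 block swap - something like this (an illustrative sketch, not the code in this PR):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128, _mm_movehl_ps, _mm_movelh_ps};

// Transpose a 2x2 block of Complex<f32>: rows [a, b] and [c, d]
// (each __m128 holds two interleaved complex values) become [a, c] and [b, d].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn transpose_2x2_complex_f32(row0: __m128, row1: __m128) -> (__m128, __m128) {
    let out0 = _mm_movelh_ps(row0, row1); // [a.re, a.im, c.re, c.im]
    let out1 = _mm_movehl_ps(row1, row0); // [b.re, b.im, d.re, d.im]
    (out0, out1)
}
```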

[chart: sse_transpose_f32]
[chart: sse_transpose_f64]

pr1metine commented 6 months ago

Hey sorry for the late reply. These results do look very promising. While I do not believe I can contribute to the architecture in a meaningful way, I'm happy to implement it once we are happy with it.

rickwebiii commented 6 months ago

This looks highly related to a feature I was just about to request. Our application needs to perform polynomial multiply-adds via FFT as fast as possible. To this end, we're finding that one of the remaining bottlenecks is the multiply-add ("mad") pass over the O(N) frequency-domain points. This happens because the compiler has to introduce a bunch of swizzling code to deinterleave the Complex<f64> values. It occurred to me that it would be nice to have another API that takes the time and frequency domain arrays (and vice versa for the ifft) as two separate real and imaginary slices rather than as Complex<f64> values. I think this might also speed up the FFT itself.

Perhaps this work could be taken to the logical conclusion and you introduce another process_fast method to the Fft trait that allows users to not fool with Complex<T> values?

ejmahler commented 6 months ago

This looks highly related to a feature I was just about to request. Our application needs to perform polynomial multiply-adds via FFT as fast as possible... Perhaps this work could be taken to the logical conclusion and you introduce another process_fast method to the Fft trait that allows users to not fool with Complex<T> values?

I definitely think it would help. The problem is that every single algorithm, for every single instruction set, would have to be rewritten - it would essentially be a new library. Intimidating, to say the least. I'm open to exploring it, but it won't be soon. I could imagine some halfway solutions that take deinterleaved inputs and just interleave them internally, but that would lose the performance benefits. Or maybe all FFT algorithms could be written deinterleaved, with interleaved inputs deinterleaved at the boundary - but that still requires writing an entirely new library.

In the meantime, if your intermediate computations are a problem, I bet you could see some performance benefit by hand-writing some SIMD code that does the required swizzling in the most intelligent way possible.
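For example, something along these lines (a rough sketch - not RustFFT code, and the function name and layout assumptions are mine) does the deinterleave, the complex multiply-add, and the re-interleave with plain SSE2 unpacks:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// acc += a * b over two Complex<f64> values at a time. All pointers refer
// to interleaved complex data laid out as [re0, im0, re1, im1].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn complex_mad_2x_f64(acc: *mut f64, a: *const f64, b: *const f64) {
    let a0 = _mm_loadu_pd(a);        // [a0.re, a0.im]
    let a1 = _mm_loadu_pd(a.add(2)); // [a1.re, a1.im]
    let b0 = _mm_loadu_pd(b);
    let b1 = _mm_loadu_pd(b.add(2));

    // Deinterleave: gather the reals into one vector, the imaginaries into another.
    let ar = _mm_unpacklo_pd(a0, a1);
    let ai = _mm_unpackhi_pd(a0, a1);
    let br = _mm_unpacklo_pd(b0, b1);
    let bi = _mm_unpackhi_pd(b0, b1);

    // (ar + i*ai) * (br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br)
    let re = _mm_sub_pd(_mm_mul_pd(ar, br), _mm_mul_pd(ai, bi));
    let im = _mm_add_pd(_mm_mul_pd(ar, bi), _mm_mul_pd(ai, br));

    // Re-interleave the products and accumulate into acc.
    let acc0 = _mm_add_pd(_mm_loadu_pd(acc), _mm_unpacklo_pd(re, im));
    let acc1 = _mm_add_pd(_mm_loadu_pd(acc.add(2)), _mm_unpackhi_pd(re, im));
    _mm_storeu_pd(acc, acc0);
    _mm_storeu_pd(acc.add(2), acc1);
}
```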

HEnquist commented 6 months ago

Sorry for the very slow reply! This looks really promising - I didn't think there was much more room for improvement in the radix4 :) I don't think the "very messy way" it's implemented now looks so bad. The const generics help; that was a smart way of making all four of the needed butterflies.

One more thought: If we have a trait for computing an FFT on already-deinterleaved data, we could make variants of Bluestein's and Rader's algorithms that do the deinterleaving during their setup/finalization phases...

Wouldn't this require SIMD versions of these algorithms? I quite like that these "weird" algorithms only exist in a single scalar version, which can still benefit from SIMD in their inner FFTs. I don't have any numbers, but my feeling is that they are rarely used. So I would rather spend the time on squeezing more performance out of radix4 than on making SIMD versions of Bluestein's and Rader's.