data coalescing and all out of place

frymawe commented 8 months ago

With the possibility of breaking changes approaching what is the appetite for some input and output stride options? These are very useful for channelizers and resamplers.

Also for resamplers, the outputs of the forward fft get used for multiple inverse ffts, so right now that means memory copies, because even the out of place fft trashes the inputs. If there was a totally out of place transform with an immutable input it would resolve that.

Lastly, and in the same spirit as the strides, the option to output deinterleived real and imaginary would be nice for subsequent processing to avoid shuffles. I've used that with fftw and while not as clean as using something like &[complex\<f32x8>] it plays better with strides.

I can pitch in and help if these are things you find interesting.

HEnquist commented 8 months ago

frymawe commented 8 months ago

I'll benchmark the strided outputs vs manual transpose in fftw and cufft and see what the impact is, then post here.

The deinterlieving shouldn't have any additional cost, it might actually be faster as I think they are segregated into their own simd registers anyway.

Immutable inputs I have no intuition on, it seems like it should just cost extra scratch, but there could be som proximity benefits?

ejmahler commented 8 months ago

I spent a while thinking about deinterleaving while working for the AVX pathway. For scalar code, there's nothing stopping us from implementing a deniterleaved pathway today. For SIMD though, I remember there being an issue where it was a challenge to keep buffers aligned when the FFT size wasn't a multiple of the lane width.

As i think it through now, I can't remember exactly what the problem was, but it was something about not easily being able to pass data to sub-FFTs, but it's possible I just didn't think it through thoroughly enough.

I'm certain that deinterleaving would help the AVX pathway and probably other SIMD paths, since at the moment complex multiplies require a few data shuffles for each and every multiply, and that wouldn't be necessary if we had reals and imaginaries in separate registers. It's a daunting amount of work though, because all of the knowledge built about what algorithms are fastest, etc goes out the window.

I'm absolutely down to incorporate something like this, but it reminds me of something I was thinking about when working on the now-stalled real-only FFT work: At what point is this project too big for one crate? RustFFT as-is fills the single-purpose principle very well. The real-to-complex, strided, and deinterleaved features all introduce essentially entirely new, entirely separate pathways for FFT computation, diluting that single-purpose idea.

Something I was toying with back then but never carried through on was splitting into multiple crates, with RustFFT being an umbrella crate that just re-exports things from those other crates. I have no experience with that though, so I have no idea if it would make things simpler or harder. At minimum it would reduce compile times for people who know they only need one of the pathways.

This has been a long post, but the summary of my stance is that I'm very open to new strided, deinterleaved, and immutable-input pathways, but we're talking about adding a lot of new code to the project, so we should put some thought in beforehand about the best way to stop it from spiraling out of control.

ejmahler commented 8 months ago

Ah, now I remember. At the time, I was thinking along the lines of passing a single contiguous slice to a child FFT, where the reals would be in the first half of the slice, and the imaginaries would be in the second half. And this obviously doesn't work when delegating to a child FFT, because you'd have to do something similar to a transpose in order to get the parent reals and imaginaries into the right place.

But if the (out of place) trait method had 4 slices instead of 2 - input real, input imaginary, output real, output imaginary, then you could do parallel chunks of those exactly the same way the existing algorithm does parallel chunks of the interleaved buffers.

ejmahler / RustFFT

data coalescing and all out of place #125