calebzulawski opened this issue 2 months ago
Hi @calebzulawski,

Thank you for bringing up this issue. I'm assuming you mean that we should be able to provide the input to `fft_64` and `fft_32` as a vector of structs, where each struct is of type `num::complex::Complex`. Namely, we should have a version of `fft_64` (and `fft_32`) with the following input parameters:
```rust
pub fn fft_64(input: Vec<Complex64>, direction: Direction) {
    // <-- snipped -->
}
```
The original reason why this was never tried in `phastft` is the significant difference in code gen and performance that we saw in the quantum state simulator. Perhaps something such as SOA-derive could be of use here?
Let me know if I'm way off base or any other thoughts you may have. Thanks!
Best, Saveliy
Yep, that's exactly what I mean. I agree that it would definitely be a bit slower, so it would be best to offer it both ways, mostly as a convenience, because sometimes you can't choose your data format. Perhaps even offering combinations (interleaving only the input or only the output, for an out-of-place operation) would be helpful too.
@calebzulawski
Thank you for getting back to me. I wonder if it could be more performant to just "pre-process" the array-of-structs into the struct-of-arrays and then back again (also taking into account how the output should be formatted). I know that the interleaved format needed a lot of permute instructions when I used it in the past. You may have more insight to offer when it comes to that though.
Let me know what you think. Thanks!
Best, Saveliy
@calebzulawski
What do you think of starting out by offering versions of `fft_64_interleaved` and `fft_32_interleaved` that call the following function:
```rust
pub fn separate_re_im(signal: &[Complex<f64>], chunk_size: usize) -> (Vec<f64>, Vec<f64>) {
    let mut reals = Vec::with_capacity(signal.len());
    let mut imaginaries = Vec::with_capacity(signal.len());

    // We don't assume a power-of-2 size, so we don't use `chunks_exact`.
    // Process each chunk, including the last chunk, which may be smaller.
    for chunk in signal.chunks(chunk_size) {
        for num in chunk {
            reals.push(num.re);
            imaginaries.push(num.im);
        }
    }

    (reals, imaginaries)
}
```

before calling the `fft_64` or `fft_32` that we already have. Note that this function isn't generalized over `f64`/`f32`, as it's only for demonstrative purposes.
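For illustration, a rough sketch of how such a wrapper might be shaped. The `fft_64` stub, the `Direction` enum, and `Complex64` below are stand-ins for the real phastft function, its direction type, and `num::complex::Complex<f64>`; the real signatures may differ:

```rust
// Hypothetical `fft_64_interleaved`: deinterleave, run the existing
// planar FFT, then write the result back into the caller's buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex64 { re: f64, im: f64 }

#[allow(dead_code)]
enum Direction { Forward, Reverse }

// Stub: the real phastft `fft_64` transforms the planar buffers in place.
fn fft_64(_reals: &mut [f64], _imags: &mut [f64], _direction: Direction) {}

fn fft_64_interleaved(signal: &mut [Complex64], direction: Direction) {
    // Deinterleave into planar (struct-of-arrays) buffers...
    let mut reals: Vec<f64> = signal.iter().map(|c| c.re).collect();
    let mut imags: Vec<f64> = signal.iter().map(|c| c.im).collect();
    // ...run the existing planar FFT...
    fft_64(&mut reals, &mut imags, direction);
    // ...and interleave the result back into the caller's buffer.
    for ((out, &re), &im) in signal.iter_mut().zip(&reals).zip(&imags) {
        *out = Complex64 { re, im };
    }
}
```

This keeps the existing planar entry points untouched; the interleaved variants are pure convenience wrappers around them.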
Given how simple this function is, it should be possible to vectorize the heck out of it, just as @Shnatsel did for twiddle factor generation. Of course, we'd need the inverse function as well (i.e., to combine the reals/imaginaries back into `Complex` values).
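A minimal sketch of that inverse, with a hypothetical name `combine_re_im` and a local `Complex64` struct standing in for `num::complex::Complex<f64>`:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex64 { re: f64, im: f64 }

/// Zip the two planar buffers back into one interleaved vector.
/// Inverse of `separate_re_im` (hypothetical name).
fn combine_re_im(reals: &[f64], imaginaries: &[f64]) -> Vec<Complex64> {
    assert_eq!(reals.len(), imaginaries.len());
    reals.iter()
        .zip(imaginaries.iter())
        .map(|(&re, &im)| Complex64 { re, im })
        .collect()
}
```

Like the forward direction, this is a single linear pass, so the autovectorizer should handle it well.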
I wonder if the overhead of this would be acceptable, considering that it would allow us to avoid permute instructions when we utilize SIMD. In the past, I wasn't able to get around those when complex numbers were stored together.
Looking forward to hearing your thoughts. Thanks!
Best, Saveliy
Performance tip: always use `chunks_exact` + `remainder` instead of `chunks`.

And yes, it should be possible to vectorize the heck out of this.
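Following that tip, `separate_re_im` might be reworked roughly like this (again with a local `Complex64` standing in for `num::complex::Complex<f64>`); the main loop only ever sees full chunks, and the tail is handled separately:

```rust
#[derive(Clone, Copy)]
struct Complex64 { re: f64, im: f64 }

fn separate_re_im(signal: &[Complex64], chunk_size: usize) -> (Vec<f64>, Vec<f64>) {
    let mut reals = Vec::with_capacity(signal.len());
    let mut imaginaries = Vec::with_capacity(signal.len());

    let chunks = signal.chunks_exact(chunk_size);
    let remainder = chunks.remainder();

    // Every chunk here has exactly `chunk_size` elements, which gives the
    // compiler a fixed trip count to vectorize against.
    for chunk in chunks {
        for num in chunk {
            reals.push(num.re);
            imaginaries.push(num.im);
        }
    }
    // Scalar tail: fewer than `chunk_size` elements remain.
    for num in remainder {
        reals.push(num.re);
        imaginaries.push(num.im);
    }

    (reals, imaginaries)
}
```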
Yes, I think something like that is probably sufficient. With some trickery, it would also be possible to do it in place, since `&mut [Complex<f64>]` is the same shape as `&mut [f64]` with double the elements. More broadly, I think you could generalize the input and output transformations for various input and output shapes (plus things like output normalization by 1/N or 1/sqrt(N)).
I think there's a little bit more performance to pick up by combining the input/output transformations with the first or last loops of the FFT, but is that really important? Probably not to start.
> Performance tip: always use `chunks_exact` + `remainder` instead of `chunks`.
>
> And yes, it should be possible to vectorize the heck out of this.
I have found it can actually be slightly more efficient to segment like this instead. Imagine you have 11 elements in chunks of 4: [0, 1, 2, 3], [4, 5, 6, 7], [7, 8, 9, 10]. The element at index 7 gets computed twice (assuming you have working memory that is not invalidated), but this allows you to do only full vector loads, rather than a slow sequential load at the end.
A naive first shot at this could simply reorder the inputs and outputs, but a better implementation could embed this in the read and write loops.
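One way to sketch that segmentation is to compute the chunk start offsets, shifting the final chunk back so it ends exactly at the slice length. The name and shape here are illustrative, not phastft code; it assumes `len >= chunk` and that re-processing the overlap is harmless (idempotent writes or a separate output buffer):

```rust
/// Start offsets for `chunk`-wide windows covering `0..len`, where the
/// last window is moved back to end at `len` instead of being shortened.
/// Every window is full-width, so every load can be a vector load.
fn chunk_starts(len: usize, chunk: usize) -> Vec<usize> {
    assert!(chunk > 0 && len >= chunk);
    let mut starts: Vec<usize> = (0..len - chunk + 1).step_by(chunk).collect();
    // If the last full stride doesn't land exactly on `len`, add one
    // overlapping window aligned to the end of the slice.
    if starts.last().map(|&s| s + chunk) != Some(len) {
        starts.push(len - chunk);
    }
    starts
}
```

For the 11-element example above this yields starts 0, 4, and 7, i.e. windows [0, 3], [4, 7], and [7, 10], with index 7 covered twice.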