calebzulawski opened this issue 2 months ago
Hi @calebzulawski,

Thank you for bringing up this issue. I'm assuming you mean that we should be able to provide the input to `fft_64` and `fft_32` as a vector of structs, where each struct is of type `num::complex::Complex`. Namely, we should have a version of `fft_64` (and `fft_32`) with the following input parameters:
```rust
pub fn fft_64(input: Vec<Complex64>, direction: Direction) {
    // <-- snipped -->
}
```
The original reason why this was never tried in `phastft` is the significant difference in code gen and performance that we saw in the quantum state simulator. Perhaps something such as SOA-derive could be of use here?
Let me know if I'm way off base or any other thoughts you may have. Thanks!
Best, Saveliy
Yep, that's exactly what I mean. I agree that it would definitely be a bit slower, so it would be best to offer it both ways, mostly as a convenience, because sometimes you can't choose your data format. Perhaps even offering combinations (interleaving only the input or only the output, for an out-of-place operation) would be helpful too.
@calebzulawski
Thank you for getting back to me. I wonder if it could be more performant to just "pre-process" the array-of-structs into the struct-of-arrays and then back again (also taking into account how the output should be formatted). I know that the interleaved format needed a lot of permute instructions when I used it in the past. You may have more insight to offer when it comes to that though.
Let me know what you think. Thanks!
Best, Saveliy
@calebzulawski
What do you think of starting out by offering versions of `fft_64_interleaved` and `fft_32_interleaved` that call the following function:
```rust
pub fn separate_re_im(signal: &[Complex<f64>], chunk_size: usize) -> (Vec<f64>, Vec<f64>) {
    let mut reals = Vec::with_capacity(signal.len());
    let mut imaginaries = Vec::with_capacity(signal.len());

    // We don't assume a power-of-2 size, so we don't use `chunks_exact`.
    // Process each chunk, including the last chunk, which may be smaller.
    for chunk in signal.chunks(chunk_size) {
        for num in chunk {
            reals.push(num.re);
            imaginaries.push(num.im);
        }
    }

    (reals, imaginaries)
}
```

before calling the `fft_64` or `fft_32` that we already have. Note that this function isn't generalized over `f64`/`f32`, as it's only for demonstrative purposes.
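For illustration, a rough sketch of how such a wrapper might be shaped. The `fft_64` stub, the `Direction` enum, and `Complex64` below are stand-ins for the real phastft function, its direction type, and `num::complex::Complex<f64>`; the real signatures may differ:

```rust
// Hypothetical `fft_64_interleaved`: deinterleave, run the existing
// planar FFT, then write the result back into the caller's buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex64 { re: f64, im: f64 }

#[allow(dead_code)]
enum Direction { Forward, Reverse }

// Stub: the real phastft `fft_64` transforms the planar buffers in place.
fn fft_64(_reals: &mut [f64], _imags: &mut [f64], _direction: Direction) {}

fn fft_64_interleaved(signal: &mut [Complex64], direction: Direction) {
    // Deinterleave into planar (struct-of-arrays) buffers...
    let mut reals: Vec<f64> = signal.iter().map(|c| c.re).collect();
    let mut imags: Vec<f64> = signal.iter().map(|c| c.im).collect();
    // ...run the existing planar FFT...
    fft_64(&mut reals, &mut imags, direction);
    // ...and interleave the result back into the caller's buffer.
    for ((out, &re), &im) in signal.iter_mut().zip(&reals).zip(&imags) {
        *out = Complex64 { re, im };
    }
}
```

This keeps the existing planar entry points untouched; the interleaved variants are pure convenience wrappers around them.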
Given how simple this function is, it should be possible to vectorize the heck out of it, just as @Shnatsel did for twiddle factor generation. Of course, we'd need the inverse function as well (i.e., to combine the reals/imaginaries back into `Complex` values).
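A minimal sketch of that inverse, with a hypothetical name `combine_re_im` and a local `Complex64` struct standing in for `num::complex::Complex<f64>`:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex64 { re: f64, im: f64 }

/// Zip the two planar buffers back into one interleaved vector.
/// Inverse of `separate_re_im` (hypothetical name).
fn combine_re_im(reals: &[f64], imaginaries: &[f64]) -> Vec<Complex64> {
    assert_eq!(reals.len(), imaginaries.len());
    reals.iter()
        .zip(imaginaries.iter())
        .map(|(&re, &im)| Complex64 { re, im })
        .collect()
}
```

Like the forward direction, this is a single linear pass, so the autovectorizer should handle it well.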
I wonder if the overhead of this would be acceptable, considering that it would allow us to avoid permute instructions when we utilize SIMD. In the past, I wasn't able to get around those when complex numbers were stored together.
Looking forward to hearing your thoughts. Thanks!
Best, Saveliy
Performance tip: always use `chunks_exact` + `remainder` instead of `chunks`.

And yes, it should be possible to vectorize the heck out of this.
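Following that tip, `separate_re_im` might be reworked roughly like this (again with a local `Complex64` standing in for `num::complex::Complex<f64>`); the main loop only ever sees full chunks, and the tail is handled separately:

```rust
#[derive(Clone, Copy)]
struct Complex64 { re: f64, im: f64 }

fn separate_re_im(signal: &[Complex64], chunk_size: usize) -> (Vec<f64>, Vec<f64>) {
    let mut reals = Vec::with_capacity(signal.len());
    let mut imaginaries = Vec::with_capacity(signal.len());

    let chunks = signal.chunks_exact(chunk_size);
    let remainder = chunks.remainder();

    // Every chunk here has exactly `chunk_size` elements, which gives the
    // compiler a fixed trip count to vectorize against.
    for chunk in chunks {
        for num in chunk {
            reals.push(num.re);
            imaginaries.push(num.im);
        }
    }
    // Scalar tail: fewer than `chunk_size` elements remain.
    for num in remainder {
        reals.push(num.re);
        imaginaries.push(num.im);
    }

    (reals, imaginaries)
}
```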
Yes, I think something like that is probably sufficient. With some trickery, it would also be possible to do it in place, since `&mut [Complex<f64>]` is the same shape as `&mut [f64]` with double the elements. More broadly, I think you could generalize the input and output transformations for various input and output shapes (plus things like output normalization by 1/N or 1/sqrt(N)).
I think there's a little bit more performance to pick up by combining the input/output transformations with the first or last loops of the FFT, but is that really important? Probably not to start.
> Performance tip: always use `chunks_exact` + `remainder` instead of `chunks`.
>
> And yes, it should be possible to vectorize the heck out of this.
I have found it can actually be slightly more efficient to segment like this instead. Imagine you have 11 elements in chunks of 4: [0, 1, 2, 3], [4, 5, 6, 7], [7, 8, 9, 10]. The element at index 7 gets computed twice (assuming you have working memory that is not invalidated), but this allows you to do only full vector loads, rather than a slow sequential load at the end.
A naive first shot at this could simply reorder the inputs and outputs, but a better implementation could embed this in the read and write loops.
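One way to sketch that segmentation is to compute the chunk start offsets, shifting the final chunk back so it ends exactly at the slice length. The name and shape here are illustrative, not phastft code; it assumes `len >= chunk` and that re-processing the overlap is harmless (idempotent writes or a separate output buffer):

```rust
/// Start offsets for `chunk`-wide windows covering `0..len`, where the
/// last window is moved back to end at `len` instead of being shortened.
/// Every window is full-width, so every load can be a vector load.
fn chunk_starts(len: usize, chunk: usize) -> Vec<usize> {
    assert!(chunk > 0 && len >= chunk);
    let mut starts: Vec<usize> = (0..len - chunk + 1).step_by(chunk).collect();
    // If the last full stride doesn't land exactly on `len`, add one
    // overlapping window aligned to the end of the slice.
    if starts.last().map(|&s| s + chunk) != Some(len) {
        starts.push(len - chunk);
    }
    starts
}
```

For the 11-element example above this yields starts 0, 4, and 7, i.e. windows [0, 3], [4, 7], and [7, 10], with index 7 covered twice.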