get_inplace_scratch_len results compared to the input size

liebharc commented 3 years ago

Hi,

Thanks for the rustfft update :). I'm wondering about the results of get_inplace_scratch_len. I'm not claiming it's wrong, but I would like to double check with you.

I naively would have expected that the required scratch size never exceeds the length of the input array, but it does for some values, e.g.:

        let points = 5466;
        let mut input : Vec<Complex<f32>> = vec![Complex::zero(); points];

        let fft = {
            let mut planner = FftPlanner::new();
            planner.plan_fft_forward(points)
        };

        let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()];
        fft.process_with_scratch(&mut input, &mut scratch);
        assert!(scratch.len() <= input.len()); // Fails as 9354 > 5466

get_outofplace_scratch_len returns for a same input length 0.

As the required scratch space is a lot larger than the input size I'm wondering if that is correct/really needed?

HEnquist commented 3 years ago

The scratch space gets used by all the inner ffts, so the length must be sufficient for all of them. In case of Bluesteins for example, the length of the inner fft is at least 2*len -1. I guess this is behind what you are seeing (but I haven't checked exactly how 5466 gets planned).

liebharc commented 3 years ago

Thanks for the quick anwer. I did some more tests and yes Bluesteins seems to be the reason for this. In some experiments Bluesteins requires the scratch size to be about 5times the input size. So it seems to come with quite some cost regarding memory. Based on that I'm wondering if there should be an option to disable Bluesteins? A draft of that was used for my testing: https://github.com/liebharc/RustFFT/commit/2b08c71090f9cb3d22c7099285aa5955ac69f76c (Scalar plan builder only so far)

Downsides would be a more complicated API, possible slower performance (regarding CPU) for users which pick that option and that every added option makes the estimation planner and its analysis likely more complicated.

ejmahler commented 3 years ago

Tl;dr, it's expected that get_inplace_scratch_len() might return a number larger than the FFT len, even if get_outofplace_scratch_len() doesn't. I'll update the documentation to make that clear.

I'll go into more detail about exactly how this FFT is planned, if you're curious, to explain why that scratch is needed.

First, the prime factorization of 5466 is 2 3 911. RustFFT works by factoring FFT sizes into their prime factorization, and recursively computing sub-FFTs based on the prime factorization. Specifically, for 5466, the first thing the planner will do is create an instance of MixedRadix6xnAvx. This struct will treat the size-5466 input array as if it was a 2D array with a width of 911 and a height of 6. It computes sub-FFTs of size 6 down each column of that 2D array, then computes sub-FFTs of size 911 across each row.

Computing a sub-FFT of size 6 is very easy, and that struct exists because it's specialized to do so with AVX code, hence the "6xnAvx" in the name.

But computing a sub-FFT of size 911 is a little more challenging, because it can't be factorized anymore -- it's a prime number. For prime numbers, we have two choices:

Rader's Algorithm. This uses some number theory that only works for prime numbers, to reformulate a size-N FFT as a size-(N-1) FFT. If N is prime, then N-1 is guaranteed to be composite. And since N-1 is composite, we can continue our factorization logic above.
Bluestein's Algorithm. This reformulates a size-N FFT as a larger FFT, of at least twice the size. The inner FFT can be any size, so you're free to choose a FFT size that's easy/fast to compute.

At first glance, Rader's Algorithm might seem faster, but if the inner FFT for Raders has to deal with large prime factors itself, you end up having to deal with multiple layers of rader's algorithm, while Bluestein's can be set up to avoid that. So generally, Rader's is faster in a few specialized cases, while bluestein's provides more reliable baseline performance.

In this case, Bluestein's ends up being faster, so it plans a step of Bluestein's Algorithm, with an inner FFT size of 1944. Bluestein's will need a buffer to compute its inner FFT inplace, and its inner FFT will also need its own scratch space that will (probably) be equal to the inner FFT length - so the total scratch space it needs is 3888. And finally, for internal reasons that come down to the way data gets moved around internally during the in-place FFT process, MixedRadix6xn can't supply that much scratch space to its inner Bluesteins without requesting extra from the caller.

MixedRadix6xn could be tweaked so that it can provide that much scratch space, but a tradeoff would be that at the end of the method, it would have to do a memcpy from its scratch to its inplace buffer. I ultimately decided that requiring a little extra memory from the caller was worth avoiding the memcpy.

I hope you find this useful, I partially typed this all up to make sure I understood it myself :)

ejmahler commented 3 years ago

I think it's a great idea to make a variant of the planner that optimizes memory usage instead of maximum performance. In this case, there are a lot of other factors at play too regarding memory: For the scalar planner, MixedRadix in this situation can be swapped out for GoodThomasAlgorithm, which is 10% to 20% slower but requires len * sizeof(Complex<T>) less memory. Like you say, Bluestein's could be swapped out for Rader's, etc. There's a lot to explore here.

You should beware that always using raders will make FFT computation O(N^2) in a few very rare cases. Bluestein's only requires a extra scratch space when it's at the "top" of a chain of FFT algorithms -- if it's down the line far enough, then the algorithms that come before it have enough slack space that they can feed it in, so I think a better strategy would be to check in the "thrifty planner" how much scratch bluestein's will require. If it doesn't take any extra, use that as normal. That way, we can avoid the O(n^2) worst case.

ejmahler commented 3 years ago

For the time being, I don't have enough available time to do something like a "thrifty planner". I'm very open to something like that in principle, but I think it will take a lot of attention to detail that I don't have time for. So for the time being, RustFFT will keep focusing on maximum performance. But I promise I'll design things with a potential future focus on memory usage in mind!

In the meantime, if memory usage is a concern for you, I recommend either finding a nearby FFT size that doesn't need extra scratch (For example, 5472 shouldn't), or depending on a fork of RustFFT that disables bluestein's like you've done above.

ejmahler commented 3 years ago

So for the time being, RustFFT will keep focusing on maximum performance.

This is unless, of course, someone else is willing to take this on!

HEnquist commented 3 years ago

That might not be so bad when using an estimating planner. It would basically just need to add a penalty to the estimated cost depending on the amount of extra memory a recipe needs. One could then make several presets, like max performance, minimum ram usage, and something balanced in between. The estimating planner is fun to work with, but personally I'm only interested in max speed :)

ejmahler commented 3 years ago

PR #47 improves the documentation for get_inplace_scratch_len()

@liebharc If you'd like to continue discussion of a planner that optimizes for memory use, make a new issue for it and we can continue discussion there.

liebharc commented 3 years ago

Sounds good. It might be good to document if an upper boundary exists and if so what value it has. E.g. the scratch size will always be <= N * input size.

Regarding changes to the planner: Lets first wait until further feedback arrives. For me it's fine as it is.

ejmahler / RustFFT

get_inplace_scratch_len results compared to the input size #45