Estimating Planner - Githubissues

ejmahler commented 3 years ago

Right now, the planner uses hardcoded heuristics to plan FFTs. We benchmark with a bunch of different FFT setups, use those benchmarks to build an intuition about what's fastest, and then hardcode that intuition into the planner's decision-making process.

For example, in FftPlannerScalar::design_prime(), we check if any of the inner FFT's prime factors are greater than some constant. If they are, we switch to bluestein's. If they're all smaller, we use rader's. That's presumably because, through benchmarking, it was determined that large prime factors slow down Rader's Algorithm.

This approach results in generally fast FFTs, but it has some limitations:

It's very thought-intensive and labor-intensive, because a programmer has to think of every situation, write a benchmark for it, tweak their benchmarks to give useful information, glean a pattern, figure out how to code that program into the planner.
It's more or less impossible to cover every case. For example, the heuristic for how to divide factors for MixedRadix is very hard to build like this, because there's a combinatoric explosion in the choice of which prime factors to send to which inner FFT.

@HEnquist had the idea of changing our approach to instead build an estimate of how each individual FFT algorithm performs, and compose those estimates together to estimate how a chain of FFT algorithms would perform.

This approach is appealing because it limits the scope of any individual heuristic. Instead of having to measure Rader's Algorithm's prime factors, we only need to measure how Rader's Algorithm itself performs, independently of its child FFT. Then, when choosing a FFT algorithm, we can compose our estimate of Rader's Algorithm with our indepently-determined estimate of rader's child FFT.

As a proof of concept, I forked @HEnquist 's crate for generating performance charts https://github.com/ejmahler/fftbenching and changed it to measure the performance of MixedRadix2xnAvx, 3xnAvx, 4xnAvx etc, divided by the performance of its inner FFT, and I got this chart:

performance

Based on eyeballing the chart, we can estimate a few different approaches to computing a FFT of size 480k. One option is to use MixedRadix12xnAvx with an inner FFT of size 40k. Eyeballing the chart, this would add a 15x multiplier to whatever the performance of the 40k FFT is.

Estimating other strategies from going from 480k to 40k:

We could use a step of 3xn, then a step of 4xn. 3xn would add a multiplier of 3.85x, and 4xn would add a multiplier of 4.7x. These would combine to multiply the running time of the 40k FFT by 18.1x.
We could use a step of 2xn, then a step of 6xn. 2xn would add a multiplier of 2.5x, and 6xn would add a multiplier of 7.1x. These would combine to multiply the running time of the 40k FFT by 17.75x.

So based on these rough estimates, 12xn is much faster. This matches the hardcoded heuristics of the AVX planner, which strongly prefers 12xn over any other algorithm. If we could find a way to automate this estimation process, we could have a planner whose heuristics are much more self-contained and reliable.

Some unanswered questions:

Is this approach sound? It's possible that composing "multiplers" like this misses some important facet of performance.
How do we improve the accuracy of benchmark collection? The data in the example above is very noisy, which will make it hard to generate useful estimates.
How do we code these estimates into the planner? I was thinking of storing a list of keyframes, where each entry in the list is tuple of FFT size and multiplier at that size, and we lerp between keyframes. But how do we build these keyframes in a rigorous way from the data? Eyeballing it worked for this example, but probably isn't great overall.
How will the planner use these estimates? One idea I had was to run Dijkstra's Algorithm, where each node is a FFT size, and each edge is a FFT algorithm, and the edge cost is the multiplier for using that FFT algorithm.
- One snag with Dijkstra's Algorithm is Bluestein's Algorithm. Bluestein's supports an infinite number of inner FFT sizes, so we'll have no choice but to hardcode some heuristics here.
- Another idea is to do something like dynamic programming, where instead of going top down, we go ground-up, generating every possible combination of factors as we build. A cool thing about this approach is that we already have a recipe cache, so we could include the estimate in the Recipe struct, and use the recipe cache to memoize each layer. Then, if the caller planned another FFT, the recipe cache would still have a lot of the estimates ready to go.
How do we automate the analysis of benchmarks? This setup seems a lot more daunting if it requires hours of manual analysis and data entry, whereas if one of us can just run the benchmarks overnight and then run an analysis script, it seems a lot easier.
MixedRadix8xn etc are easy to graph like this, because they only have one inner FFT. MixedRadix and GoodThomasAlgorithm have two inner FFTs, so they have two independent variables. How do we measure these?

I wrote all this out to collect my thoughts on the issue and start a centralized discussion on it. I can see this shipping with rustFFT 5.1 or something if every goes smoothly - or maybe this is a yearlong project.

HEnquist commented 3 years ago

I just made a PR for my first prototype estimating planner, mostly to show it and get some feedback. For mixed raders, it doesn't estimate all the different combinations of inner lengths, instead it compares two strategies, both borrowed from the existing fixed planner. The first is to try to make the two lenghts as equal as possible, and the second to split to one power-of- two length and one with whatever is left.

Bechmarking seems tricky. We need many many measurement points for this, so we can't average for too long on each one. I tried a few different ways but it's always noisy. One problem seems to be the fancy turbo frequencies etc of modern cpus which makes the speed vary in strange ways. Criterion is good at giving consistent results, but it's much too slow.

pictographer commented 3 years ago

fftw automated the process of generating benchmarks based on the requested parameters and it would automatically search among the benchmarks for the best implementation on the user's target platform. I'm no expert, so I don't know if this would be useful for RustFFT but it seems like a clever approach.

HEnquist commented 3 years ago

Yes fftw can run some measurements to make sure it picks the best implementation. The downside is that this is relatively time consuming (I think is was half a second or something last time I tried it). It can also make a very large set of measurements and store the results in a "wisdom" database, which can then be used to instead of making new measurements. Both of these are quite brute force, and my feeling is that it's possible to plan well enough without them.

ejmahler commented 3 years ago

I definitely think a measuring planner is an interesting direction to go long-term. I'd call it out of scope for at least a couple years though, because yeah, i think we can get users to the 95% level with just estimation.

ejmahler commented 3 years ago

I was searching for info on more accurate measurements, and I found this: https://hackmd.io/sH315lO2RuicY-SEt7ynGA

It's pretty long, but the gist of it is that this crate has an unreleased (ie it's in master but not on crates.io) feature to measure performance based on CPU cycle counter

To use it, you'd have to:

clone the repo and run cargo doc --open to learn the API to use
Find a linux distro, because it only supports linux
pin the benchmark process to a single CPU
run the benchmark, and time it with the meaureme library

Based on the results in that article, this may give us measurements with significantly less noise. @HEnquist Are you interested in investigating this? If not I can do it after I've wrapped up real FFT work.

HEnquist commented 3 years ago

Very interesting! I'll take a look when I'm done with adding sse/avx to my resampling lib (got inspired by the big gains avx gives in rustfft). In less than a week hopefully.

ejmahler commented 3 years ago

https://github.com/bheisler/iai released today

HEnquist commented 3 years ago

That is nice! Looks much easier to use than measureme. I'll give it a try!

HEnquist commented 3 years ago

I played a little with iai just now, looks very promising! https://github.com/HEnquist/RustFFT/blob/iai/benches/bench_iai.rs The only problem is that it can't exclude the setup. So each test must run twice, once with and once without performing the actual fft. But it's impressively stable. Running the bench over and over gives numbers that mostly stay within 0.02%.

ejmahler commented 3 years ago

Great!

It seems like the next step, until they support some form of procedural testing, is to make a rust script that generates test cases, similar to my "compare_2nem_strategies" benchmark.

How long did it take to run? To benchmark something like MixedRadix with this, we'd need a really high volume of tests

HEnquist commented 3 years ago

Those 32 tests (or really 16 since I need to run each one twice) ran in 7 seconds on my ryzen laptop. Leaving it overnight should produce a nice big set of results.

I'm thinking to use macros to generate the test cases, and then make a python script to read and analyze the results. There is a iai_macro crate that should make things a little easier, but I haven't figured out how to use it yet.

HEnquist commented 3 years ago

I made some simple macros to generate the tests, and ran a series for radix 4. Looks pretty nice! radix4 Length on x, estimated cycles on y. Next I'll look into the algorithms that have inner ffts.

ejmahler commented 3 years ago

One thing I observed while benchmark radix4 is that it's faster at small-medium sizes, but at large sizes, making a mixed radix of roughly-even size was faster.

As a quick test to see if this tool is giving real-world-applicable results, one useful test would be to plot the radix4 results, compared with the results of putting the power2 FFT inside a MixedRadix, with radix4 as the two inner FFTs. IE for 4096, do a MixedRadix with 2 inner size-64 Radix4 instances, for 8192, do a MixedRadix with inner 64 and 128 Radix4 instances.

If it's working, we should see radix4 be faster at first, but get overtaken by MixedRadix.

HEnquist commented 3 years ago

Actually, that seems to have changed after this: https://github.com/ejmahler/RustFFT/commit/6ab56f96fa0c0779a8886d2755c979bd836e8f41#diff-07bb71e908a04b41bccc8c3665eae0e78f9d562af2e4425df3c442c2337f5bdc

That "slightly faster" seems to grow with length, so I don't really see mixed radix winning any more (at length 4194304 and below at least). In my previous benchmarks at 4194304, I get that mixed radix is 17% slower than radix4. Using iai I get mixed radix 28% slower instead.

ejmahler commented 3 years ago

Oh dang. Well as long as iai reflects reality that's a good sign haha

ejmahler / RustFFT

Estimating Planner #37