ejmahler / RustFFT

RustFFT is a high-performance FFT library written in pure Rust.
Apache License 2.0
678 stars 47 forks source link

Autogenerate entire prime butterfly files, rather than just chunks #137

Closed ejmahler closed 5 months ago

ejmahler commented 5 months ago

This PR replaces the old python script for generating chunks of prime butterflies with a Rust program (under tools/gen_simd_butterflies) that generates the entire file. It's a single program capable of generating output for SSE, NEON, and Wasm Simd, determined by a command line argument. When we support FCMA, it will be trivial to add that as an option. More instruction sets like AVX could be added, but it will take more work so I didn't include it here.

I've been thinking about long-term sustainability of current approach to SIMD. We're talking about adding a new code path for the FCMA instruction set, there's AVX512 to consider planning around, AVX10 was announced, etc, and I don't think it's sustainable to keep copying as much code as we're copying if we want implementations for each of these - which I do. I think the biggest challenge is the butterflies. Even with our excellent boilerplate macros, they are a massive pain to work with and it's very easy to make mistakes. Any system-wide change we want to make has to be duplicated dozens of times - once per length, and then repeated per instruction set.

The high-level goal of this PR is to wrangle some of the complexity of our butterfly system - to take something that is extremely repetitive and verbose and replace it with something concise - all while keeping everything human-readable. This code generation functions very similarly to a macro, but unlike a macro, we can easily see the expanded code, contributing to the goal of human readability I mentioned above.

I also don't think our O(n^2) pile of multiplies and adds we do in the body of these FFTs is easy to replicate with a macro, and that's the most important part to automate!

If this ends up making our code easier to reason about, I would like to expand it to perhaps include good-thomas butterflies and the bigger mixed radix butterflies. There will always be some hand-written ones we don't want to automate (1, 2, 3, 4, 8 being the most likely sizes that we always hand-write) but all others operate on a pretty strict formula that we don't need to trouble ourselves with writing over and over and over again.

This PR also rearranges the body of these FFTs so that the O(n^2) pile of math expressed as FMAs, meaning we can use explicit FMA on neon while keeping the implementation of each instruction set identical. It mildly helps Wasm Simd and largely doesn't change the performance of SSE, but the rearranging alone brings solid performance wins to Neon. Converting the math ops to FMA helps even more.

This PR also moves the rotations before the big math pile - that's because for FCMA, we'll be able to omit the rotations entirely as long as we make sure to start from the "rotations first" format, and I want to keep all the implementations as similar as possible, so they all do rotations first.

Attached are performance measurements. autogen_sse_f32 autogen_sse_f64

autogen_wasm_simd_f32 autogen_wasm_simd_f64

autogen_neon_f32 autogen_neon_f64

ejmahler commented 5 months ago

Aside from fixing the build, there are 2 more tasks to complete before I can merge this:

HEnquist commented 5 months ago

This looks great! It would be neat to avoid preprocessing. I'm not familiar with tinytemplate and handlebar, but I use jinja2 in python, and it supports custom delimiters. There is minijinja for rust, which also supports this: https://docs.rs/minijinja/latest/minijinja/syntax/index.html#custom-delimiters This could maybe be used to avoid escaping.