ejmahler / RustFFT

RustFFT is a high-performance FFT library written in pure Rust.
Apache License 2.0

DFT via Apple Accelerate framework #126

Closed zetanumbers closed 4 months ago

zetanumbers commented 6 months ago

There's a Discrete Fourier Transform API within Apple's Accelerate framework. I think it uses the special AMX2 coprocessor on M1 Macs. I wrote a simple library for DFT using apple-sys (note: apple-sys requires the Apple LLVM toolchain from Xcode, not the one from Homebrew). Benchmarks promise up to a 2.0x speedup against the NEON implementation and a 4x speedup against the scalar implementation, which is super nice. I'm not sure about the current implementation, but I wonder if this could be combined with others. Is there a possibility of integrating it into RustFFT under some new feature?
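For reference, Accelerate's vDSP DFT routines operate on split-complex data (separate real and imaginary arrays, as in DSPSplitComplex), while RustFFT works on interleaved Complex<T> buffers, so any wrapper has to convert between the two layouts on the way in and out. A minimal sketch of that conversion in plain Rust (the function names here are illustrative, not from my crate):

```rust
/// Split an interleaved [re0, im0, re1, im1, ...] buffer into separate
/// real and imaginary arrays, as vDSP's split-complex layout expects.
fn deinterleave(input: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(input.len() % 2 == 0);
    let re = input.iter().step_by(2).copied().collect();
    let im = input.iter().skip(1).step_by(2).copied().collect();
    (re, im)
}

/// Inverse operation: merge split real/imaginary arrays back into an
/// interleaved buffer.
fn interleave(re: &[f32], im: &[f32]) -> Vec<f32> {
    assert_eq!(re.len(), im.len());
    re.iter().zip(im).flat_map(|(&r, &i)| [r, i]).collect()
}

fn main() {
    let interleaved = [1.0, 2.0, 3.0, 4.0];
    let (re, im) = deinterleave(&interleaved);
    assert_eq!(re, [1.0, 3.0]);
    assert_eq!(im, [2.0, 4.0]);
    assert_eq!(interleave(&re, &im), interleaved);
    println!("ok");
}
```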

Benchmarks

Run on an Apple M1 Pro MacBook Pro, macOS 14.2.1 (23C71). Visual results were normalized against the scalar median benchmark times (lower is better). Horizontal values represent lengths. Every supported input array length (as listed in Apple's documentation) is included. Bars for input lengths that are powers of two are highlighted. Benchmark code: https://github.com/zetanumbers/apple-accelerate-dft/blob/f1793f202b52bf275103863363f028c24b9add48/benches/vs_rustfft.rs

Raw benchmark result data:

|type|len|scalar_median_ns|neon_median_ns|accelerate_median_ns|scalar_deviation_ns|neon_deviation_ns|accelerate_deviation_ns|
|-|-|-|-|-|-|-|-|
|f32|8|12|14|20|0|0|0|
|f32|12|33|14|19|0|0|0|
|f32|16|21|16|19|0|0|0|
|f32|20|50|41|21|0|0|0|
|f32|24|69|58|23|0|0|0|
|f32|32|61|30|25|0|0|0|
|f32|36|118|88|29|4|1|0|
|f32|40|108|75|30|1|0|0|
|f32|48|147|111|39|1|1|0|
|f32|60|269|140|49|5|1|0|
|f32|64|164|78|60|2|1|0|
|f32|72|230|137|57|4|2|0|
|f32|80|230|181|61|2|2|1|
|f32|96|421|214|73|5|6|0|
|f32|120|585|277|93|11|4|1|
|f32|128|398|177|215|4|2|0|
|f32|144|998|325|226|9|4|0|
|f32|160|678|371|240|13|4|1|
|f32|192|946|642|264|14|10|5|
|f32|240|1194|481|304|15|6|1|
|f32|256|841|355|296|9|4|1|
|f32|288|1243|630|353|18|7|1|
|f32|320|1387|872|378|24|12|1|
|f32|384|1861|1121|428|28|14|5|
|f32|480|2725|1005|498|44|41|7|
|f32|512|2074|895|512|43|28|2|
|f32|576|2766|1600|710|36|30|6|
|f32|640|3110|1842|753|22|28|4|
|f32|768|3860|2230|821|20|22|2|
|f32|960|5936|2589|1253|82|35|11|
|f32|1024|4305|1745|1034|62|24|7|
|f32|1152|6177|3411|1531|71|51|37|
|f32|1920|12964|5522|2780|204|61|25|
|f32|2048|10102|4220|2005|130|55|9|
|f32|4096|20811|8243|4483|236|77|24|
|f64|8|12|11|22|0|0|0|
|f64|12|33|12|18|1|0|0|
|f64|16|21|15|21|0|0|0|
|f64|20|57|45|24|0|0|0|
|f64|24|71|70|29|1|1|0|
|f64|32|60|35|37|0|0|0|
|f64|36|128|102|44|1|2|0|
|f64|40|122|92|47|2|1|1|
|f64|48|152|135|65|2|2|1|
|f64|60|295|175|87|7|2|0|
|f64|64|170|120|101|2|2|1|
|f64|72|239|169|102|3|2|1|
|f64|80|258|238|115|2|3|3|
|f64|96|446|284|138|15|9|1|
|f64|120|632|372|198|8|5|2|
|f64|128|413|278|176|3|3|2|
|f64|144|1064|431|230|7|21|4|
|f64|160|760|512|248|10|8|2|
|f64|192|1006|817|343|24|6|3|
|f64|240|1294|673|434|9|9|3|
|f64|256|880|594|415|10|21|3|
|f64|288|1332|801|548|15|8|4|
|f64|320|1590|1223|623|25|16|6|
|f64|384|2012|1544|746|31|14|17|
|f64|480|3049|1415|1005|130|26|8|
|f64|512|2121|1412|973|37|25|10|
|f64|576|2989|2214|1220|62|49|27|
|f64|640|3508|2629|1331|42|114|9|
|f64|768|4160|3175|1734|64|49|14|
|f64|960|6615|3828|2230|72|50|26|
|f64|1024|4446|2932|2203|159|22|89|
|f64|1152|6633|4783|2798|98|78|39|
|f64|1920|14332|8261|5039|167|141|45|
|f64|2048|10308|6767|4797|75|98|37|
|f64|4096|21511|14147|10522|200|125|99|

Comparison benchmarks for f32

apple-accelerate-dft-f32

Comparison benchmarks for f64

apple-accelerate-dft-f64

zetanumbers commented 6 months ago

Well, I see one issue: the output may be normalized in a different way than RustFFT's.

ejmahler commented 6 months ago

Since we advertise being implemented in pure Rust, I'm reluctant to add a feature that requires FFI to another language, or calling into proprietary libraries.

I'm very interested in how they get this speedup though; maybe we can integrate that directly. One thing left on the table: if I remember correctly, ARM has complex multiplication instructions, which we didn't use because they weren't stabilized yet, or for some similar reason. We could look into NEON impls that use those instructions, which would likely speed things up.

HEnquist commented 6 months ago

Yeah, ARM has some very useful complex math instructions that I think will give a nice speed boost. They have been added to core::arch but are still marked as experimental: https://doc.rust-lang.org/core/arch/aarch64/fn.vcmla_f32.html I'm planning on implementing these as soon as they are stable.

zetanumbers commented 6 months ago

Since we advertise being implemented in pure Rust, I'm reluctant to add a feature that requires FFI to another language, or calling into proprietary libraries.

There could be a separate crate acting as a safe wrapper. Well, I'm sad to hear that. It would have been really powerful to see RustFFT embracing and combining algorithms with those from Accelerate.

I'm very interested in how they get this speedup though; maybe we can integrate that directly. One thing left on the table: if I remember correctly, ARM has complex multiplication instructions, which we didn't use because they weren't stabilized yet, or for some similar reason. We could look into NEON impls that use those instructions, which would likely speed things up.

As I've said, it probably uses the custom AMX2 coprocessor on M1 chips, so unless you are interested in reverse engineering, this is the only way to access it (I couldn't verify that assumption though; I guess I'm not so proficient with lldb). But this is usual stuff; the standard library absolutely uses proprietary libraries just like this. And it comes with Xcode, which is required for Rust anyway.

zetanumbers commented 6 months ago

Yeah, ARM has some very useful complex math instructions that I think will give a nice speed boost. They have been added to core::arch but are still marked as experimental: https://doc.rust-lang.org/core/arch/aarch64/fn.vcmla_f32.html I'm planning on implementing these as soon as they are stable.

I bet those aren't enough. Anyway, it can be experimented with.

HEnquist commented 6 months ago

I bet those aren't enough. Anyway, it can be experimented with.

I don't expect a factor of 2. I would guess that a boost of something like 20-30% is possible.

HEnquist commented 6 months ago

I made a quick test here: https://github.com/HEnquist/RustFFT/commit/8ca7cd0ef7b5c62ce8d0656e7c372e7a79ac0b66

This only changes the complex multiplication function. The vcmla instructions could be used in many other places too. But already this small change gives a nice boost to Radix4. This is Float32, but 64 is similar:

|Length|Current (ns)|Vcmla (ns)|Improvement %|
|-|-|-|-|
|64|80|70|14.3|
|128|181|153|18.3|
|256|362|289|25.3|
|512|912|749|21.8|
|1024|1789|1385|29.2|
|2048|4308|3428|25.7|
|4096|8431|6394|31.9|
|16384|41471|32302|28.4|
|65536|190050|155293|22.4|
|1048576|4298154|3547916|21.1|

ejmahler commented 6 months ago

What other places do you have in mind for these instructions? I was aware that they'd help with multiplication, but I'm curious to learn how else they can be used.

zetanumbers commented 6 months ago

I made a quick test here: HEnquist@8ca7cd0

I'm surprised it's implemented as two intrinsics, not one. I thought one vcmla_f64 call would have been enough.

And also, not sure how target feature handling will be done. NeonFcmaPlanner?

HEnquist commented 6 months ago

What other places do you have in mind for these instructions? I was aware that they'd help with multiplication, but I'm curious to learn how else they can be used.

They are also useful for all the rotations etc we do while applying the twiddle factors. But one could of course argue that this is also multiplication :)

I'm surprised it's implemented in two intrinsics, not one. I thought one vcmla_f64 call would have been enough.

Yeah, that's how they work. A complete multiplication is (a + ib) * (c + id) = (ac - bd) + i(ad + bc). But a single vcmla only does half of it, so the first one only produces ac + iad, and then the second does the rest and sums it up.
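To make the two-step split concrete, here's a scalar model of one complex lane (the function names mirror the NEON intrinsics, but this is plain Rust I wrote for illustration, not the real instructions):

```rust
/// Scalar model of one complex lane of FCMLA with rotation 0:
/// accumulate the real part of `x` times both parts of `y`.
/// acc += (x.re*y.re, x.re*y.im), i.e. the ac + i*ad half.
fn cmla_rot0(acc: (f64, f64), x: (f64, f64), y: (f64, f64)) -> (f64, f64) {
    (acc.0 + x.0 * y.0, acc.1 + x.0 * y.1)
}

/// Scalar model of FCMLA with rotation 90: accumulate the imaginary
/// part of `x` times `y`, rotated by 90 degrees.
/// acc += (-x.im*y.im, x.im*y.re), i.e. the -bd + i*bc half.
fn cmla_rot90(acc: (f64, f64), x: (f64, f64), y: (f64, f64)) -> (f64, f64) {
    (acc.0 - x.1 * y.1, acc.1 + x.1 * y.0)
}

/// Full complex multiply built from the two halves, mirroring the
/// vcmla_f64 + vcmla_rot90_f64 pair in the linked commit.
fn complex_mul(x: (f64, f64), y: (f64, f64)) -> (f64, f64) {
    cmla_rot90(cmla_rot0((0.0, 0.0), x, y), x, y)
}

fn main() {
    // (1 + 2i) * (3 + 4i) = (3 - 8) + i(4 + 6) = -5 + 10i
    assert_eq!(complex_mul((1.0, 2.0), (3.0, 4.0)), (-5.0, 10.0));
    println!("ok");
}
```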

And also, not sure how target feature handling will be done. NeonFcmaPlanner?

Yes, now I'm just playing but this will need to become a separate implementation with runtime detection etc.

HEnquist commented 6 months ago

Some more work on this. The left table shows the same as before, with only the complex multiplication updated. The column to the right is with updated rotations for the twiddle factors:

[Screenshot: benchmark comparison tables, 2024-01-13 at 21:26]

There is a little more to do, but I think this is already starting to get close to what is possible to get.

HEnquist commented 6 months ago

I compared a few numbers between my version and the Accelerate results above, and at least for f64 I pretty much match the speed of Accelerate. For example, at f64 length 1024 I get 2166 ns/iteration, and Accelerate gets 2203.

For f32 Accelerate still wins, but the difference has shrunk considerably. At length 1024, for example, I get 1298 ns and Accelerate gets 1034 ns.

HEnquist commented 6 months ago

If someone wants to try my version, use the latest nightly compiler, and add rust flags to enable the fcma target feature. Example: RUSTFLAGS="-C target-feature=+fcma" cargo bench

zetanumbers commented 6 months ago

If someone wants to try my version, use the latest nightly compiler, and add rust flags to enable the fcma target feature. Example: RUSTFLAGS="-C target-feature=+fcma" cargo bench

And where to get this version?

HEnquist commented 6 months ago

Here: https://github.com/HEnquist/RustFFT/tree/arm_complex_mul

ejmahler commented 6 months ago

And also, not sure how target feature handling will be done. NeonFcmaPlanner?

The AVX planner already distinguishes between the presence of AVX vs AVX2. In retrospect I should just have made the whole thing require AVX2, but a silver lining is that now we can just repeat that pattern for these features.

ejmahler commented 4 months ago

Here: https://github.com/HEnquist/RustFFT/tree/arm_complex_mul

This is great. I took a look at this, and there's some cool stuff here. I noticed your implementation of rotate_and_add, and realized that your technique of "abusing" the FCMA intrinsic's accumulator parameter by accumulating unrelated values into it could also be used for butterfly 3. Here, instead of having an explicit rotate step like we have in the main implementation, we use vcmlaq_rot90 to roll the rotation into the output FMA instructions, taking a few instructions out of this function's dependency chain:

#[inline(always)]
pub(crate) unsafe fn perform_parallel_fft_direct(
    &self,
    value0: float32x4_t,
    value1: float32x4_t,
    value2: float32x4_t,
) -> [float32x4_t; 3] {
    let x12p = vaddq_f32(value1, value2);
    let x12n = vsubq_f32(value1, value2);

    let temp = vfmaq_f32(value0, self.twiddle1re, x12p);

    let x0 = vaddq_f32(value0, x12p);
    let x1 = vcmlaq_rot90_f32(temp, self.twiddle1im, x12n);
    let x2 = vcmlaq_rot270_f32(temp, self.twiddle1im, x12n);
    [x0, x1, x2]
}

Old benchmarks (these include the FMA optimizations I merged earlier today, but not the code I pasted above):

test neon3_butterfly32_03            ... bench:         531 ns/iter (+/- 10)
test neon3_butterfly32_06            ... bench:       1,274 ns/iter (+/- 1)
test neon3_butterfly32_09            ... bench:       2,785 ns/iter (+/- 6)
test neon3_butterfly32_12            ... bench:       3,029 ns/iter (+/- 5)
test neon3_butterfly32_15            ... bench:       5,800 ns/iter (+/- 3)
test neon3_butterfly64_03            ... bench:         760 ns/iter (+/- 2)
test neon3_butterfly64_06            ... bench:       1,668 ns/iter (+/- 4)
test neon3_butterfly64_09            ... bench:       4,323 ns/iter (+/- 84)
test neon3_butterfly64_12            ... bench:       4,859 ns/iter (+/- 11)
test neon3_butterfly64_15            ... bench:       9,799 ns/iter (+/- 72)

New benchmarks:

test neon3_butterfly32_03            ... bench:         494 ns/iter (+/- 2)
test neon3_butterfly32_06            ... bench:       1,210 ns/iter (+/- 31)
test neon3_butterfly32_09            ... bench:       2,535 ns/iter (+/- 3)
test neon3_butterfly32_12            ... bench:       2,873 ns/iter (+/- 2)
test neon3_butterfly32_15            ... bench:       5,624 ns/iter (+/- 4)
test neon3_butterfly64_03            ... bench:         681 ns/iter (+/- 2)
test neon3_butterfly64_06            ... bench:       1,520 ns/iter (+/- 2)
test neon3_butterfly64_09            ... bench:       3,897 ns/iter (+/- 6)
test neon3_butterfly64_12            ... bench:       4,466 ns/iter (+/- 4)
test neon3_butterfly64_15            ... bench:       9,352 ns/iter (+/- 12)

Mild gains but every bit helps, especially on top of everything else we'd be getting out of FCMA instructions.

I made a quick test here: HEnquist@8ca7cd0

I'm surprised it's implemented as two intrinsics, not one. I thought one vcmla_f64 call would have been enough.

I'm glad they split it into two, because if it was just one, the above optimization wouldn't be possible!

HEnquist commented 4 months ago

I've put this a bit to the side for now, waiting for these intrinsics to leave the experimental state. It could of course be released in some preview version that requires a nightly compiler. Is that worth the effort? I'd say that depends on how long it takes for the intrinsics to make it into stable; I haven't checked whether there is some plan for this.

ejmahler commented 4 months ago

I'm fine with merging it as a nightly-only opt-in feature. That's what we did for NEON in the first place.

I'm gonna close this issue since I think we need to decline using Accelerate, unfortunately. Hopefully once we get FCMA in, it'll close the gap.