ejmahler / RustFFT

RustFFT is a high-performance FFT library written in pure Rust.
Apache License 2.0
692 stars 48 forks source link

Neon #78

Closed HEnquist closed 3 years ago

HEnquist commented 3 years ago

Now that all the needed intrinsics are available, it's time for some Neon! This is basically a direct translation of the SSE code. Running on Cortex A72 (a Raspberry Pi4), I get a speedup of about 50% for f32, and none for f64. The Neon unit of the A72 can only execute a single 128-bit operation at a time. But it can do two f64 operations in parallel, meaning there isn't really any advantage to Neon here. More advanced cores should do better. To build this, you need a compiler that has this merged: https://github.com/rust-lang/rust/pull/89145 Reason is here: https://github.com/rust-lang/stdarch/issues/1220 Once the latest nightly can be used, I'll add a CI job.

HEnquist commented 3 years ago

neon_p2comp Compared to the scalar version, on a Raspberry Pi 4.

ejmahler commented 3 years ago

This is great! If this depends on things that are in nightly, it will be quite a while before we can include this PR in a release. But I definitely am interested in getting it in.

I’m very curious to see how a more powerful machine handles it. I’m also interested to see how the assembly differs between x64 and arm targets.

On Fri, Sep 24, 2021 at 3:33 PM Henrik Enquist @.***> wrote:

[image: neon_p2comp] https://user-images.githubusercontent.com/6504678/134746092-3f433f8a-1c76-4052-a0b3-140dd246ee82.png Compared to the scalar version, on a Raspberry Pi 4.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ejmahler/RustFFT/pull/78#issuecomment-926953333, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI2M6TLMETTHPUNH7X4AQ3UDT4DTANCNFSM5EV4RHVQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

HEnquist commented 3 years ago

Yes I would also like to see how this performs on a more powerful cpu. I'm hoping to find something I can borrow, a Mac with the M1 chip for example.

The need for nightly to use Neon will probably remain for some time. There is a tracking issue here: https://github.com/rust-lang/rust/issues/48556

The assembly contains the expected instructions, but I haven't compared to the SSE version. I'll do that and add here!

HEnquist commented 3 years ago

I tried with the float32 4-point butterfly.

SSE:

_ZN7rustfft3sse15sse_butterflies25SseF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h49da046bbff55ab4E:
    .cfi_startproc
    movups  (%rsi), %xmm0
    movups  16(%rsi), %xmm1
    movaps  %xmm0, %xmm2
    addps   %xmm1, %xmm2
    subps   %xmm1, %xmm0
    shufps  $180, %xmm0, %xmm0
    xorps   (%rdi), %xmm0
    movaps  %xmm2, %xmm1
    movlhps %xmm0, %xmm1
    movhlps %xmm2, %xmm0
    movaps  %xmm1, %xmm2
    addps   %xmm0, %xmm2
    subps   %xmm0, %xmm1
    movups  %xmm2, (%rcx)
    movups  %xmm1, 16(%rcx)
    retq
.Lfunc_end139:

Neon:

_ZN7rustfft4neon16neon_butterflies26NeonF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h955e89a7bbfe2845E:
    .cfi_startproc
    ldp q0, q1, [x1]
    ldr d2, [x0, #16]
    fadd    v3.4s, v0.4s, v1.4s
    fsub    v0.4s, v0.4s, v1.4s
    ext v1.16b, v0.16b, v0.16b, #8
    rev64   v1.2s, v1.2s
    eor v1.8b, v1.8b, v2.8b
    mov v2.16b, v3.16b
    mov v2.d[1], v0.d[0]
    mov v0.d[1], v1.d[0]
    ext v0.16b, v0.16b, v3.16b, #8
    ext v0.16b, v3.16b, v0.16b, #8
    fadd    v1.4s, v2.4s, v0.4s
    fsub    v0.4s, v2.4s, v0.4s
    stp q1, q0, [x3]
    ret
.Lfunc_end178:

And just for fun, the scalar x86_64:

_ZN7rustfft9algorithm11butterflies19Butterfly4$LT$T$GT$22perform_fft_contiguous17h6c658b75f901ae2cE:
    .cfi_startproc
    movsd   (%rsi), %xmm2
    movsd   8(%rsi), %xmm1
    movsd   16(%rsi), %xmm3
    movsd   24(%rsi), %xmm4
    movaps  %xmm2, %xmm0
    addps   %xmm3, %xmm0
    subps   %xmm3, %xmm2
    movlhps %xmm2, %xmm0
    movaps  %xmm1, %xmm2
    addps   %xmm4, %xmm2
    subps   %xmm4, %xmm1
    movaps  %xmm1, %xmm3
    shufps  $85, %xmm1, %xmm3
    movaps  .LCPI94_0(%rip), %xmm4
    testb   %dil, %dil
    je  .LBB94_1
    xorps   %xmm4, %xmm3
    jmp .LBB94_3
.LBB94_1:
    xorps   %xmm4, %xmm1
.LBB94_3:
    movlhps %xmm3, %xmm1
    shufps  $36, %xmm1, %xmm2
    movaps  %xmm0, %xmm1
    addps   %xmm2, %xmm1
    subps   %xmm2, %xmm0
    movups  %xmm1, (%rcx)
    movups  %xmm0, 16(%rcx)
    retq
.Lfunc_end94:
HEnquist commented 3 years ago

I realized I could run this on a trial Amazon EC2 VM, with the Graviton2 CPU. It's supposedly based on the Cortex A76 core, which is a big upgrade from the A72. It does indeed perform quite a bit better: rustfft_comp_aws

ejmahler commented 3 years ago

How do cpu features work on arm? Does it just compile a fallback? Or does it trigger UB like on x86? I ask because it seems like there’s no way to query for support for new instructions like that

On Sun, Sep 26, 2021 at 5:09 PM Henrik Enquist @.***> wrote:

@.**** commented on this pull request.

In src/neon/neon_utils.rs https://github.com/ejmahler/RustFFT/pull/78#discussion_r716285012:

+} + +// transpose a 2x2 complex matrix given as [x0, x1], [x2, x3] +// result is [x0, x2], [x1, x3] +#[inline(always)] +pub unsafe fn transpose_complex_2x2_f32(left: float32x4_t, right: float32x4_t) -> [float32x4_t; 2] {

  • let temp02 = extract_lo_lo_f32(left, right);
  • let temp13 = extract_hi_hi_f32(left, right);
  • [temp02, temp13] +}
  • +// Complex multiplication. +// Each input contains two complex values, which are multiplied in parallel. +#[inline(always)] +pub unsafe fn mul_complex_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t {

  • // ARMv8.2-A introduced vcmulq_f32 and vcmlaq_f32 for complex multiplication, these intrinsics are not yet available.

I have looked a bit more at these instructions after writing that. They are a quite recent addition from (IIRC) ARMv8.4. So there aren't that many arm chips out there that have them (most are ARMv8.2). It might be better to leave these functions like they are. I don't know if anyone is planning on adding them to Rust.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ejmahler/RustFFT/pull/78#discussion_r716285012, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI2M6WEMSCM3FYBYNJ2X5TUD6Y3ZANCNFSM5EV4RHVQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

HEnquist commented 3 years ago

It seems to work the same on arm, it exits with a SIGILL when I tell rustc to enable for example v8.2a on a cpu that only supports v8-a. There is the is_aarch64_feature_detected macro to check features. The complex floating point math needs the "fcma" feature: https://github.com/rust-lang/stdarch/blob/master/crates/std_detect/src/detect/arch/aarch64.rs#L138

ejmahler commented 3 years ago

Fascinating. Well, at any rate i wouldn't want to add required support for it, if it's that new. Once it stabilizes, I can see doing something like rader's algorithm and avx2 where there's a fallback that doesn't require it.

HEnquist commented 3 years ago

The fixes were merged so now the normal nightly compiler can be used. The current state then is that the neon stuff is completely disabled on anything not aarch64. The stable compiler can be used like usual. On aarch64, by default it's also disabled and compiles on stable. But when enabling the neon feature, the neon code is enabled and it requires the latest nightly compiler.

Would you be ok with releasing a version that has nightly-only stuff hidden behind a feature like that? I noticed there are quite a few crates that to that, for example rand: https://crates.io/crates/rand

HEnquist commented 3 years ago

It seems I was more than a little confused about the VCMUL/VCMLA etc instructions (I blame the messy ARM documentation!). I'm still a little confused, but I think that so far they have only been included as an optional extra in the Cortex M55 meant for embedded applications. Not even the big fancy Cortex X1 has them. So probably not something we should be waiting for! I should probably remove the comment about them.

ejmahler commented 3 years ago

I think we should make a release, and default the neon feature to disabled. Once a stable release has been out for 6months, we can default it to enabled

I’ve already skimmed through it, but I’ll do a more through review in a day or two

On Tue, Sep 28, 2021 at 12:16 PM Henrik Enquist @.***> wrote:

It seems I was more than a little confused about the VCMUL/VCMLA etc instructions (I blame the messy ARM documentation!). I'm still a little confused, but I think that so far they have only been included as an optional extra in the Cortex M55 meant for embedded applications. Not even the big fancy Cortex X1 has them. So probably not something we should be waiting for! I should probably remove the comment about them.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ejmahler/RustFFT/pull/78#issuecomment-929550360, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI2M6S25ORBVO7H6J6KJNTUEIIBNANCNFSM5EV4RHVQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

HEnquist commented 3 years ago

I saw that the neon interleaved load and store instructions got added to the stdarch library. These are quite useful and first results were promising. But then I seem to have hit some bug in rustc, see https://github.com/rust-lang/stdarch/issues/1227. Not sure if stdarch was the right place to file the issue, hopefully someone can give some advice. I can run cargo test, check, etc just fine, but benches crash hard.

I'm assuming that these instructions will make it into the nightly rust builds quite soon. But I have no idea how easy it will be to get the benches running again. I'm leaning against waiting with this PR until this stuff has been sorted out.

This is the branch that uses the new intrinsics: https://github.com/HEnquist/RustFFT/tree/vldx

ejmahler commented 3 years ago

I updated the name of the feature.

I'm satisfied with this PR at this point. I noticed that it's still marked as draft. Do you think it's ready? If so, mark it ready and I'll merge. If you think there's still work to do, no rush.

The one remaining review item I have centers around the pattern of writing let input_packed = read_complex_to_array!(input, {0, 2, 4, 6, 8, 10});

I noticed that in other places in this PR, there's an intrinsic to load/store data in an interleaved way. do you think that could be applied here? It doesn't need to be done as a part of this PR but it could be a future optimization.

HEnquist commented 3 years ago

Very nice! Thanks for merging :) The interleaved loading and storing intrinsics aren't included in the nightly rust builds yet. Probably a good thing, since using them triggers some bug that crashes rustc. I have a nearly complete implementation using them. I was thinking of submitting it as a separate PR once that bug is fixed and the normal nightly releases can be used. The bench results I managed to get from it looked quite promising, but I won't know for sure until I can compile the benches without crashing..

HEnquist commented 2 years ago

Neon on aarch64 will be available on stable rustc from version 1.61! That should be released in May, but is available as nightly already.
Unfortunately the bug that crashes rustc when using interleaved load/store instructions is still there, so implementing that will have to wait (I'm guessing it will give a 5-10% speedup). Does this seem like a reasonable plan?

ejmahler commented 2 years ago

I think that's a good plan. And we can just document that if you want to enable the neon feature, you need rusc 1.61 or newer.

I'm thinking about how to document+test this long-term, once we enable it by default. We could say "rustfft 6.2 requires rustc 1.6x if you're on aarch64 with the 'neon' feature enabled, or rustc 1.37 in all other configurations". Or will it be less confusing to just require 1.6x across the board? I don't know of any other features from recent rust versions that I want to use, but maybe from a user experience perspective it'll be easier for people to wrap their head around than requiring multiple versions.

We'll need to update our testing script to specifically test rustc 1.61 stable with neon enabled.

HEnquist commented 2 years ago

If we want to keep the requirement at 1.37 for everything except aarch64+neon, we could add a simple build script that checks this. Something like this: https://github.com/HEnquist/camilladsp/blob/next100/build.rs That makes it possible to give a clear error message instead of failing with a long list of strange compiler errors. But there can still be some confusion, so just requiring 1.61 across the board might be the better way. Not easy..