HEnquist closed this pull request 3 years ago.
Compared to the scalar version, on a Raspberry Pi 4.
This is great! If this depends on things that are in nightly, it will be quite a while before we can include this PR in a release. But I definitely am interested in getting it in.
I’m very curious to see how a more powerful machine handles it. I’m also interested to see how the assembly differs between x64 and arm targets.
Yes, I would also like to see how this performs on a more powerful CPU. I'm hoping to find something I can borrow, a Mac with the M1 chip for example.
The need for nightly to use Neon will probably remain for some time. There is a tracking issue here: https://github.com/rust-lang/rust/issues/48556
The assembly contains the expected instructions, but I haven't compared it to the SSE version. I'll do that and add the results here!
I tried with the float32 4-point butterfly.
SSE:
_ZN7rustfft3sse15sse_butterflies25SseF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h49da046bbff55ab4E:
.cfi_startproc
movups (%rsi), %xmm0
movups 16(%rsi), %xmm1
movaps %xmm0, %xmm2
addps %xmm1, %xmm2
subps %xmm1, %xmm0
shufps $180, %xmm0, %xmm0
xorps (%rdi), %xmm0
movaps %xmm2, %xmm1
movlhps %xmm0, %xmm1
movhlps %xmm2, %xmm0
movaps %xmm1, %xmm2
addps %xmm0, %xmm2
subps %xmm0, %xmm1
movups %xmm2, (%rcx)
movups %xmm1, 16(%rcx)
retq
.Lfunc_end139:
Neon:
_ZN7rustfft4neon16neon_butterflies26NeonF32Butterfly4$LT$T$GT$22perform_fft_contiguous17h955e89a7bbfe2845E:
.cfi_startproc
ldp q0, q1, [x1]
ldr d2, [x0, #16]
fadd v3.4s, v0.4s, v1.4s
fsub v0.4s, v0.4s, v1.4s
ext v1.16b, v0.16b, v0.16b, #8
rev64 v1.2s, v1.2s
eor v1.8b, v1.8b, v2.8b
mov v2.16b, v3.16b
mov v2.d[1], v0.d[0]
mov v0.d[1], v1.d[0]
ext v0.16b, v0.16b, v3.16b, #8
ext v0.16b, v3.16b, v0.16b, #8
fadd v1.4s, v2.4s, v0.4s
fsub v0.4s, v2.4s, v0.4s
stp q1, q0, [x3]
ret
.Lfunc_end178:
And just for fun, the scalar x86_64:
_ZN7rustfft9algorithm11butterflies19Butterfly4$LT$T$GT$22perform_fft_contiguous17h6c658b75f901ae2cE:
.cfi_startproc
movsd (%rsi), %xmm2
movsd 8(%rsi), %xmm1
movsd 16(%rsi), %xmm3
movsd 24(%rsi), %xmm4
movaps %xmm2, %xmm0
addps %xmm3, %xmm0
subps %xmm3, %xmm2
movlhps %xmm2, %xmm0
movaps %xmm1, %xmm2
addps %xmm4, %xmm2
subps %xmm4, %xmm1
movaps %xmm1, %xmm3
shufps $85, %xmm1, %xmm3
movaps .LCPI94_0(%rip), %xmm4
testb %dil, %dil
je .LBB94_1
xorps %xmm4, %xmm3
jmp .LBB94_3
.LBB94_1:
xorps %xmm4, %xmm1
.LBB94_3:
movlhps %xmm3, %xmm1
shufps $36, %xmm1, %xmm2
movaps %xmm0, %xmm1
addps %xmm2, %xmm1
subps %xmm2, %xmm0
movups %xmm1, (%rcx)
movups %xmm0, 16(%rcx)
retq
.Lfunc_end94:
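To make the listings above easier to follow, here is a scalar sketch of the math a 4-point FFT butterfly performs. This is an illustration only, not the actual RustFFT code; the type and function names are made up for the example.

```rust
// Scalar model of a forward 4-point FFT butterfly (f32).
#[derive(Clone, Copy, PartialEq, Debug)]
struct C {
    re: f32,
    im: f32,
}

impl C {
    fn add(self, o: C) -> C { C { re: self.re + o.re, im: self.im + o.im } }
    fn sub(self, o: C) -> C { C { re: self.re - o.re, im: self.im - o.im } }
    // Multiply by -i, the only non-trivial size-4 twiddle factor.
    fn mul_neg_i(self) -> C { C { re: self.im, im: -self.re } }
}

fn butterfly4(x: [C; 4]) -> [C; 4] {
    // Radix-2 stage over the even- and odd-indexed inputs.
    let t0 = x[0].add(x[2]);
    let t1 = x[0].sub(x[2]);
    let t2 = x[1].add(x[3]);
    let t3 = x[1].sub(x[3]).mul_neg_i();
    // Final combine stage.
    [t0.add(t2), t1.add(t3), t0.sub(t2), t1.sub(t3)]
}

fn main() {
    // A unit impulse transforms to an all-ones spectrum.
    let one = C { re: 1.0, im: 0.0 };
    let zero = C { re: 0.0, im: 0.0 };
    assert_eq!(butterfly4([one, zero, zero, zero]), [one; 4]);
}
```

The SIMD versions compute exactly these eight complex additions/subtractions, but pack two complex f32 values per 128-bit register, which is where the shuffle/ext instructions in the listings come from.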
I realized I could run this on a trial Amazon EC2 VM, with the Graviton2 CPU. It's supposedly based on the Cortex A76 core, which is a big upgrade from the A72. It does indeed perform quite a bit better:
How do cpu features work on arm? Does it just compile a fallback, or does it trigger UB like on x86? I ask because it seems like there's no way to query for support for new instructions like that.
In src/neon/neon_utils.rs https://github.com/ejmahler/RustFFT/pull/78#discussion_r716285012:
+}
+
+// transpose a 2x2 complex matrix given as [x0, x1], [x2, x3]
+// result is [x0, x2], [x1, x3]
+#[inline(always)]
+pub unsafe fn transpose_complex_2x2_f32(left: float32x4_t, right: float32x4_t) -> [float32x4_t; 2] {
+    let temp02 = extract_lo_lo_f32(left, right);
+    let temp13 = extract_hi_hi_f32(left, right);
+    [temp02, temp13]
+}
+
+// Complex multiplication.
+// Each input contains two complex values, which are multiplied in parallel.
+#[inline(always)]
+pub unsafe fn mul_complex_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t {
+    // ARMv8.2-A introduced vcmulq_f32 and vcmlaq_f32 for complex multiplication, these intrinsics are not yet available.
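For reference, here is a plain-Rust model of what a helper like mul_complex_f32 computes. This only illustrates the semantics, not the SIMD implementation: each 128-bit Neon vector holds two complex f32 values packed as [re0, im0, re1, im1], and both products are formed in parallel.

```rust
// Plain-Rust model of packed complex multiplication: two complex f32
// values per "vector", stored interleaved as [re0, im0, re1, im1].
fn mul_complex(left: [f32; 4], right: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..2 {
        let (ar, ai) = (left[2 * i], left[2 * i + 1]);
        let (br, bi) = (right[2 * i], right[2 * i + 1]);
        out[2 * i] = ar * br - ai * bi;     // real part
        out[2 * i + 1] = ar * bi + ai * br; // imaginary part
    }
    out
}

fn main() {
    // (1 + 2i) * (3 + 4i) = -5 + 10i, and (3 + 4i) * (1 + i) = -1 + 7i.
    assert_eq!(
        mul_complex([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 1.0, 1.0]),
        [-5.0, 10.0, -1.0, 7.0]
    );
}
```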
I have looked a bit more at these instructions since writing that. They are a quite recent addition, from (IIRC) ARMv8.4, so there aren't that many arm chips out there that have them (most are ARMv8.2). It might be better to leave these functions as they are. I don't know if anyone is planning on adding them to Rust.
It seems to work the same on arm: it exits with a SIGILL when I tell rustc to enable for example v8.2a on a cpu that only supports v8-a.
There is the is_aarch64_feature_detected macro to check features. The complex floating point math needs the "fcma" feature: https://github.com/rust-lang/stdarch/blob/master/crates/std_detect/src/detect/arch/aarch64.rs#L138
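A minimal sketch of how such a runtime check could look, assuming a toolchain where std::arch::is_aarch64_feature_detected! is available (the function name here is made up for the example):

```rust
// On aarch64, ask the OS/CPU whether the "fcma" feature (vcmulq_f32,
// vcmlaq_f32, etc.) is present at runtime.
#[cfg(target_arch = "aarch64")]
fn has_fcma() -> bool {
    std::arch::is_aarch64_feature_detected!("fcma")
}

// On every other architecture the feature is simply reported as absent.
#[cfg(not(target_arch = "aarch64"))]
fn has_fcma() -> bool {
    false
}

fn main() {
    if has_fcma() {
        println!("fcma available: could dispatch to the fused complex-multiply path");
    } else {
        println!("fcma not available: use the generic complex multiply");
    }
}
```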
Fascinating. Well, at any rate I wouldn't want to make support for it required, if it's that new. Once it stabilizes, I can see doing something like Rader's algorithm and AVX2, where there's a fallback that doesn't require it.
The fixes were merged, so now the normal nightly compiler can be used. The current state is that the Neon code is completely disabled on anything that is not aarch64, and the stable compiler can be used as usual. On aarch64 it is also disabled by default and compiles on stable. But when the neon feature is enabled, the Neon code is compiled in and requires the latest nightly compiler.
Would you be ok with releasing a version that has nightly-only stuff hidden behind a feature like that? I noticed there are quite a few crates that do that, for example rand: https://crates.io/crates/rand
It seems I was more than a little confused about the VCMUL/VCMLA etc instructions (I blame the messy ARM documentation!). I'm still a little confused, but I think that so far they have only been included as an optional extra in the Cortex M55 meant for embedded applications. Not even the big fancy Cortex X1 has them. So probably not something we should be waiting for! I should probably remove the comment about them.
I think we should make a release, and default the neon feature to disabled. Once a stable release has been out for 6 months, we can default it to enabled.
I've already skimmed through it, but I'll do a more thorough review in a day or two.
I saw that the Neon interleaved load and store instructions got added to the stdarch library. These are quite useful, and the first results were promising. But then I seem to have hit some bug in rustc, see https://github.com/rust-lang/stdarch/issues/1227. I'm not sure if stdarch was the right place to file the issue; hopefully someone can give some advice. I can run cargo test, check, etc. just fine, but the benches crash hard.
I'm assuming that these instructions will make it into the nightly Rust builds quite soon, but I have no idea how easy it will be to get the benches running again. I'm leaning towards going ahead with this PR rather than waiting for that to be sorted out.
This is the branch that uses the new intrinsics: https://github.com/HEnquist/RustFFT/tree/vldx
I updated the name of the feature.
I'm satisfied with this PR at this point. I noticed that it's still marked as draft. Do you think it's ready? If so, mark it ready and I'll merge. If you think there's still work to do, no rush.
The one remaining review item I have centers around the pattern of writing let input_packed = read_complex_to_array!(input, {0, 2, 4, 6, 8, 10});
I noticed that in other places in this PR, there's an intrinsic to load/store data in an interleaved way. Do you think that could be applied here? It doesn't need to be done as part of this PR, but it could be a future optimization.
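To illustrate what such an interleaved load does semantically (written in plain Rust, since the Neon vld2-style intrinsics themselves were still in flux at this point): a buffer of interleaved complex values is split into separate real and imaginary lanes in one step, replacing the scattered index reads of the macro above.

```rust
// Plain-Rust sketch of an interleaved load such as Neon's vld2q_f32:
// [re0, im0, re1, im1, ...] is split into a real lane and an imaginary lane.
fn deinterleave(buf: &[f32; 8]) -> ([f32; 4], [f32; 4]) {
    let mut re = [0.0f32; 4];
    let mut im = [0.0f32; 4];
    for i in 0..4 {
        re[i] = buf[2 * i];
        im[i] = buf[2 * i + 1];
    }
    (re, im)
}

fn main() {
    let buf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let (re, im) = deinterleave(&buf);
    assert_eq!(re, [1.0, 3.0, 5.0, 7.0]);
    assert_eq!(im, [2.0, 4.0, 6.0, 8.0]);
}
```

The hardware instruction does this in a single load, which is why it is attractive compared to loading and then shuffling.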
Very nice! Thanks for merging :) The interleaved loading and storing intrinsics aren't included in the nightly Rust builds yet. That's probably a good thing, since using them triggers some bug that crashes rustc. I have a nearly complete implementation using them, and I was thinking of submitting it as a separate PR once that bug is fixed and the normal nightly releases can be used. The bench results I managed to get from it looked quite promising, but I won't know for sure until I can compile the benches without crashing.
Neon on aarch64 will be available on stable rustc from version 1.61!
That should be released in May, but it is already available as a nightly.
Unfortunately the bug that crashes rustc when using interleaved load/store instructions is still there, so implementing that will have to wait (I'm guessing it will give a 5-10% speedup).
Does this seem like a reasonable plan?
- Rename the neon-nightly feature to just neon.
- Make a new release as soon as I can, with the neon feature disabled by default.
- Once rustc 1.61 is released, enable neon by default.

I think that's a good plan. And we can just document that if you want to enable the neon feature, you need rustc 1.61 or newer.
I'm thinking about how to document and test this long-term, once we enable it by default. We could say "rustfft 6.2 requires rustc 1.6x if you're on aarch64 with the 'neon' feature enabled, or rustc 1.37 in all other configurations". Or would it be less confusing to just require 1.6x across the board? I don't know of any other features from recent Rust versions that I want to use, but from a user experience perspective, a single version requirement may be easier for people to wrap their heads around.
We'll need to update our testing script to specifically test rustc 1.61 stable with neon enabled.
If we want to keep the requirement at 1.37 for everything except aarch64+neon, we could add a simple build script that checks this. Something like this: https://github.com/HEnquist/camilladsp/blob/next100/build.rs That makes it possible to give a clear error message instead of failing with a long list of strange compiler errors. But there can still be some confusion, so just requiring 1.61 across the board might be the better way. Not an easy call.
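A hypothetical build.rs along those lines could look like this (a sketch, not the linked camilladsp script; the version parsing is deliberately simplified, and Cargo exposes enabled features to build scripts as CARGO_FEATURE_* environment variables):

```rust
use std::env;
use std::process::Command;

// Parse the minor version out of a `rustc --version` line like
// "rustc 1.61.0 (fe5b13d68 2022-05-18)". Returns None if it doesn't parse.
fn minor_version(version_line: &str) -> Option<u32> {
    let semver = version_line.split_whitespace().nth(1)?;
    semver.split('.').nth(1)?.parse().ok()
}

fn main() {
    let out = Command::new("rustc")
        .arg("--version")
        .output()
        .expect("failed to run rustc");
    let line = String::from_utf8_lossy(&out.stdout).to_string();
    let minor = minor_version(&line).unwrap_or(0);
    // Cargo sets CARGO_FEATURE_NEON when the `neon` feature is enabled.
    let neon_enabled = env::var("CARGO_FEATURE_NEON").is_ok();
    if neon_enabled && minor < 61 {
        panic!(
            "the `neon` feature requires rustc 1.61 or newer, found: {}",
            line.trim()
        );
    }
}
```

Failing with one clear panic message in the build script is much friendlier than the wall of unresolved-intrinsic errors an old compiler would otherwise produce.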
Now that all the needed intrinsics are available, it's time for some Neon! This is basically a direct translation of the SSE code.

Running on a Cortex A72 (a Raspberry Pi 4), I get a speedup of about 50% for f32, and none for f64. The Neon unit of the A72 can only execute a single 128-bit operation at a time, but it can do two f64 operations in parallel, meaning there isn't really any advantage to Neon here. More advanced cores should do better.

To build this, you need a compiler that has this merged: https://github.com/rust-lang/rust/pull/89145 The reason is here: https://github.com/rust-lang/stdarch/issues/1220

Once the latest nightly can be used, I'll add a CI job.