Lokathor / wide

A crate to help you go wide. By which I mean use SIMD stuff.
https://docs.rs/wide
zlib License

Clarify interaction with multiversion #146

Closed. phi-gamma closed this issue 9 months ago

phi-gamma commented 9 months ago

After slapping multiversion attributes on functions that use wide types, I expected the dispatched versions to use AVX2 like they do with packed_simd. In my experiments, however, that’s not the case. LLVM still generates slightly better code than in the default version (with v*pd instructions), but it doesn’t use any 256-bit registers. :/ Can packed_simd’s behavior be achieved here in stable code?

Example:

use multiversion::multiversion;

#[cfg(not(feature = "unstable"))]
use wide::f64x4;

#[cfg(feature = "unstable")]
use packed_simd::f64x4;

#[multiversion(targets = "simd")]
#[inline(never)]
fn do_stuff(a: [f64; 4], b: [f64; 4]) -> f64 {
    let c = f64x4::from(a) * f64x4::from(b);

    #[cfg(feature = "unstable")]
    {
        return c.sum();
    }

    #[cfg(not(feature = "unstable"))]
    {
        return c.reduce_add();
    }
}

fn main() {
    let a = [4.0, 5.0, 6.0, 7.0];
    let b = [0.0, 1.0, 2.0, 3.0];
    let x = do_stuff(a, b);
    eprintln!("»»» {}", x);
}

Codegen difference wide vs. packed_simd:

$ cargo asm --att -- do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version >wide.S
    Finished release [optimized] target(s) in 0.02s

$ cargo +nightly asm --att --features=unstable -- do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version >packed_simd.S
    Finished release [optimized] target(s) in 0.07s

$ diff -u wide.S packed_simd.S
--- wide.S  2024-01-04 23:26:42.150031952 +0100
+++ packed_simd.S   2024-01-04 23:26:51.063032188 +0100
@@ -3,15 +3,15 @@
    .type   simd_investigate::do_stuff::do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version,@function
 simd_investigate::do_stuff::do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version:
    .cfi_startproc
-   vmovupd (%rdi), %xmm0
-   vmovupd 16(%rdi), %xmm1
-   vmulpd (%rsi), %xmm0, %xmm0
-   vmulpd 16(%rsi), %xmm1, %xmm1
-   vunpcklpd %xmm1, %xmm0, %xmm2
-   vxorpd %xmm3, %xmm3, %xmm3
-   vaddpd %xmm3, %xmm2, %xmm2
-   vunpckhpd %xmm1, %xmm0, %xmm0
-   vaddpd %xmm2, %xmm0, %xmm0
-   vshufpd $1, %xmm0, %xmm0, %xmm1
-   vaddsd %xmm1, %xmm0, %xmm0
+   vmovupd (%rdi), %ymm0
+   vmulpd (%rsi), %ymm0, %ymm0
+   vxorpd %xmm1, %xmm1, %xmm1
+   vaddsd %xmm1, %xmm0, %xmm1
+   vshufpd $1, %xmm0, %xmm0, %xmm2
+   vaddsd %xmm2, %xmm1, %xmm1
+   vextractf128 $1, %ymm0, %xmm0
+   vaddsd %xmm0, %xmm1, %xmm1
+   vshufpd $1, %xmm0, %xmm0, %xmm0
+   vaddsd %xmm0, %xmm1, %xmm0
+   vzeroupper
    retq

Lokathor commented 9 months ago

Internally, wide uses safe_arch and compile-time configuration (cfg) to pick what to do. I don't think that interacts well with how the multiversion crate works: the AVX version of a function will still see only the SSE cfg and pick the SSE functions.
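
To illustrate the point about compile-time configuration, here is a minimal sketch (not code from wide or safe_arch). It assumes an x86_64 target, and the macro `cfg_selected_path!` and both functions are hypothetical names invented for the example. The `cfg!` checks are resolved once for the whole crate, so a function marked `#[target_feature(enable = "avx")]` takes the same branch as any other function.

```rust
// Hypothetical stand-in for a library that picks its code path with cfg.
// cfg!(...) is resolved at compile time from the crate-wide target features.
macro_rules! cfg_selected_path {
    () => {
        if cfg!(target_feature = "avx") {
            "avx"
        } else if cfg!(target_feature = "sse2") {
            "sse2"
        } else {
            "other"
        }
    };
}

// An ordinary function: sees the crate-wide configuration.
fn path_in_plain_fn() -> &'static str {
    cfg_selected_path!()
}

// A function compiled with AVX enabled: the attribute changes code
// generation for this function, but not what cfg!(...) evaluates to.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn path_in_avx_fn() -> &'static str {
    cfg_selected_path!()
}

fn main() {
    println!("plain fn sees: {}", path_in_plain_fn());

    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just verified at run time.
            let seen = unsafe { path_in_avx_fn() };
            println!("avx fn sees:   {}", seen);
        }
    }
}
```

Both lines should report the same path. On a default x86_64 build that would be the SSE2 one, which matches the xmm-only code in the diff above. Building the whole crate with AVX enabled (for example via `-C target-feature=+avx`) changes both at once.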

phi-gamma commented 9 months ago

A bit disappointing but it makes sense, thanks! Back to writing unsafe code :/

Lokathor commented 9 months ago

If you use the nightly core::simd API, it should behave more like what you want.

phi-gamma commented 9 months ago

Yeah, switching to nightly would solve most of my problems, but unfortunately that isn’t an option for my use case.

I wonder if there’s a way to make wide dispatch depending on is_x86_feature_detected!.
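
For reference, a rough sketch of what dispatching on `is_x86_feature_detected!` can look like on stable Rust when it is done once at the entry point rather than inside every small operation. This is not wide's API: `dot4`, `dot4_body` and `dot4_avx2` are made-up names, plain arrays stand in for `f64x4`, and the sketch relies on the compiler auto-vectorizing the inlined body, which is not guaranteed.

```rust
// Shared scalar body. #[inline(always)] lets it be inlined into the
// AVX2 wrapper below, where the compiler may use 256-bit instructions.
#[inline(always)]
fn dot4_body(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Same body, compiled with AVX2 enabled for this function only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot4_avx2(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    dot4_body(a, b)
}

// Entry point: one run-time check, then the whole kernel runs in
// whichever version was chosen.
fn dot4(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at run time.
            return unsafe { dot4_avx2(a, b) };
        }
    }
    dot4_body(a, b)
}

fn main() {
    // Same inputs as the example in the issue.
    let a = [4.0, 5.0, 6.0, 7.0];
    let b = [0.0, 1.0, 2.0, 3.0];
    println!("{}", dot4(&a, &b));
}
```

The branch sits at the outer function, so its cost is paid once per call to `dot4` and not once per vector operation.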

Lokathor commented 9 months ago

The basic problem is that wide has tons and tons of "small" functions, and having each of them do a branch adds up very quickly.

I'm happy to add new types and/or new methods if you come up with anything solid though.

phi-gamma commented 9 months ago

Dispatch with multiversion is done once; afterwards it only costs one relaxed load. But yeah, that’s gonna be tricky to reconcile with the static type dispatch in wide.
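
The "detect once, then one relaxed load" idea can be sketched roughly as follows. This is an illustration of the general pattern only, not multiversion's actual implementation. All names (`CHOICE`, `detect`, `sum4` and so on) are hypothetical, and an x86_64 target is assumed for the AVX2 path.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Tags for the cached decision (hypothetical names).
const UNKNOWN: u8 = 0;
const FALLBACK: u8 = 1;
const AVX2: u8 = 2;

// Cached result of feature detection, shared by all callers.
static CHOICE: AtomicU8 = AtomicU8::new(UNKNOWN);

#[inline(always)]
fn sum4_body(v: &[f64; 4]) -> f64 {
    v.iter().sum()
}

// Same body, compiled with AVX2 enabled for this function only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum4_avx2(v: &[f64; 4]) -> f64 {
    sum4_body(v)
}

// Run-time feature detection; only needed until the result is cached.
fn detect() -> u8 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return AVX2;
        }
    }
    FALLBACK
}

fn sum4(v: &[f64; 4]) -> f64 {
    // Fast path: a single relaxed load once the choice is cached.
    let mut choice = CHOICE.load(Ordering::Relaxed);
    if choice == UNKNOWN {
        // Racing threads compute the same answer, so Relaxed suffices.
        choice = detect();
        CHOICE.store(choice, Ordering::Relaxed);
    }
    match choice {
        // SAFETY: AVX2 is only cached after detection confirmed it.
        #[cfg(target_arch = "x86_64")]
        AVX2 => unsafe { sum4_avx2(v) },
        _ => sum4_body(v),
    }
}

fn main() {
    println!("{}", sum4(&[4.0, 5.0, 6.0, 7.0]));
}
```

The load and branch are cheap at the level of a whole kernel, but would still be paid on every call if they were placed inside each small vector method.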