Clarify interaction with multiversion

phi-gamma commented 9 months ago

After slapping multiversion attributes on functions that use wide types, I expected the dispatched versions to use AVX2 like they do with packed_simd. In my experiments, however, that’s not the case. LLVM still generates slightly better code than in the default version (with v*pd instructions) but doesn’t use any 256-bit registers. :/ Can packed_simd’s behavior be achieved here in stable code?

Example:

use multiversion::multiversion;

#[cfg(not(feature = "unstable"))]
use wide::f64x4;

#[cfg(feature = "unstable")]
use packed_simd::f64x4;

#[multiversion(targets = "simd")]
#[inline(never)]
fn do_stuff(a: [f64; 4], b: [f64; 4]) -> f64 {
    let c = f64x4::from(a) * f64x4::from(b);

    #[cfg(feature = "unstable")]
    {
        return c.sum();
    }

    #[cfg(not(feature = "unstable"))]
    {
        return c.reduce_add();
    }
}

fn main() {
    let a = [4.0, 5.0, 6.0, 7.0];
    let b = [0.0, 1.0, 2.0, 3.0];
    let x = do_stuff(a, b);
    eprintln!("»»» {}", x);
}

Codegen difference wide vs. packed_simd:

$ cargo asm --att -- do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version >wide.S
    Finished release [optimized] target(s) in 0.02s

$ cargo +nightly asm --att --features=unstable -- do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version >packed_simd.S
    Finished release [optimized] target(s) in 0.07s

$ diff -u wide.S packed_simd.S
--- wide.S  2024-01-04 23:26:42.150031952 +0100
+++ packed_simd.S   2024-01-04 23:26:51.063032188 +0100
@@ -3,15 +3,15 @@
    .type   simd_investigate::do_stuff::do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version,@function
 simd_investigate::do_stuff::do_stuff_avx_avx2_fma_sse_sse2_sse3_sse41_ssse3_version:
    .cfi_startproc
-   vmovupd (%rdi), %xmm0
-   vmovupd 16(%rdi), %xmm1
-   vmulpd (%rsi), %xmm0, %xmm0
-   vmulpd 16(%rsi), %xmm1, %xmm1
-   vunpcklpd %xmm1, %xmm0, %xmm2
-   vxorpd %xmm3, %xmm3, %xmm3
-   vaddpd %xmm3, %xmm2, %xmm2
-   vunpckhpd %xmm1, %xmm0, %xmm0
-   vaddpd %xmm2, %xmm0, %xmm0
-   vshufpd $1, %xmm0, %xmm0, %xmm1
-   vaddsd %xmm1, %xmm0, %xmm0
+   vmovupd (%rdi), %ymm0
+   vmulpd (%rsi), %ymm0, %ymm0
+   vxorpd %xmm1, %xmm1, %xmm1
+   vaddsd %xmm1, %xmm0, %xmm1
+   vshufpd $1, %xmm0, %xmm0, %xmm2
+   vaddsd %xmm2, %xmm1, %xmm1
+   vextractf128 $1, %ymm0, %xmm0
+   vaddsd %xmm0, %xmm1, %xmm1
+   vshufpd $1, %xmm0, %xmm0, %xmm0
+   vaddsd %xmm0, %xmm1, %xmm0
+   vzeroupper
    retq

Lokathor commented 9 months ago

Internally, wide uses safe_arch and compile-time configuration to pick what to do. I don't think that interacts well with how the multiversion crate works, since the AVX version will still see only the SSE cfg and pick the SSE functions.
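
To illustrate the point about compile-time configuration, here is a minimal sketch (not code from wide, safe_arch, or multiversion; all function names are hypothetical). It assumes an x86_64 target and shows that `cfg!(target_feature = ...)` is answered from the crate-wide compiler flags, even inside a function that carries `#[target_feature(enable = "avx2")]`, which is roughly the situation a multiversioned clone is in.

```rust
/// Crate-wide, compile-time answer: depends only on the flags the crate
/// was built with (for example `-C target-feature=+avx2`).
fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// A stand-in for a dispatched AVX2 clone. LLVM may use AVX2 for the body,
/// but `cfg!` is resolved before that and does not see the attribute.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn cfg_seen_inside_avx2_clone() -> bool {
    cfg!(target_feature = "avx2")
}

/// Returns `Some((crate_wide, inside_clone))` when the CPU supports AVX2,
/// `None` otherwise (or on non-x86_64 targets).
fn compare_views() -> Option<(bool, bool)> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            let inside = unsafe { cfg_seen_inside_avx2_clone() };
            return Some((compiled_with_avx2(), inside));
        }
    }
    None
}
```

Both values in the tuple always agree, so any code path selected with `cfg` inside the clone is the same one the default build would pick.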

phi-gamma commented 9 months ago

A bit disappointing but it makes sense, thanks! Back to writing unsafe code :/

Lokathor commented 9 months ago

If you use the nightly core::simd API, it should behave more like what you want.

phi-gamma commented 9 months ago

Yeah, switching to nightly would solve most of my problems, but unfortunately that isn’t an option for my use case.

I wonder if there’s a way to make wide dispatch depending on is_x86_feature_detected!.

Lokathor commented 9 months ago

The basic problem is that wide has tons and tons of "small" functions, and having each of them do a branch adds up very quickly.

I'm happy to add new types and/or new methods if you come up with anything solid though.
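
As a rough sketch of what a per-call check would look like (this is not wide's code; `mul_checked`, `mul_avx`, and `mul_scalar` are hypothetical names), the small operation below tests for AVX with `is_x86_feature_detected!` on every call. The macro caches its result, but each call still pays for a load and a branch, and the `#[target_feature]` function generally cannot be inlined into callers built without that feature.

```rust
/// AVX path: one 256-bit multiply over all four lanes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn mul_avx(a: &[f64; 4], b: &[f64; 4]) -> [f64; 4] {
    use std::arch::x86_64::{_mm256_loadu_pd, _mm256_mul_pd, _mm256_storeu_pd};
    let mut out = [0.0f64; 4];
    // SAFETY: all pointers come from valid 4-element f64 arrays, and the
    // unaligned load/store intrinsics have no alignment requirement.
    unsafe {
        let va = _mm256_loadu_pd(a.as_ptr());
        let vb = _mm256_loadu_pd(b.as_ptr());
        _mm256_storeu_pd(out.as_mut_ptr(), _mm256_mul_pd(va, vb));
    }
    out
}

/// Portable fallback.
fn mul_scalar(a: &[f64; 4], b: &[f64; 4]) -> [f64; 4] {
    [a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]]
}

/// Lane-wise multiply that checks the CPU on every single call.
fn mul_checked(a: &[f64; 4], b: &[f64; 4]) -> [f64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just confirmed at runtime.
            return unsafe { mul_avx(a, b) };
        }
    }
    mul_scalar(a, b)
}
```

With the inputs from the example at the top, `mul_checked(&[4.0, 5.0, 6.0, 7.0], &[0.0, 1.0, 2.0, 3.0])` yields `[0.0, 5.0, 12.0, 21.0]` on either path.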

phi-gamma commented 9 months ago

Dispatch with multiversion is done once; afterwards it only costs one relaxed load. But yeah, that’s gonna be tricky to reconcile with the static type dispatch in wide.
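
The "dispatch once, then one relaxed load" pattern can be sketched as follows. This is only an illustration under assumptions, not multiversion's actual implementation; `DOT_IMPL`, `select`, `dot`, and the other names are hypothetical. The first call picks an implementation via runtime feature detection and caches the function pointer in an atomic; later calls only load the pointer with `Ordering::Relaxed` and call through it.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

type DotFn = fn(&[f64; 4], &[f64; 4]) -> f64;

/// Cached implementation pointer; null means "not selected yet".
static DOT_IMPL: AtomicPtr<()> = AtomicPtr::new(std::ptr::null_mut());

fn dot_scalar(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn dot_avx_inner(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    use std::arch::x86_64::{_mm256_loadu_pd, _mm256_mul_pd, _mm256_storeu_pd};
    let mut prod = [0.0f64; 4];
    // SAFETY: pointers come from valid 4-element f64 arrays.
    unsafe {
        let v = _mm256_mul_pd(_mm256_loadu_pd(a.as_ptr()), _mm256_loadu_pd(b.as_ptr()));
        _mm256_storeu_pd(prod.as_mut_ptr(), v);
    }
    prod.iter().sum()
}

/// Safe wrapper so the AVX path fits the plain `DotFn` pointer type.
#[cfg(target_arch = "x86_64")]
fn dot_avx(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    // SAFETY: `select` only hands out this function after detecting AVX.
    unsafe { dot_avx_inner(a, b) }
}

/// Runs the feature detection; intended to execute on the first call only.
fn select() -> DotFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            return dot_avx;
        }
    }
    dot_scalar
}

/// Public entry point: one relaxed load per call once the pointer is cached.
pub fn dot(a: &[f64; 4], b: &[f64; 4]) -> f64 {
    let mut p = DOT_IMPL.load(Ordering::Relaxed);
    if p.is_null() {
        // Racing threads may both get here; they store the same value.
        p = select() as *const () as *mut ();
        DOT_IMPL.store(p, Ordering::Relaxed);
    }
    // SAFETY: `p` is non-null and was created from a `DotFn` above.
    let f: DotFn = unsafe { std::mem::transmute::<*mut (), DotFn>(p) };
    f(a, b)
}
```

The check happens at the granularity of `dot`, a whole kernel, rather than inside each vector operation, which is why the cost is paid once per call of the outer function instead of once per SIMD primitive.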

Lokathor / wide

Clarify interaction with multiversion #146