ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

[Rust Bindings] Poor performance VS ndarray (BLAS) and optimized iteration impls #107

Closed: ChillFish8 closed this issue 2 months ago

ChillFish8 commented 7 months ago

Recently we've been implementing some spatial distance functions and benchmarking them against some existing libraries. When testing with high-dimensional data (1024 dims), we observe simsimd taking 619 ns per vector on average, compared to 43 ns for ndarray (when backed by OpenBLAS), or 234 ns / 95 ns for an optimized bit of pure Rust with ffast-math-like intrinsics disabled/enabled respectively.

These benchmarks are taken with Criterion performing 1,000 vector ops per iteration, to account for any clock-accuracy issues caused by the low per-op nanosecond timings.

dot ndarray 1024 auto   time:   [43.270 µs 43.285 µs 43.302 µs]
Found 17 outliers among 500 measurements (3.40%)
  5 (1.00%) high mild
  12 (2.40%) high severe

Benchmarking dot simsimd 1024 auto: Warming up for 3.0000 s
Warning: Unable to complete 500 samples in 60.0s. You may wish to increase target time to 77.7s, enable flat sampling, or reduce sample count to 310.
dot simsimd 1024 auto   time:   [618.85 µs 619.93 µs 621.15 µs]
Found 43 outliers among 500 measurements (8.60%)
  7 (1.40%) low mild
  17 (3.40%) high mild
  19 (3.80%) high severe

dot fallback 1024 nofma time:   [232.92 µs 234.19 µs 235.76 µs]
Found 16 outliers among 500 measurements (3.20%)
  11 (2.20%) high mild
  5 (1.00%) high severe

dot fallback 1024 fma   time:   [95.456 µs 95.586 µs 95.729 µs]
Found 19 outliers among 500 measurements (3.80%)
  17 (3.40%) high mild
  2 (0.40%) high severe

Notes

Loose benchmark structure (within Criterion)

There is a bit too much code to paste the exact benchmarks, but each benchmark step looks like the following:

fn bench_me(a: &[f32], b: &[f32]) {
   for _ in 0..1_000 {
       black_box(implementation_dot(black_box(a), black_box(b)));
   }
}
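
For context, each of these helpers is driven by Criterion roughly like the minimal sketch below (the full benchmark file appears further down this thread; bench_me stands in for each implementation being measured):

use std::hint::black_box;
use criterion::Criterion;

fn criterion_benchmark(c: &mut Criterion) {
    // 1024-dim random inputs, matching the benchmarks above.
    let a: Vec<f32> = (0..1024).map(|_| rand::random()).collect();
    let b: Vec<f32> = (0..1024).map(|_| rand::random()).collect();
    c.bench_function("dot <impl> 1024", |bench| {
        // bench_me already performs the 1,000 black-boxed calls per iteration.
        bench.iter(|| bench_me(black_box(&a), black_box(&b)))
    });
}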

Pure Rust impl

Below is a fallback impl I've made. For simplicity I've removed the generic that was used to replace regular math operations with their ffast-math equivalents when running the dot fallback 1024 fma benchmark; the asm for dot fallback 1024 nofma, however, is identical.


unsafe fn fallback_dot_product_demo<const DIMS: usize>(
    a: &[f32],
    b: &[f32],
) -> f32 {
    debug_assert_eq!(
        b.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        a.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        DIMS % 8,
        0,
        "DIMS must be able to fit entirely into chunks of 8 lanes."
    );

    let mut i = 0;

    // We do this manual unrolling to allow the compiler to vectorize
    // the loop and avoid some branching, even though we're not vectorizing explicitly.
    // This made a significant difference in benchmarking (~4x).
    let mut acc1 = 0.0;
    let mut acc2 = 0.0;
    let mut acc3 = 0.0;
    let mut acc4 = 0.0;
    let mut acc5 = 0.0;
    let mut acc6 = 0.0;
    let mut acc7 = 0.0;
    let mut acc8 = 0.0;

    while i < a.len() {
        let a1 = *a.get_unchecked(i);
        let a2 = *a.get_unchecked(i + 1);
        let a3 = *a.get_unchecked(i + 2);
        let a4 = *a.get_unchecked(i + 3);
        let a5 = *a.get_unchecked(i + 4);
        let a6 = *a.get_unchecked(i + 5);
        let a7 = *a.get_unchecked(i + 6);
        let a8 = *a.get_unchecked(i + 7);

        let b1 = *b.get_unchecked(i);
        let b2 = *b.get_unchecked(i + 1);
        let b3 = *b.get_unchecked(i + 2);
        let b4 = *b.get_unchecked(i + 3);
        let b5 = *b.get_unchecked(i + 4);
        let b6 = *b.get_unchecked(i + 5);
        let b7 = *b.get_unchecked(i + 6);
        let b8 = *b.get_unchecked(i + 7);

        acc1 = acc1 + (a1 * b1);
        acc2 = acc2 + (a2 * b2);
        acc3 = acc3 + (a3 * b3);
        acc4 = acc4 + (a4 * b4);
        acc5 = acc5 + (a5 * b5);
        acc6 = acc6 + (a6 * b6);
        acc7 = acc7 + (a7 * b7);
        acc8 = acc8 + (a8 * b8);

        i += 8;
    }

    acc1 = acc1 + acc2;
    acc3 = acc3 + acc4;
    acc5 = acc5 + acc6;
    acc7 = acc7 + acc8;

    acc1 = acc1 + acc3;
    acc5 = acc5 + acc7;

    acc1 + acc5
}
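
For contrast (not part of the benchmark set), the naive single-accumulator formulation below is what the manual unrolling works around: a single accumulator fixes the summation order, so under strict IEEE semantics LLVM cannot reassociate the reduction into SIMD lanes the way the eight independent accumulators above allow.

fn naive_dot(a: &[f32], b: &[f32]) -> f32 {
    // One accumulator, one fixed summation order: correct, but it denies
    // the compiler the reassociation freedom the unrolled version grants.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}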
ashvardanian commented 7 months ago

Hi @ChillFish8! Which version of SimSIMD are you using?

AVX2 for float32 is practically the only SIMD+datatype combo we don't implement, as that's the only one that compilers vectorize well 😆 But your result is still very weird. Do you have a project I can clone and run to reproduce that?

ChillFish8 commented 7 months ago

I can't currently give access to the project this is run on, but I can share a copy of the benchmark file minus some of the custom AVX stuff. Realistically, though, it's probably best to just focus on simsimd vs Rust vs BLAS for this issue.

cargo.toml

[package]
name = "benchmark-demo"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

[dev-dependencies]
rand = "0.8.5"
simsimd = "4.2.2"

criterion = { version = "0.5.1", features = ["html_reports"] }

[target.'cfg(unix)'.dev-dependencies]
ndarray = { version = "0.15.6", features = ["blas"] }
blas-src = { version = "0.8", features = ["openblas"] }
openblas-src = { version = "0.10", features = ["cblas", "system"] }

[target.'cfg(not(unix))'.dev-dependencies]
ndarray = "0.15.6"

bench_dot_product.rs

#[cfg(unix)]
extern crate blas_src;

use std::hint::black_box;
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};
use simsimd::SpatialSimilarity;

fn simsimd_dot(a: &[f32], b: &[f32]) -> f32 {
    f32::dot(a, b).unwrap_or_default() as f32
}

fn ndarray_dot(a: &ndarray::Array1<f32>, b: &ndarray::Array1<f32>) -> f32 {
    a.dot(b)
}

unsafe fn fallback_dot_product_demo<const DIMS: usize>(
    a: &[f32],
    b: &[f32],
) -> f32 {
    debug_assert_eq!(
        b.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        a.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        DIMS % 8,
        0,
        "DIMS must be able to fit entirely into chunks of 8 lanes."
    );

    let mut i = 0;

    // We do this manual unrolling to allow the compiler to vectorize
    // the loop and avoid some branching, even though we're not vectorizing explicitly.
    // This made a significant difference in benchmarking (~4x).
    let mut acc1 = 0.0;
    let mut acc2 = 0.0;
    let mut acc3 = 0.0;
    let mut acc4 = 0.0;
    let mut acc5 = 0.0;
    let mut acc6 = 0.0;
    let mut acc7 = 0.0;
    let mut acc8 = 0.0;

    while i < a.len() {
        let a1 = *a.get_unchecked(i);
        let a2 = *a.get_unchecked(i + 1);
        let a3 = *a.get_unchecked(i + 2);
        let a4 = *a.get_unchecked(i + 3);
        let a5 = *a.get_unchecked(i + 4);
        let a6 = *a.get_unchecked(i + 5);
        let a7 = *a.get_unchecked(i + 6);
        let a8 = *a.get_unchecked(i + 7);

        let b1 = *b.get_unchecked(i);
        let b2 = *b.get_unchecked(i + 1);
        let b3 = *b.get_unchecked(i + 2);
        let b4 = *b.get_unchecked(i + 3);
        let b5 = *b.get_unchecked(i + 4);
        let b6 = *b.get_unchecked(i + 5);
        let b7 = *b.get_unchecked(i + 6);
        let b8 = *b.get_unchecked(i + 7);

        acc1 = acc1 + (a1 * b1);
        acc2 = acc2 + (a2 * b2);
        acc3 = acc3 + (a3 * b3);
        acc4 = acc4 + (a4 * b4);
        acc5 = acc5 + (a5 * b5);
        acc6 = acc6 + (a6 * b6);
        acc7 = acc7 + (a7 * b7);
        acc8 = acc8 + (a8 * b8);

        i += 8;
    }

    acc1 = acc1 + acc2;
    acc3 = acc3 + acc4;
    acc5 = acc5 + acc6;
    acc7 = acc7 + acc8;

    acc1 = acc1 + acc3;
    acc5 = acc5 + acc7;

    acc1 + acc5
}

macro_rules! repeat {
    ($n:expr, $val:block) => {{
        for _ in 0..$n {
            black_box($val);
        }
    }};
}

fn criterion_benchmark(c: &mut Criterion) {
    // Hey, this benchmark behaves drastically differently if you are on Windows vs. Unix.
    // This is because on Unix we do a more realistic benchmark and compare ndarray backed
    // by OpenBLAS rather than the standard Rust impl.
    c.bench_function("dot ndarray 1024 auto", |b| {
        use ndarray::Array1;

        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        let v1 = Array1::from_shape_vec((1024,), v1).unwrap();
        let v2 = Array1::from_shape_vec((1024,), v2).unwrap();

        b.iter(|| repeat!(1000, { ndarray_dot(black_box(&v1), black_box(&v2)) }))
    });    
    c.bench_function("dot simsimd 1024 auto", |b| {
        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        b.iter(|| repeat!(1000, { simsimd_dot(black_box(&v1), black_box(&v2)) }))
    });
    c.bench_function("dot fallback 1024 nofma", |b| {
        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        b.iter(|| repeat!(1000, { 
            unsafe { fallback_dot_product_demo::<1024>(black_box(&v1), black_box(&v2)) }
        }))
    });
}

criterion_group!(
    name = benches;
    config = Criterion::default()
        .measurement_time(Duration::from_secs(60))
        .sample_size(500);
    targets = criterion_benchmark
);
criterion_main!(benches);
ChillFish8 commented 7 months ago

To be more specific, the numbers simsimd is getting for AVX2 and f32 values seem to be more or less in line with iterating through the two vectors and computing the dot product without the compiler managing to vectorize the loop correctly. So maybe the compiler is not actually vectorizing simsimd's loop fully, or at all.

ashvardanian commented 7 months ago

The SimSIMD repository contains Rust benchmarks against native implementations. Maybe they are poorly implemented... Can you try cloning the SimSIMD repository and running the benchmarks, as described in CONTRIBUTING.md?

cargo bench

Please check out both the main branch and main-dev. I'd be happy to optimize the kernels further, but I am not sure that is possible. If the issue persists, it might be related to compilation settings 🤗

ChillFish8 commented 7 months ago

Using the repo benches, by default I get:

SIMD Cosine/SimSIMD/0   time:   [990.33 ns 991.20 ns 992.21 ns]
                        change: [-0.5469% -0.1196% +0.1941%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/0
                        time:   [997.99 ns 1.0023 µs 1.0066 µs]
                        change: [+0.8535% +1.1800% +1.5240%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/1   time:   [1.0071 µs 1.0112 µs 1.0159 µs]
                        change: [-0.5979% -0.0751% +0.4182%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
SIMD Cosine/Rust Native/1
                        time:   [995.26 ns 997.31 ns 999.95 ns]
                        change: [-3.4249% -2.3587% -1.4896%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/2   time:   [992.49 ns 993.86 ns 995.36 ns]
                        change: [-0.6670% -0.3172% +0.0164%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/2
                        time:   [999.39 ns 1.0017 µs 1.0040 µs]
                        change: [+0.8312% +1.0924% +1.3528%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/SimSIMD/3   time:   [999.12 ns 1.0029 µs 1.0071 µs]
                        change: [-0.8765% -0.3084% +0.1971%] (p = 0.28 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/3
                        time:   [995.69 ns 997.72 ns 999.69 ns]
                        change: [+0.6852% +0.9139% +1.1508%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
SIMD Cosine/SimSIMD/4   time:   [989.46 ns 991.39 ns 993.36 ns]
                        change: [-2.4808% -1.7419% -1.1702%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
SIMD Cosine/Rust Native/4
                        time:   [984.42 ns 985.22 ns 986.16 ns]
                        change: [-1.9665% -1.4544% -0.9763%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
SIMD Cosine/SimSIMD/5   time:   [984.21 ns 985.94 ns 987.71 ns]
                        change: [-1.6544% -1.1956% -0.8287%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD Cosine/Rust Native/5
                        time:   [987.03 ns 988.30 ns 989.81 ns]
                        change: [+1.0143% +1.1866% +1.3575%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

     Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-1c498acee1c38350)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
                        time:   [964.05 ns 967.69 ns 971.67 ns]
                        change: [-1.6473% -1.2355% -0.8248%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/0
                        time:   [973.53 ns 975.20 ns 977.10 ns]
                        change: [+186.66% +187.37% +188.16%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
SIMD SqEuclidean/SimSIMD/1
                        time:   [952.89 ns 954.25 ns 955.68 ns]
                        change: [-2.9500% -2.5561% -2.2074%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/1
                        time:   [973.70 ns 975.53 ns 977.30 ns]
                        change: [+186.14% +186.69% +187.28%] (p = 0.00 < 0.05)
                        Performance has regressed.
SIMD SqEuclidean/SimSIMD/2
                        time:   [965.95 ns 968.58 ns 971.30 ns]
                        change: [-1.8963% -1.5119% -1.1299%] (p = 0.00 < 0.05)
                        Performance has improved.
SIMD SqEuclidean/Rust Native/2
                        time:   [971.81 ns 973.68 ns 975.83 ns]
                        change: [+181.90% +183.47% +184.85%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/3
                        time:   [957.05 ns 958.81 ns 960.71 ns]
                        change: [-3.2849% -2.8105% -2.3846%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/3
                        time:   [971.49 ns 972.77 ns 974.15 ns]
                        change: [+177.36% +179.33% +181.00%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/4
                        time:   [958.75 ns 962.49 ns 966.77 ns]
                        change: [-2.8413% -2.4086% -2.0098%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
SIMD SqEuclidean/Rust Native/4
                        time:   [977.67 ns 981.15 ns 984.38 ns]
                        change: [+183.37% +184.79% +186.12%] (p = 0.00 < 0.05)
                        Performance has regressed.
SIMD SqEuclidean/SimSIMD/5
                        time:   [957.25 ns 959.29 ns 961.63 ns]
                        change: [-3.4224% -3.1009% -2.8216%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/5
                        time:   [977.04 ns 979.62 ns 982.15 ns]
                        change: [+182.34% +184.11% +185.86%] (p = 0.00 < 0.05)
                        Performance has regressed.
ChillFish8 commented 7 months ago

If I use the changes in PR #108 I get the following:

SIMD Cosine/SimSIMD/0   time:   [995.61 ns 997.99 ns 1.0008 µs]
                        change: [+0.1468% +0.4000% +0.6799%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
SIMD Cosine/Rust Native/0
                        time:   [755.37 ns 758.73 ns 764.37 ns]
                        change: [-24.342% -24.086% -23.766%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/1   time:   [985.11 ns 986.34 ns 987.60 ns]
                        change: [-2.4883% -2.1633% -1.8513%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/Rust Native/1
                        time:   [752.29 ns 754.33 ns 757.00 ns]
                        change: [-25.113% -24.900% -24.675%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/2   time:   [987.52 ns 988.61 ns 989.83 ns]
                        change: [-0.5561% -0.3441% -0.1497%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/Rust Native/2
                        time:   [751.62 ns 752.32 ns 753.19 ns]
                        change: [-25.024% -24.896% -24.770%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
SIMD Cosine/SimSIMD/3   time:   [987.02 ns 988.13 ns 989.34 ns]
                        change: [-1.7928% -1.4180% -1.0880%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/Rust Native/3
                        time:   [751.43 ns 751.82 ns 752.29 ns]
                        change: [-25.020% -24.925% -24.828%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/SimSIMD/4   time:   [989.97 ns 990.71 ns 991.66 ns]
                        change: [-0.0446% +0.1065% +0.2536%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
SIMD Cosine/Rust Native/4
                        time:   [750.46 ns 751.02 ns 751.60 ns]
                        change: [-23.947% -23.833% -23.728%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
SIMD Cosine/SimSIMD/5   time:   [988.47 ns 989.15 ns 989.97 ns]
                        change: [+0.4132% +0.5962% +0.7772%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/5
                        time:   [751.42 ns 752.31 ns 753.38 ns]
                        change: [-24.095% -23.966% -23.843%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

     Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-1c498acee1c38350)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
                        time:   [954.47 ns 956.11 ns 957.70 ns]
                        change: [-1.1162% -0.7026% -0.3014%] (p = 0.00 < 0.05)
                        Change within noise threshold.
SIMD SqEuclidean/Rust Native/0
                        time:   [366.84 ns 367.18 ns 367.53 ns]
                        change: [-62.453% -62.353% -62.261%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/1
                        time:   [946.73 ns 947.48 ns 948.28 ns]
                        change: [-0.9722% -0.8084% -0.6503%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
SIMD SqEuclidean/Rust Native/1
                        time:   [365.67 ns 365.83 ns 366.01 ns]
                        change: [-62.469% -62.396% -62.323%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
SIMD SqEuclidean/SimSIMD/2
                        time:   [947.38 ns 949.31 ns 951.74 ns]
                        change: [-2.0238% -1.7564% -1.4912%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/2
                        time:   [365.85 ns 366.11 ns 366.40 ns]
                        change: [-62.605% -62.540% -62.476%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
SIMD SqEuclidean/SimSIMD/3
                        time:   [952.75 ns 954.40 ns 956.08 ns]
                        change: [-0.7782% -0.3103% +0.1819%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/3
                        time:   [367.71 ns 368.50 ns 369.52 ns]
                        change: [-62.255% -62.179% -62.096%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
SIMD SqEuclidean/SimSIMD/4
                        time:   [946.24 ns 947.68 ns 949.34 ns]
                        change: [-1.3054% -0.9476% -0.5958%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/4
                        time:   [368.88 ns 370.15 ns 371.65 ns]
                        change: [-62.285% -62.067% -61.779%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
SIMD SqEuclidean/SimSIMD/5
                        time:   [954.79 ns 955.77 ns 956.94 ns]
                        change: [-0.1110% +0.1493% +0.4162%] (p = 0.26 > 0.05)
                        No change in performance detected.
SIMD SqEuclidean/Rust Native/5
                        time:   [366.50 ns 366.84 ns 367.23 ns]
                        change: [-62.811% -62.688% -62.566%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe
ChillFish8 commented 7 months ago

The compiler command being run to compile the C code is:

"cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-I" "include" "-O3" "-std=c99" "-pedantic" "-DSIMSIMD_NATIVE_F16=0" "-DSIMSIMD_DYNAMIC_DISPATCH=1" "-DSIMSIMD_TARGET_SAPPHIRE=0" "-o" "/home/personal/simsimd/target/release/build/simsimd-be318405a648c44f/out/c/lib.o" "-c" "c/lib.c"
ChillFish8 commented 7 months ago

If we tell the compiler that AVX2 and FMA can be targeted, we get an even faster version of the native Rust code, but it has no effect on the C side:

RUSTFLAGS="-C target-feature=+avx2,+fma" cargo bench -- --nocapture
SIMD Cosine/SimSIMD/0   time:   [981.74 ns 983.39 ns 985.48 ns]
                        change: [-1.5396% -1.2668% -0.9837%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe
SIMD Cosine/Rust Native/0
                        time:   [130.86 ns 130.95 ns 131.06 ns]
                        change: [-82.739% -82.683% -82.640%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/SimSIMD/1   time:   [983.62 ns 985.05 ns 987.02 ns]
                        change: [-0.5092% -0.3685% -0.2163%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
SIMD Cosine/Rust Native/1
                        time:   [131.07 ns 131.21 ns 131.34 ns]
                        change: [-82.568% -82.529% -82.498%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  6 (6.00%) low severe
  9 (9.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/SimSIMD/2   time:   [981.05 ns 982.28 ns 983.70 ns]
                        change: [-1.0060% -0.8903% -0.7706%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
SIMD Cosine/Rust Native/2
                        time:   [131.01 ns 131.09 ns 131.17 ns]
                        change: [-82.575% -82.548% -82.516%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
SIMD Cosine/SimSIMD/3   time:   [980.46 ns 981.49 ns 982.76 ns]
                        change: [-0.2110% -0.0435% +0.1324%] (p = 0.64 > 0.05)
                        No change in performance detected.
SIMD Cosine/Rust Native/3
                        time:   [130.89 ns 131.03 ns 131.24 ns]
                        change: [-82.550% -82.529% -82.510%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/SimSIMD/4   time:   [978.19 ns 978.80 ns 979.51 ns]
                        change: [-1.2474% -1.1591% -1.0734%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
SIMD Cosine/Rust Native/4
                        time:   [131.07 ns 131.18 ns 131.28 ns]
                        change: [-82.580% -82.562% -82.546%] (p = 0.00 < 0.05)
                        Performance has improved.
SIMD Cosine/SimSIMD/5   time:   [982.41 ns 982.88 ns 983.39 ns]
                        change: [-0.9772% -0.8781% -0.7844%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD Cosine/Rust Native/5
                        time:   [132.08 ns 132.25 ns 132.44 ns]
                        change: [-82.460% -82.416% -82.372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

     Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-789b6d1bba04e87b)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
                        time:   [953.51 ns 955.58 ns 957.60 ns]
                        change: [-0.6461% -0.4286% -0.2139%] (p = 0.00 < 0.05)
                        Change within noise threshold.
SIMD SqEuclidean/Rust Native/0
                        time:   [117.68 ns 120.45 ns 123.87 ns]
                        change: [-68.098% -67.815% -67.421%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe
SIMD SqEuclidean/SimSIMD/1
                        time:   [955.73 ns 963.38 ns 973.22 ns]
                        change: [+0.4900% +0.8694% +1.4353%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
SIMD SqEuclidean/Rust Native/1
                        time:   [116.90 ns 117.05 ns 117.22 ns]
                        change: [-67.916% -67.849% -67.782%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/2
                        time:   [948.83 ns 949.71 ns 950.67 ns]
                        change: [+0.2005% +0.3694% +0.5291%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/2
                        time:   [117.09 ns 117.52 ns 117.91 ns]
                        change: [-68.257% -68.178% -68.101%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/3
                        time:   [965.79 ns 968.94 ns 972.52 ns]
                        change: [+1.0966% +1.6960% +2.2373%] (p = 0.00 < 0.05)
                        Performance has regressed.
SIMD SqEuclidean/Rust Native/3
                        time:   [118.14 ns 118.67 ns 119.21 ns]
                        change: [-68.157% -68.036% -67.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
SIMD SqEuclidean/SimSIMD/4
                        time:   [959.39 ns 962.01 ns 965.08 ns]
                        change: [+1.2580% +1.6979% +2.1558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/4
                        time:   [116.25 ns 116.36 ns 116.47 ns]
                        change: [-68.894% -68.668% -68.507%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/5
                        time:   [948.41 ns 949.47 ns 950.65 ns]
                        change: [-1.5866% -1.3651% -1.1355%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/5
                        time:   [116.15 ns 116.26 ns 116.38 ns]
                        change: [-68.397% -68.363% -68.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
ashvardanian commented 6 months ago

Is that all still on the same Ryzen CPU, @ChillFish8?

I was just refreshing the ParallelReductionsBenchmark and added a loop-unrolled variant with scalar code in the C++ layer. It still loses to SIMD, even for f32:

$ build_release/reduce_bench
You did not feed the size of arrays, so we will use a 1GB array!
2024-05-06T00:11:14+00:00
Running build_release/reduce_bench
Run on (160 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x160)
  L1 Instruction 32 KiB (x160)
  L2 Unified 4096 KiB (x80)
  L3 Unified 16384 KiB (x2)
Load Average: 3.23, 19.01, 13.71
----------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------
unrolled<f32>/min_time:10.000/real_time                149618549 ns    149615366 ns           95 bytes/s=7.17653G/s error,%=50
unrolled<f64>/min_time:10.000/real_time                146594731 ns    146593719 ns           95 bytes/s=7.32456G/s error,%=0
avx2<f32>/min_time:10.000/real_time                    110796474 ns    110794861 ns          127 bytes/s=9.69112G/s error,%=50
avx2<f32kahan>/min_time:10.000/real_time               134144762 ns    134137771 ns          105 bytes/s=8.00435G/s error,%=0
avx2<f64>/min_time:10.000/real_time                    115791797 ns    115790878 ns          121 bytes/s=9.27304G/s error,%=0

You can find more results in that repo's README.

ChillFish8 commented 6 months ago

Hey, yes, but it is worth noting what is happening under the hood in my last comment: LLVM is autovectorizing that loop and using FMA instructions because it has been allowed to assume AVX2 and FMA support.
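
Concretely, what LLVM emits for the unrolled scalar loop under those flags is roughly equivalent to writing the FMA intrinsics by hand. A sketch of such an AVX2+FMA dot product (my illustration only, not the code used in any of the benchmarks above):

use std::arch::x86_64::*;

/// Roughly what the autovectorized loop boils down to on x86_64.
/// Safety: the caller must have verified AVX2 and FMA support at runtime
/// (e.g. with is_x86_feature_detected!) and must pass equal-length slices
/// whose length is a multiple of 8.
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    debug_assert_eq!(a.len() % 8, 0);
    let mut acc = _mm256_setzero_ps();
    let mut i = 0;
    while i < a.len() {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, fused
        i += 8;
    }
    // Horizontal reduction of the 8 lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}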

ashvardanian commented 3 months ago

I believe this is related to #148 and can be improved with the next PR 🤗

ashvardanian commented 2 months ago

Hey, @ChillFish8! Are you observing the same performance issues with the most recent 5.0.1 release as well?

ChillFish8 commented 2 months ago

I can add it back to our benchmarks and give it a test; I'll let you know shortly.

ChillFish8 commented 2 months ago

Adding simsimd back to our benchmarks on the distance functions, it seems better, but there is definitely something wrong with the f64 types, and there is some overhead on f32:

Timer precision: 20 ns
bench_distance_ops  fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ cosine                         │               │               │               │         │
│  ├─ cfavml                      │               │               │               │         │
│  │  ├─ f32        151.5 ns      │ 210.6 ns      │ 155.2 ns      │ 162 ns        │ 500     │ 2500000
│  │  │             10.13 Gitem/s │ 7.29 Gitem/s  │ 9.896 Gitem/s │ 9.476 Gitem/s │         │
│  │  ╰─ f64        282.9 ns      │ 290.7 ns      │ 285.8 ns      │ 286 ns        │ 500     │ 2500000
│  │                5.428 Gitem/s │ 5.282 Gitem/s │ 5.373 Gitem/s │ 5.369 Gitem/s │         │
│  ├─ ndarray                     │               │               │               │         │
│  │  ├─ f32        382.3 ns      │ 625.2 ns      │ 394.7 ns      │ 396.2 ns      │ 500     │ 2500000
│  │  │             4.017 Gitem/s │ 2.456 Gitem/s │ 3.89 Gitem/s  │ 3.875 Gitem/s │         │
│  │  ╰─ f64        412.1 ns      │ 521.6 ns      │ 423.5 ns      │ 425.8 ns      │ 500     │ 2500000
│  │                3.726 Gitem/s │ 2.944 Gitem/s │ 3.626 Gitem/s │ 3.606 Gitem/s │         │
│  ╰─ simsimd                     │               │               │               │         │
│     ├─ f32        163.5 ns      │ 206.7 ns      │ 166.7 ns      │ 169.4 ns      │ 500     │ 2500000
│     │             9.39 Gitem/s  │ 7.429 Gitem/s │ 9.212 Gitem/s │ 9.063 Gitem/s │         │
│     ╰─ f64        1.004 µs      │ 1.142 µs      │ 1.011 µs      │ 1.013 µs      │ 500     │ 2500000
│                   1.529 Gitem/s │ 1.344 Gitem/s │ 1.519 Gitem/s │ 1.515 Gitem/s │         │
├─ dot_product                    │               │               │               │         │
│  ├─ cfavml                      │               │               │               │         │
│  │  ├─ f32        60.46 ns      │ 65.74 ns      │ 60.87 ns      │ 61.3 ns       │ 500     │ 2500000
│  │  │             25.4 Gitem/s  │ 23.36 Gitem/s │ 25.23 Gitem/s │ 25.05 Gitem/s │         │
│  │  ╰─ f64        158 ns        │ 184 ns        │ 162.7 ns      │ 162.1 ns      │ 500     │ 2500000
│  │                9.719 Gitem/s │ 8.343 Gitem/s │ 9.439 Gitem/s │ 9.471 Gitem/s │         │
│  ├─ ndarray                     │               │               │               │         │
│  │  ├─ f32        68.83 ns      │ 75.31 ns      │ 69.65 ns      │ 69.97 ns      │ 500     │ 2500000
│  │  │             22.31 Gitem/s │ 20.39 Gitem/s │ 22.05 Gitem/s │ 21.95 Gitem/s │         │
│  │  ╰─ f64        170.2 ns      │ 196.5 ns      │ 171.7 ns      │ 172.4 ns      │ 500     │ 2500000
│  │                9.023 Gitem/s │ 7.815 Gitem/s │ 8.94 Gitem/s  │ 8.907 Gitem/s │         │
│  ╰─ simsimd                     │               │               │               │         │
│     ├─ f32        152.5 ns      │ 180.1 ns      │ 153.9 ns      │ 154.2 ns      │ 500     │ 2500000
│     │             10.06 Gitem/s │ 8.525 Gitem/s │ 9.979 Gitem/s │ 9.959 Gitem/s │         │
│     ╰─ f64        960.3 ns      │ 1.007 µs      │ 969.9 ns      │ 970.6 ns      │ 500     │ 2500000
│                   1.599 Gitem/s │ 1.524 Gitem/s │ 1.583 Gitem/s │ 1.582 Gitem/s │         │
╰─ euclidean                      │               │               │               │         │
   ├─ cfavml                      │               │               │               │         │
   │  ├─ f32        55.67 ns      │ 64.6 ns       │ 56.94 ns      │ 57.26 ns      │ 500     │ 2500000
   │  │             27.58 Gitem/s │ 23.77 Gitem/s │ 26.97 Gitem/s │ 26.82 Gitem/s │         │
   │  ╰─ f64        133.4 ns      │ 145.4 ns      │ 138.2 ns      │ 138 ns        │ 500     │ 2500000
   │                11.51 Gitem/s │ 10.56 Gitem/s │ 11.11 Gitem/s │ 11.12 Gitem/s │         │
   ├─ ndarray                     │               │               │               │         │
   │  ├─ f32        224.8 ns      │ 361.5 ns      │ 229.9 ns      │ 232.9 ns      │ 500     │ 2500000
   │  │             6.83 Gitem/s  │ 4.248 Gitem/s │ 6.679 Gitem/s │ 6.593 Gitem/s │         │
   │  ╰─ f64        435.4 ns      │ 506.4 ns      │ 443.8 ns      │ 446.8 ns      │ 500     │ 2500000
   │                3.527 Gitem/s │ 3.032 Gitem/s │ 3.46 Gitem/s  │ 3.437 Gitem/s │         │
   ╰─ simsimd                     │               │               │               │         │
      ├─ f32        154.5 ns      │ 208.2 ns      │ 156.3 ns      │ 158.3 ns      │ 500     │ 2500000
      │             9.94 Gitem/s  │ 7.374 Gitem/s │ 9.823 Gitem/s │ 9.7 Gitem/s   │         │
      ╰─ f64        969.4 ns      │ 1.051 µs      │ 978.8 ns      │ 987.3 ns      │ 500     │ 2500000
                    1.584 Gitem/s │ 1.46 Gitem/s  │ 1.569 Gitem/s │ 1.555 Gitem/s │         │
ChillFish8 commented 2 months ago

On an AVX-512-capable Zen 4 chip it behaves effectively as expected:

Timer precision: 9 ns
bench_distance_ops  fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ cosine                         │               │               │               │         │
│  ├─ cfavml                      │               │               │               │         │
│  │  ├─ f32        121 ns        │ 211.2 ns      │ 123.2 ns      │ 125.6 ns      │ 2500    │ 12500000
│  │  │             12.68 Gitem/s │ 7.27 Gitem/s  │ 12.46 Gitem/s │ 12.22 Gitem/s │         │
│  │  ╰─ f64        245.1 ns      │ 267.7 ns      │ 248 ns        │ 248.3 ns      │ 2500    │ 12500000
│  │                6.266 Gitem/s │ 5.737 Gitem/s │ 6.191 Gitem/s │ 6.185 Gitem/s │         │
│  ├─ ndarray                     │               │               │               │         │
│  │  ├─ f32        344.1 ns      │ 366.7 ns      │ 347.6 ns      │ 347.8 ns      │ 2500    │ 12500000
│  │  │             4.463 Gitem/s │ 4.188 Gitem/s │ 4.417 Gitem/s │ 4.416 Gitem/s │         │
│  │  ╰─ f64        369.1 ns      │ 391.6 ns      │ 374.2 ns      │ 374.2 ns      │ 2500    │ 12500000
│  │                4.16 Gitem/s  │ 3.922 Gitem/s │ 4.103 Gitem/s │ 4.103 Gitem/s │         │
│  ╰─ simsimd                     │               │               │               │         │
│     ├─ f32        75.79 ns      │ 95.61 ns      │ 79.98 ns      │ 79.96 ns      │ 2500    │ 12500000
│     │             20.26 Gitem/s │ 16.06 Gitem/s │ 19.2 Gitem/s  │ 19.2 Gitem/s  │         │
│     ╰─ f64        150.2 ns      │ 172.8 ns      │ 154.2 ns      │ 154.2 ns      │ 2500    │ 12500000
│                   10.22 Gitem/s │ 8.885 Gitem/s │ 9.958 Gitem/s │ 9.955 Gitem/s │         │
├─ dot_product                    │               │               │               │         │
│  ├─ cfavml                      │               │               │               │         │
│  │  ├─ f32        55.12 ns      │ 73.42 ns      │ 55.4 ns       │ 55.53 ns      │ 2500    │ 12500000
│  │  │             27.86 Gitem/s │ 20.91 Gitem/s │ 27.72 Gitem/s │ 27.65 Gitem/s │         │
│  │  ╰─ f64        111.5 ns      │ 129.1 ns      │ 112.5 ns      │ 112.4 ns      │ 2500    │ 12500000
│  │                13.76 Gitem/s │ 11.88 Gitem/s │ 13.64 Gitem/s │ 13.65 Gitem/s │         │
│  ├─ ndarray                     │               │               │               │         │
│  │  ├─ f32        58.89 ns      │ 64.83 ns      │ 59.97 ns      │ 60.03 ns      │ 2500    │ 12500000
│  │  │             26.07 Gitem/s │ 23.69 Gitem/s │ 25.61 Gitem/s │ 25.58 Gitem/s │         │
│  │  ╰─ f64        114.9 ns      │ 135.4 ns      │ 116.9 ns      │ 117 ns        │ 2500    │ 12500000
│  │                13.35 Gitem/s │ 11.33 Gitem/s │ 13.12 Gitem/s │ 13.11 Gitem/s │         │
│  ╰─ simsimd                     │               │               │               │         │
│     ├─ f32        65.18 ns      │ 70.49 ns      │ 66.21 ns      │ 66.25 ns      │ 2500    │ 12500000
│     │             23.56 Gitem/s │ 21.78 Gitem/s │ 23.19 Gitem/s │ 23.18 Gitem/s │         │
│     ╰─ f64        140.8 ns      │ 157.9 ns      │ 144.5 ns      │ 144.7 ns      │ 2500    │ 12500000
│                   10.9 Gitem/s  │ 9.722 Gitem/s │ 10.62 Gitem/s │ 10.6 Gitem/s  │         │
╰─ euclidean                      │               │               │               │         │
   ├─ cfavml                      │               │               │               │         │
   │  ├─ f32        51.01 ns      │ 69.38 ns      │ 51.82 ns      │ 51.9 ns       │ 2500    │ 12500000
   │  │             30.1 Gitem/s  │ 22.13 Gitem/s │ 29.63 Gitem/s │ 29.59 Gitem/s │         │
   │  ╰─ f64        101.6 ns      │ 119.2 ns      │ 103.3 ns      │ 103.4 ns      │ 2500    │ 12500000
   │                15.1 Gitem/s  │ 12.88 Gitem/s │ 14.85 Gitem/s │ 14.84 Gitem/s │         │
   ├─ ndarray                     │               │               │               │         │
   │  ├─ f32        189.8 ns      │ 213.9 ns      │ 196.5 ns      │ 196.5 ns      │ 2500    │ 12500000
   │  │             8.09 Gitem/s  │ 7.177 Gitem/s │ 7.815 Gitem/s │ 7.814 Gitem/s │         │
   │  ╰─ f64        328 ns        │ 346.5 ns      │ 330.4 ns      │ 330.7 ns      │ 2500    │ 12500000
   │                4.681 Gitem/s │ 4.432 Gitem/s │ 4.648 Gitem/s │ 4.643 Gitem/s │         │
   ╰─ simsimd                     │               │               │               │         │
      ├─ f32        69.1 ns       │ 87.75 ns      │ 70.22 ns      │ 70.25 ns      │ 2500    │ 12500000
      │             22.22 Gitem/s │ 17.5 Gitem/s  │ 21.87 Gitem/s │ 21.86 Gitem/s │         │
      ╰─ f64        146.7 ns      │ 166 ns        │ 149.4 ns      │ 149.5 ns      │ 2500    │ 12500000
                    10.46 Gitem/s │ 9.25 Gitem/s  │ 10.27 Gitem/s │ 10.27 Gitem/s │         │
ashvardanian commented 2 months ago

Which machine are these numbers coming from? Is that an Arm machine? Is there SVE available?

ChillFish8 commented 2 months ago

They are from a Ryzen Zen 3 chip:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 33
model name      : AMD Ryzen 9 5900X 12-Core Processor
stepping        : 2
microcode       : 0xa201204
cpu MHz         : 2874.313
cache size      : 512 KB
physical id     : 0
siblings        : 24
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 7400.03
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
ChillFish8 commented 2 months ago

I'm not sure if it is any help, but the behaviour the f64 implementation is showing seems to mimic what happens when a target feature is missing, where LLVM effectively emulates the intrinsic call rather than emitting the right instruction.
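
For what it's worth, a quick way to rule out a plain detection problem on this machine is to print what the standard runtime feature checks report (a minimal sketch on my side; I haven't inspected simsimd's own dispatch logic):

fn main() {
    // Features the AVX2/FMA kernels would rely on, on this Zen 3 box.
    println!("avx2:    {}", is_x86_feature_detected!("avx2"));
    println!("fma:     {}", is_x86_feature_detected!("fma"));
    println!("avx512f: {}", is_x86_feature_detected!("avx512f"));
}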

ashvardanian commented 2 months ago

In some cases, on older AMD CPUs, the latency of some instructions was too high and the compilers preferred using serial code. I think for now we can close this issue, but it's good to keep those differences in mind for future benchmarks. Thank you, @ChillFish8!

ChillFish8 commented 2 months ago

While I think that assumption is wrong, ultimately it is your choice. Regardless, I think it may be worth making a note of this performance footgun in the library, because, generally speaking, it becomes unusable for anyone running on most AMD server hardware and likely any other CPU with only AVX2 and FMA (AWS and GCP general-purpose compute instances, for example).