ChillFish8 closed this issue 2 months ago
Hi @ChillFish8! Which version of SimSIMD are you using?
AVX2 for float32 is practically the only SIMD+datatype combo we don't implement, as that's the only one that compilers vectorize well 😆 But your result is still very weird. Do you have a project I can clone and run to reproduce that?
I can't currently give access to the project this is run on, but I can give a copy of the benchmark file minus some of the custom AVX stuff; realistically it is probably best to just worry about simsimd vs rust vs blas for this issue.
cargo.toml
[package]
name = "benchmark-demo"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

[dev-dependencies]
rand = "0.8.5"
simsimd = "4.2.2"
criterion = { version = "0.5.1", features = ["html_reports"] }

[target.'cfg(unix)'.dev-dependencies]
ndarray = { version = "0.15.6", features = ["blas"] }
blas-src = { version = "0.8", features = ["openblas"] }
openblas-src = { version = "0.10", features = ["cblas", "system"] }

[target.'cfg(not(unix))'.dev-dependencies]
ndarray = "0.15.6"
bench_dot_product.rs
#[cfg(unix)]
extern crate blas_src;

use std::hint::black_box;
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};
use simsimd::SpatialSimilarity;

fn simsimd_dot(a: &[f32], b: &[f32]) -> f32 {
    f32::dot(a, b).unwrap_or_default() as f32
}

fn ndarray_dot(a: &ndarray::Array1<f32>, b: &ndarray::Array1<f32>) -> f32 {
    a.dot(b)
}

unsafe fn fallback_dot_product_demo<const DIMS: usize>(
    a: &[f32],
    b: &[f32],
) -> f32 {
    debug_assert_eq!(
        b.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        a.len(),
        DIMS,
        "Improper implementation detected, vectors must match constant"
    );
    debug_assert_eq!(
        DIMS % 8,
        0,
        "DIMS must be able to fit entirely into chunks of 8 lanes."
    );

    let mut i = 0;

    // We do this manual unrolling to allow the compiler to vectorize
    // the loop and avoid some branching even if we're not doing it explicitly.
    // This made a significant difference in benchmarking ~4x
    let mut acc1 = 0.0;
    let mut acc2 = 0.0;
    let mut acc3 = 0.0;
    let mut acc4 = 0.0;
    let mut acc5 = 0.0;
    let mut acc6 = 0.0;
    let mut acc7 = 0.0;
    let mut acc8 = 0.0;

    while i < a.len() {
        let a1 = *a.get_unchecked(i);
        let a2 = *a.get_unchecked(i + 1);
        let a3 = *a.get_unchecked(i + 2);
        let a4 = *a.get_unchecked(i + 3);
        let a5 = *a.get_unchecked(i + 4);
        let a6 = *a.get_unchecked(i + 5);
        let a7 = *a.get_unchecked(i + 6);
        let a8 = *a.get_unchecked(i + 7);

        let b1 = *b.get_unchecked(i);
        let b2 = *b.get_unchecked(i + 1);
        let b3 = *b.get_unchecked(i + 2);
        let b4 = *b.get_unchecked(i + 3);
        let b5 = *b.get_unchecked(i + 4);
        let b6 = *b.get_unchecked(i + 5);
        let b7 = *b.get_unchecked(i + 6);
        let b8 = *b.get_unchecked(i + 7);

        acc1 = acc1 + (a1 * b1);
        acc2 = acc2 + (a2 * b2);
        acc3 = acc3 + (a3 * b3);
        acc4 = acc4 + (a4 * b4);
        acc5 = acc5 + (a5 * b5);
        acc6 = acc6 + (a6 * b6);
        acc7 = acc7 + (a7 * b7);
        acc8 = acc8 + (a8 * b8);

        i += 8;
    }

    acc1 = acc1 + acc2;
    acc3 = acc3 + acc4;
    acc5 = acc5 + acc6;
    acc7 = acc7 + acc8;

    acc1 = acc1 + acc3;
    acc5 = acc5 + acc7;

    acc1 + acc5
}

macro_rules! repeat {
    ($n:expr, $val:block) => {{
        for _ in 0..$n {
            black_box($val);
        }
    }};
}

fn criterion_benchmark(c: &mut Criterion) {
    // Hey, this benchmark behaves drastically differently if you are on Windows vs unix.
    // This is because on unix we do a more realistic benchmark and compare ndarray backed
    // by openblas rather than with the standard rust impl.
    c.bench_function("dot ndarray 1024 auto", |b| {
        use ndarray::Array1;

        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        let v1 = Array1::from_shape_vec((1024,), v1).unwrap();
        let v2 = Array1::from_shape_vec((1024,), v2).unwrap();

        b.iter(|| repeat!(1000, { ndarray_dot(black_box(&v1), black_box(&v2)) }))
    });

    c.bench_function("dot simsimd 1024 auto", |b| {
        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        b.iter(|| repeat!(1000, { simsimd_dot(black_box(&v1), black_box(&v2)) }))
    });

    c.bench_function("dot fallback 1024 nofma", |b| {
        let mut v1 = Vec::new();
        let mut v2 = Vec::new();
        for _ in 0..1024 {
            v1.push(rand::random());
            v2.push(rand::random());
        }

        b.iter(|| repeat!(1000, {
            unsafe { fallback_dot_product_demo::<1024>(black_box(&v1), black_box(&v2)) }
        }))
    });
}

criterion_group!(
    name = benches;
    config = Criterion::default()
        .measurement_time(Duration::from_secs(60))
        .sample_size(500);
    targets = criterion_benchmark
);
criterion_main!(benches);
To be more specific, the numbers simsimd is getting for AVX2 and f32 values seem to be more or less in line with iterating through the two vectors and getting the dot product, but without the compiler being able to correctly vectorize the loop. So maybe the compiler for simsimd is not actually vectorizing the loop fully, or at all.
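For reference, the kind of plain scalar loop those timings are in line with looks roughly like this (an illustrative sketch, not SimSIMD's kernel):

fn naive_dot(a: &[f32], b: &[f32]) -> f32 {
    // Single accumulator: the strict left-to-right f32 summation stops the
    // compiler from reassociating the reduction, so without fast-math-style
    // tricks this loop stays essentially scalar.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}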
The SimSIMD repository contains Rust benchmarks against native implementations. Maybe they are poorly implemented... Can you try cloning the SimSIMD repository and then running the benchmarks, as described in CONTRIBUTING.md?
cargo bench
Please check out both the main branch version and main-dev. I'd be happy to optimize the kernels further, but I am not sure that is possible. If the issue persists, it might be related to compilation settings 🤗
Using the repo benches, by default I get:
SIMD Cosine/SimSIMD/0 time: [990.33 ns 991.20 ns 992.21 ns]
change: [-0.5469% -0.1196% +0.1941%] (p = 0.62 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
SIMD Cosine/Rust Native/0
time: [997.99 ns 1.0023 µs 1.0066 µs]
change: [+0.8535% +1.1800% +1.5240%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
SIMD Cosine/SimSIMD/1 time: [1.0071 µs 1.0112 µs 1.0159 µs]
change: [-0.5979% -0.0751% +0.4182%] (p = 0.77 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild
SIMD Cosine/Rust Native/1
time: [995.26 ns 997.31 ns 999.95 ns]
change: [-3.4249% -2.3587% -1.4896%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
SIMD Cosine/SimSIMD/2 time: [992.49 ns 993.86 ns 995.36 ns]
change: [-0.6670% -0.3172% +0.0164%] (p = 0.07 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
SIMD Cosine/Rust Native/2
time: [999.39 ns 1.0017 µs 1.0040 µs]
change: [+0.8312% +1.0924% +1.3528%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
7 (7.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
SIMD Cosine/SimSIMD/3 time: [999.12 ns 1.0029 µs 1.0071 µs]
change: [-0.8765% -0.3084% +0.1971%] (p = 0.28 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
SIMD Cosine/Rust Native/3
time: [995.69 ns 997.72 ns 999.69 ns]
change: [+0.6852% +0.9139% +1.1508%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low severe
4 (4.00%) low mild
5 (5.00%) high mild
SIMD Cosine/SimSIMD/4 time: [989.46 ns 991.39 ns 993.36 ns]
change: [-2.4808% -1.7419% -1.1702%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
SIMD Cosine/Rust Native/4
time: [984.42 ns 985.22 ns 986.16 ns]
change: [-1.9665% -1.4544% -0.9763%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
SIMD Cosine/SimSIMD/5 time: [984.21 ns 985.94 ns 987.71 ns]
change: [-1.6544% -1.1956% -0.8287%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD Cosine/Rust Native/5
time: [987.03 ns 988.30 ns 989.81 ns]
change: [+1.0143% +1.1866% +1.3575%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-1c498acee1c38350)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
time: [964.05 ns 967.69 ns 971.67 ns]
change: [-1.6473% -1.2355% -0.8248%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
9 (9.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/0
time: [973.53 ns 975.20 ns 977.10 ns]
change: [+186.66% +187.37% +188.16%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
SIMD SqEuclidean/SimSIMD/1
time: [952.89 ns 954.25 ns 955.68 ns]
change: [-2.9500% -2.5561% -2.2074%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/1
time: [973.70 ns 975.53 ns 977.30 ns]
change: [+186.14% +186.69% +187.28%] (p = 0.00 < 0.05)
Performance has regressed.
SIMD SqEuclidean/SimSIMD/2
time: [965.95 ns 968.58 ns 971.30 ns]
change: [-1.8963% -1.5119% -1.1299%] (p = 0.00 < 0.05)
Performance has improved.
SIMD SqEuclidean/Rust Native/2
time: [971.81 ns 973.68 ns 975.83 ns]
change: [+181.90% +183.47% +184.85%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/3
time: [957.05 ns 958.81 ns 960.71 ns]
change: [-3.2849% -2.8105% -2.3846%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/3
time: [971.49 ns 972.77 ns 974.15 ns]
change: [+177.36% +179.33% +181.00%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/4
time: [958.75 ns 962.49 ns 966.77 ns]
change: [-2.8413% -2.4086% -2.0098%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
SIMD SqEuclidean/Rust Native/4
time: [977.67 ns 981.15 ns 984.38 ns]
change: [+183.37% +184.79% +186.12%] (p = 0.00 < 0.05)
Performance has regressed.
SIMD SqEuclidean/SimSIMD/5
time: [957.25 ns 959.29 ns 961.63 ns]
change: [-3.4224% -3.1009% -2.8216%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/5
time: [977.04 ns 979.62 ns 982.15 ns]
change: [+182.34% +184.11% +185.86%] (p = 0.00 < 0.05)
Performance has regressed.
If I use the changes in PR #108 I get the following:
SIMD Cosine/SimSIMD/0 time: [995.61 ns 997.99 ns 1.0008 µs]
change: [+0.1468% +0.4000% +0.6799%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
8 (8.00%) high mild
5 (5.00%) high severe
SIMD Cosine/Rust Native/0
time: [755.37 ns 758.73 ns 764.37 ns]
change: [-24.342% -24.086% -23.766%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
SIMD Cosine/SimSIMD/1 time: [985.11 ns 986.34 ns 987.60 ns]
change: [-2.4883% -2.1633% -1.8513%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
SIMD Cosine/Rust Native/1
time: [752.29 ns 754.33 ns 757.00 ns]
change: [-25.113% -24.900% -24.675%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
SIMD Cosine/SimSIMD/2 time: [987.52 ns 988.61 ns 989.83 ns]
change: [-0.5561% -0.3441% -0.1497%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) high mild
4 (4.00%) high severe
SIMD Cosine/Rust Native/2
time: [751.62 ns 752.32 ns 753.19 ns]
change: [-25.024% -24.896% -24.770%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
SIMD Cosine/SimSIMD/3 time: [987.02 ns 988.13 ns 989.34 ns]
change: [-1.7928% -1.4180% -1.0880%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
SIMD Cosine/Rust Native/3
time: [751.43 ns 751.82 ns 752.29 ns]
change: [-25.020% -24.925% -24.828%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
SIMD Cosine/SimSIMD/4 time: [989.97 ns 990.71 ns 991.66 ns]
change: [-0.0446% +0.1065% +0.2536%] (p = 0.17 > 0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
SIMD Cosine/Rust Native/4
time: [750.46 ns 751.02 ns 751.60 ns]
change: [-23.947% -23.833% -23.728%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
SIMD Cosine/SimSIMD/5 time: [988.47 ns 989.15 ns 989.97 ns]
change: [+0.4132% +0.5962% +0.7772%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
SIMD Cosine/Rust Native/5
time: [751.42 ns 752.31 ns 753.38 ns]
change: [-24.095% -23.966% -23.843%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-1c498acee1c38350)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
time: [954.47 ns 956.11 ns 957.70 ns]
change: [-1.1162% -0.7026% -0.3014%] (p = 0.00 < 0.05)
Change within noise threshold.
SIMD SqEuclidean/Rust Native/0
time: [366.84 ns 367.18 ns 367.53 ns]
change: [-62.453% -62.353% -62.261%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/1
time: [946.73 ns 947.48 ns 948.28 ns]
change: [-0.9722% -0.8084% -0.6503%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
SIMD SqEuclidean/Rust Native/1
time: [365.67 ns 365.83 ns 366.01 ns]
change: [-62.469% -62.396% -62.323%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe
SIMD SqEuclidean/SimSIMD/2
time: [947.38 ns 949.31 ns 951.74 ns]
change: [-2.0238% -1.7564% -1.4912%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
9 (9.00%) high mild
4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/2
time: [365.85 ns 366.11 ns 366.40 ns]
change: [-62.605% -62.540% -62.476%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe
SIMD SqEuclidean/SimSIMD/3
time: [952.75 ns 954.40 ns 956.08 ns]
change: [-0.7782% -0.3103% +0.1819%] (p = 0.25 > 0.05)
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/3
time: [367.71 ns 368.50 ns 369.52 ns]
change: [-62.255% -62.179% -62.096%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) high mild
5 (5.00%) high severe
SIMD SqEuclidean/SimSIMD/4
time: [946.24 ns 947.68 ns 949.34 ns]
change: [-1.3054% -0.9476% -0.5958%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/4
time: [368.88 ns 370.15 ns 371.65 ns]
change: [-62.285% -62.067% -61.779%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) high mild
9 (9.00%) high severe
SIMD SqEuclidean/SimSIMD/5
time: [954.79 ns 955.77 ns 956.94 ns]
change: [-0.1110% +0.1493% +0.4162%] (p = 0.26 > 0.05)
No change in performance detected.
SIMD SqEuclidean/Rust Native/5
time: [366.50 ns 366.84 ns 367.23 ns]
change: [-62.811% -62.688% -62.566%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
9 (9.00%) high mild
3 (3.00%) high severe
The compiler command being run to compile the C code is:
"cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-I" "include" "-O3" "-std=c99" "-pedantic" "-DSIMSIMD_NATIVE_F16=0" "-DSIMSIMD_DYNAMIC_DISPATCH=1" "-DSIMSIMD_TARGET_SAPPHIRE=0" "-o" "/home/personal/simsimd/target/release/build/simsimd-be318405a648c44f/out/c/lib.o" "-c" "c/lib.c"
If we tell the compiler that avx2 and fma can be targeted, we get an even faster version of the native Rust code, but no effect on the C side:
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo bench -- --nocapture
SIMD Cosine/SimSIMD/0 time: [981.74 ns 983.39 ns 985.48 ns]
change: [-1.5396% -1.2668% -0.9837%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
4 (4.00%) low mild
5 (5.00%) high mild
6 (6.00%) high severe
SIMD Cosine/Rust Native/0
time: [130.86 ns 130.95 ns 131.06 ns]
change: [-82.739% -82.683% -82.640%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low mild
4 (4.00%) high mild
4 (4.00%) high severe
SIMD Cosine/SimSIMD/1 time: [983.62 ns 985.05 ns 987.02 ns]
change: [-0.5092% -0.3685% -0.2163%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe
SIMD Cosine/Rust Native/1
time: [131.07 ns 131.21 ns 131.34 ns]
change: [-82.568% -82.529% -82.498%] (p = 0.00 < 0.05)
Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
6 (6.00%) low severe
9 (9.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
SIMD Cosine/SimSIMD/2 time: [981.05 ns 982.28 ns 983.70 ns]
change: [-1.0060% -0.8903% -0.7706%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
SIMD Cosine/Rust Native/2
time: [131.01 ns 131.09 ns 131.17 ns]
change: [-82.575% -82.548% -82.516%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
SIMD Cosine/SimSIMD/3 time: [980.46 ns 981.49 ns 982.76 ns]
change: [-0.2110% -0.0435% +0.1324%] (p = 0.64 > 0.05)
No change in performance detected.
SIMD Cosine/Rust Native/3
time: [130.89 ns 131.03 ns 131.24 ns]
change: [-82.550% -82.529% -82.510%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low mild
4 (4.00%) high mild
4 (4.00%) high severe
SIMD Cosine/SimSIMD/4 time: [978.19 ns 978.80 ns 979.51 ns]
change: [-1.2474% -1.1591% -1.0734%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
SIMD Cosine/Rust Native/4
time: [131.07 ns 131.18 ns 131.28 ns]
change: [-82.580% -82.562% -82.546%] (p = 0.00 < 0.05)
Performance has improved.
SIMD Cosine/SimSIMD/5 time: [982.41 ns 982.88 ns 983.39 ns]
change: [-0.9772% -0.8781% -0.7844%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD Cosine/Rust Native/5
time: [132.08 ns 132.25 ns 132.44 ns]
change: [-82.460% -82.416% -82.372%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
Running rust/benches/sqeuclidean.rs (target/release/deps/sqeuclidean-789b6d1bba04e87b)
Gnuplot not found, using plotters backend
SIMD SqEuclidean/SimSIMD/0
time: [953.51 ns 955.58 ns 957.60 ns]
change: [-0.6461% -0.4286% -0.2139%] (p = 0.00 < 0.05)
Change within noise threshold.
SIMD SqEuclidean/Rust Native/0
time: [117.68 ns 120.45 ns 123.87 ns]
change: [-68.098% -67.815% -67.421%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) high mild
7 (7.00%) high severe
SIMD SqEuclidean/SimSIMD/1
time: [955.73 ns 963.38 ns 973.22 ns]
change: [+0.4900% +0.8694% +1.4353%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe
SIMD SqEuclidean/Rust Native/1
time: [116.90 ns 117.05 ns 117.22 ns]
change: [-67.916% -67.849% -67.782%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/SimSIMD/2
time: [948.83 ns 949.71 ns 950.67 ns]
change: [+0.2005% +0.3694% +0.5291%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
SIMD SqEuclidean/Rust Native/2
time: [117.09 ns 117.52 ns 117.91 ns]
change: [-68.257% -68.178% -68.101%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/3
time: [965.79 ns 968.94 ns 972.52 ns]
change: [+1.0966% +1.6960% +2.2373%] (p = 0.00 < 0.05)
Performance has regressed.
SIMD SqEuclidean/Rust Native/3
time: [118.14 ns 118.67 ns 119.21 ns]
change: [-68.157% -68.036% -67.887%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild
SIMD SqEuclidean/SimSIMD/4
time: [959.39 ns 962.01 ns 965.08 ns]
change: [+1.2580% +1.6979% +2.1558%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD SqEuclidean/Rust Native/4
time: [116.25 ns 116.36 ns 116.47 ns]
change: [-68.894% -68.668% -68.507%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
SIMD SqEuclidean/SimSIMD/5
time: [948.41 ns 949.47 ns 950.65 ns]
change: [-1.5866% -1.3651% -1.1355%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
8 (8.00%) high mild
4 (4.00%) high severe
SIMD SqEuclidean/Rust Native/5
time: [116.15 ns 116.26 ns 116.38 ns]
change: [-68.397% -68.363% -68.331%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) low mild
6 (6.00%) high mild
Is that all still on the same Ryzen CPU, @ChillFish8?
I was just refreshing the ParallelReductionsBenchmark and added a loop-unrolled variant with scalar code in the C++ layer. It still loses to SIMD, even for f32:
$ build_release/reduce_bench
You did not feed the size of arrays, so we will use a 1GB array!
2024-05-06T00:11:14+00:00
Running build_release/reduce_bench
Run on (160 X 2100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x160)
L1 Instruction 32 KiB (x160)
L2 Unified 4096 KiB (x80)
L3 Unified 16384 KiB (x2)
Load Average: 3.23, 19.01, 13.71
----------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------
unrolled<f32>/min_time:10.000/real_time 149618549 ns 149615366 ns 95 bytes/s=7.17653G/s error,%=50
unrolled<f64>/min_time:10.000/real_time 146594731 ns 146593719 ns 95 bytes/s=7.32456G/s error,%=0
avx2<f32>/min_time:10.000/real_time 110796474 ns 110794861 ns 127 bytes/s=9.69112G/s error,%=50
avx2<f32kahan>/min_time:10.000/real_time 134144762 ns 134137771 ns 105 bytes/s=8.00435G/s error,%=0
avx2<f64>/min_time:10.000/real_time 115791797 ns 115790878 ns 121 bytes/s=9.27304G/s error,%=0
You can find more results in that repo's README.
Hey, yes, but it is worth noting that in my last comment what is happening under the hood is that LLVM is auto-vectorizing that loop and using FMA instructions, because it has been allowed to assume AVX2 and FMA support.
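To illustrate the pattern (a hedged sketch, not the benchmark's exact code): once the reduction is split across independent accumulators, the compiler can keep the partial sums in a single 256-bit register and, when AVX2 and FMA are assumed, lower the multiply-adds to packed FMA instructions.

// Hedged sketch: eight independent accumulators map onto one 256-bit vector
// of partial sums. Built with RUSTFLAGS="-C target-feature=+avx2,+fma",
// LLVM can turn the inner multiply-adds into packed vfmadd instructions;
// without those features it emits narrower or scalar code.
// Remainder handling is omitted for brevity.
fn unrolled_dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        for lane in 0..8 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    acc.iter().sum()
}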
I believe this is related to #148 and can be improved with the next PR 🤗
Hey, @ChillFish8! Are you observing the same performance issues with the most recent 5.0.1 release as well?
I can add it back to our benchmarks and give it a test; I'll let you know shortly.
Adding simsimd back to our benchmarks on the distance functions, it seems better, but there is definitely something wrong with f64 types and some overhead going on with f32:
Timer precision: 20 ns
bench_distance_ops fastest │ slowest │ median │ mean │ samples │ iters
├─ cosine │ │ │ │ │
│ ├─ cfavml │ │ │ │ │
│ │ ├─ f32 151.5 ns │ 210.6 ns │ 155.2 ns │ 162 ns │ 500 │ 2500000
│ │ │ 10.13 Gitem/s │ 7.29 Gitem/s │ 9.896 Gitem/s │ 9.476 Gitem/s │ │
│ │ ╰─ f64 282.9 ns │ 290.7 ns │ 285.8 ns │ 286 ns │ 500 │ 2500000
│ │ 5.428 Gitem/s │ 5.282 Gitem/s │ 5.373 Gitem/s │ 5.369 Gitem/s │ │
│ ├─ ndarray │ │ │ │ │
│ │ ├─ f32 382.3 ns │ 625.2 ns │ 394.7 ns │ 396.2 ns │ 500 │ 2500000
│ │ │ 4.017 Gitem/s │ 2.456 Gitem/s │ 3.89 Gitem/s │ 3.875 Gitem/s │ │
│ │ ╰─ f64 412.1 ns │ 521.6 ns │ 423.5 ns │ 425.8 ns │ 500 │ 2500000
│ │ 3.726 Gitem/s │ 2.944 Gitem/s │ 3.626 Gitem/s │ 3.606 Gitem/s │ │
│ ╰─ simsimd │ │ │ │ │
│ ├─ f32 163.5 ns │ 206.7 ns │ 166.7 ns │ 169.4 ns │ 500 │ 2500000
│ │ 9.39 Gitem/s │ 7.429 Gitem/s │ 9.212 Gitem/s │ 9.063 Gitem/s │ │
│ ╰─ f64 1.004 µs │ 1.142 µs │ 1.011 µs │ 1.013 µs │ 500 │ 2500000
│ 1.529 Gitem/s │ 1.344 Gitem/s │ 1.519 Gitem/s │ 1.515 Gitem/s │ │
├─ dot_product │ │ │ │ │
│ ├─ cfavml │ │ │ │ │
│ │ ├─ f32 60.46 ns │ 65.74 ns │ 60.87 ns │ 61.3 ns │ 500 │ 2500000
│ │ │ 25.4 Gitem/s │ 23.36 Gitem/s │ 25.23 Gitem/s │ 25.05 Gitem/s │ │
│ │ ╰─ f64 158 ns │ 184 ns │ 162.7 ns │ 162.1 ns │ 500 │ 2500000
│ │ 9.719 Gitem/s │ 8.343 Gitem/s │ 9.439 Gitem/s │ 9.471 Gitem/s │ │
│ ├─ ndarray │ │ │ │ │
│ │ ├─ f32 68.83 ns │ 75.31 ns │ 69.65 ns │ 69.97 ns │ 500 │ 2500000
│ │ │ 22.31 Gitem/s │ 20.39 Gitem/s │ 22.05 Gitem/s │ 21.95 Gitem/s │ │
│ │ ╰─ f64 170.2 ns │ 196.5 ns │ 171.7 ns │ 172.4 ns │ 500 │ 2500000
│ │ 9.023 Gitem/s │ 7.815 Gitem/s │ 8.94 Gitem/s │ 8.907 Gitem/s │ │
│ ╰─ simsimd │ │ │ │ │
│ ├─ f32 152.5 ns │ 180.1 ns │ 153.9 ns │ 154.2 ns │ 500 │ 2500000
│ │ 10.06 Gitem/s │ 8.525 Gitem/s │ 9.979 Gitem/s │ 9.959 Gitem/s │ │
│ ╰─ f64 960.3 ns │ 1.007 µs │ 969.9 ns │ 970.6 ns │ 500 │ 2500000
│ 1.599 Gitem/s │ 1.524 Gitem/s │ 1.583 Gitem/s │ 1.582 Gitem/s │ │
╰─ euclidean │ │ │ │ │
├─ cfavml │ │ │ │ │
│ ├─ f32 55.67 ns │ 64.6 ns │ 56.94 ns │ 57.26 ns │ 500 │ 2500000
│ │ 27.58 Gitem/s │ 23.77 Gitem/s │ 26.97 Gitem/s │ 26.82 Gitem/s │ │
│ ╰─ f64 133.4 ns │ 145.4 ns │ 138.2 ns │ 138 ns │ 500 │ 2500000
│ 11.51 Gitem/s │ 10.56 Gitem/s │ 11.11 Gitem/s │ 11.12 Gitem/s │ │
├─ ndarray │ │ │ │ │
│ ├─ f32 224.8 ns │ 361.5 ns │ 229.9 ns │ 232.9 ns │ 500 │ 2500000
│ │ 6.83 Gitem/s │ 4.248 Gitem/s │ 6.679 Gitem/s │ 6.593 Gitem/s │ │
│ ╰─ f64 435.4 ns │ 506.4 ns │ 443.8 ns │ 446.8 ns │ 500 │ 2500000
│ 3.527 Gitem/s │ 3.032 Gitem/s │ 3.46 Gitem/s │ 3.437 Gitem/s │ │
╰─ simsimd │ │ │ │ │
├─ f32 154.5 ns │ 208.2 ns │ 156.3 ns │ 158.3 ns │ 500 │ 2500000
│ 9.94 Gitem/s │ 7.374 Gitem/s │ 9.823 Gitem/s │ 9.7 Gitem/s │ │
╰─ f64 969.4 ns │ 1.051 µs │ 978.8 ns │ 987.3 ns │ 500 │ 2500000
1.584 Gitem/s │ 1.46 Gitem/s │ 1.569 Gitem/s │ 1.555 Gitem/s │ │
On AVX512 Zen4 it behaves effectively as expected:
Timer precision: 9 ns
bench_distance_ops fastest │ slowest │ median │ mean │ samples │ iters
├─ cosine │ │ │ │ │
│ ├─ cfavml │ │ │ │ │
│ │ ├─ f32 121 ns │ 211.2 ns │ 123.2 ns │ 125.6 ns │ 2500 │ 12500000
│ │ │ 12.68 Gitem/s │ 7.27 Gitem/s │ 12.46 Gitem/s │ 12.22 Gitem/s │ │
│ │ ╰─ f64 245.1 ns │ 267.7 ns │ 248 ns │ 248.3 ns │ 2500 │ 12500000
│ │ 6.266 Gitem/s │ 5.737 Gitem/s │ 6.191 Gitem/s │ 6.185 Gitem/s │ │
│ ├─ ndarray │ │ │ │ │
│ │ ├─ f32 344.1 ns │ 366.7 ns │ 347.6 ns │ 347.8 ns │ 2500 │ 12500000
│ │ │ 4.463 Gitem/s │ 4.188 Gitem/s │ 4.417 Gitem/s │ 4.416 Gitem/s │ │
│ │ ╰─ f64 369.1 ns │ 391.6 ns │ 374.2 ns │ 374.2 ns │ 2500 │ 12500000
│ │ 4.16 Gitem/s │ 3.922 Gitem/s │ 4.103 Gitem/s │ 4.103 Gitem/s │ │
│ ╰─ simsimd │ │ │ │ │
│ ├─ f32 75.79 ns │ 95.61 ns │ 79.98 ns │ 79.96 ns │ 2500 │ 12500000
│ │ 20.26 Gitem/s │ 16.06 Gitem/s │ 19.2 Gitem/s │ 19.2 Gitem/s │ │
│ ╰─ f64 150.2 ns │ 172.8 ns │ 154.2 ns │ 154.2 ns │ 2500 │ 12500000
│ 10.22 Gitem/s │ 8.885 Gitem/s │ 9.958 Gitem/s │ 9.955 Gitem/s │ │
├─ dot_product │ │ │ │ │
│ ├─ cfavml │ │ │ │ │
│ │ ├─ f32 55.12 ns │ 73.42 ns │ 55.4 ns │ 55.53 ns │ 2500 │ 12500000
│ │ │ 27.86 Gitem/s │ 20.91 Gitem/s │ 27.72 Gitem/s │ 27.65 Gitem/s │ │
│ │ ╰─ f64 111.5 ns │ 129.1 ns │ 112.5 ns │ 112.4 ns │ 2500 │ 12500000
│ │ 13.76 Gitem/s │ 11.88 Gitem/s │ 13.64 Gitem/s │ 13.65 Gitem/s │ │
│ ├─ ndarray │ │ │ │ │
│ │ ├─ f32 58.89 ns │ 64.83 ns │ 59.97 ns │ 60.03 ns │ 2500 │ 12500000
│ │ │ 26.07 Gitem/s │ 23.69 Gitem/s │ 25.61 Gitem/s │ 25.58 Gitem/s │ │
│ │ ╰─ f64 114.9 ns │ 135.4 ns │ 116.9 ns │ 117 ns │ 2500 │ 12500000
│ │ 13.35 Gitem/s │ 11.33 Gitem/s │ 13.12 Gitem/s │ 13.11 Gitem/s │ │
│ ╰─ simsimd │ │ │ │ │
│ ├─ f32 65.18 ns │ 70.49 ns │ 66.21 ns │ 66.25 ns │ 2500 │ 12500000
│ │ 23.56 Gitem/s │ 21.78 Gitem/s │ 23.19 Gitem/s │ 23.18 Gitem/s │ │
│ ╰─ f64 140.8 ns │ 157.9 ns │ 144.5 ns │ 144.7 ns │ 2500 │ 12500000
│ 10.9 Gitem/s │ 9.722 Gitem/s │ 10.62 Gitem/s │ 10.6 Gitem/s │ │
╰─ euclidean │ │ │ │ │
├─ cfavml │ │ │ │ │
│ ├─ f32 51.01 ns │ 69.38 ns │ 51.82 ns │ 51.9 ns │ 2500 │ 12500000
│ │ 30.1 Gitem/s │ 22.13 Gitem/s │ 29.63 Gitem/s │ 29.59 Gitem/s │ │
│ ╰─ f64 101.6 ns │ 119.2 ns │ 103.3 ns │ 103.4 ns │ 2500 │ 12500000
│ 15.1 Gitem/s │ 12.88 Gitem/s │ 14.85 Gitem/s │ 14.84 Gitem/s │ │
├─ ndarray │ │ │ │ │
│ ├─ f32 189.8 ns │ 213.9 ns │ 196.5 ns │ 196.5 ns │ 2500 │ 12500000
│ │ 8.09 Gitem/s │ 7.177 Gitem/s │ 7.815 Gitem/s │ 7.814 Gitem/s │ │
│ ╰─ f64 328 ns │ 346.5 ns │ 330.4 ns │ 330.7 ns │ 2500 │ 12500000
│ 4.681 Gitem/s │ 4.432 Gitem/s │ 4.648 Gitem/s │ 4.643 Gitem/s │ │
╰─ simsimd │ │ │ │ │
├─ f32 69.1 ns │ 87.75 ns │ 70.22 ns │ 70.25 ns │ 2500 │ 12500000
│ 22.22 Gitem/s │ 17.5 Gitem/s │ 21.87 Gitem/s │ 21.86 Gitem/s │ │
╰─ f64 146.7 ns │ 166 ns │ 149.4 ns │ 149.5 ns │ 2500 │ 12500000
10.46 Gitem/s │ 9.25 Gitem/s │ 10.27 Gitem/s │ 10.27 Gitem/s │ │
Which machine are these numbers coming from? Is that an Arm machine? Is there SVE available?
They are on a Ryzen Zen3 chip
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 33
model name : AMD Ryzen 9 5900X 12-Core Processor
stepping : 2
microcode : 0xa201204
cpu MHz : 2874.313
cache size : 512 KB
physical id : 0
siblings : 24
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips : 7400.03
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
I'm not sure if it is any help, but the behaviour the f64 implementation is showing seems to mimic what happens when a target feature is missing, where LLVM effectively emulates the intrinsic call rather than emitting the actual instruction.
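For context, the usual way to avoid that pitfall on the Rust side is to gate the kernel on explicit target features and select it at runtime (an illustrative sketch only, not SimSIMD's dispatch code):

// Illustrative sketch: the fast path is only reached when the CPU actually
// reports AVX2 and FMA, so the compiler never has to emulate missing features.
// Remainder handling is omitted for brevity.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    // Inside this function the compiler may freely use AVX2/FMA, so even this
    // plain unrolled loop can be lowered to packed FMAs.
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        for lane in 0..8 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    acc.iter().sum()
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    // Portable scalar fallback.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}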
In some cases, on older AMD CPUs, the latency of some instructions was too high and the compilers preferred using serial code. I think for now we can close this issue, but it's good to keep those differences in mind for future benchmarks. Thank you, @ChillFish8!
While I think that assumption is wrong, ultimately it is your choice. Regardless, I think it may be worth making a note of this performance footgun in the library. Generally speaking, this makes the library unusable for anyone running on most AMD server hardware, and likely any other CPU supporting only AVX2 and FMA (AWS and GCP general-compute instances, for example).
Recently we've been implementing some spatial distance functions and benchmarking them against some existing libraries. When testing with high-dimensional data (1024 dims) we observe simsimd taking on average 619 ns per vector, compared to ndarray (when backed by OpenBLAS) taking 43 ns, or an optimized bit of pure Rust taking 234 ns and 95 ns with ffast-math-like intrinsics disabled/enabled respectively. These benchmarks are taken with Criterion doing 1,000 vector ops per iteration in order to account for any clock accuracy issues due to the low ns times.
Notes
- AMD Ryzen 9 5900X 12-Core Processor, 3701 MHz, 12 Core(s), 24 Logical Processor(s)
- Criterion 0.5.1, OpenBLAS 0.3.25
- RUSTFLAGS="-C target-feature=+avx2,+fma"
- RUSTFLAGS="-C target-cpu=native"
Loose benchmark structure (within Criterion)
There is a bit too much code to paste the exact benchmarks, but each step is the following:
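Roughly, each one follows the same shape as the demo file shared earlier (a paraphrased sketch, not the exact benchmark code): build two random 1024-dimensional vectors once, then time 1,000 distance calls per Criterion iteration under black_box.

use std::hint::black_box;
use criterion::Criterion;

// Paraphrased sketch of the benchmark structure: 1,000 ops per iteration to
// smooth out timer precision at these low-nanosecond scales.
fn bench_dot(c: &mut Criterion, name: &str, dot: impl Fn(&[f32], &[f32]) -> f32) {
    let v1: Vec<f32> = (0..1024).map(|_| rand::random()).collect();
    let v2: Vec<f32> = (0..1024).map(|_| rand::random()).collect();
    c.bench_function(name, |b| {
        b.iter(|| {
            for _ in 0..1000 {
                black_box(dot(black_box(&v1), black_box(&v2)));
            }
        })
    });
}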
Pure Rust impl
Below is a fallback impl I've made. For simplicity, I've removed the generic that was used to replace regular math operations with their ffast-math equivalents when running the dot fallback 1024 fma benchmark; however, the asm for dot fallback 1024 nofma is identical.
Notes
- DIMS is required to be a multiple of 8 so we don't have an additional loop to do the remainder if DIMS were not a multiple of 8; that being said, even with that final loop, the difference is minimal.
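For completeness, the remainder loop mentioned above would look roughly like this (an illustrative sketch, not the benchmark code):

// Illustrative sketch: process full 8-lane chunks first, then a small scalar
// tail for lengths that are not a multiple of 8; the tail is at most 7
// elements, so its cost is negligible.
fn dot_with_tail(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let full = a.len() / 8 * 8;
    for i in (0..full).step_by(8) {
        for lane in 0..8 {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }
    let mut total: f32 = acc.iter().sum();
    for i in full..a.len() {
        total += a[i] * b[i];
    }
    total
}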