ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Optimize Rust impls #108

Closed · ChillFish8 closed this 2 months ago

ChillFish8 commented 2 months ago

Related to #107

Optimizes the native implementations in a way that the compiler can actually vectorize them despite the IEEE floating-point rules.

Although not the simplest code, it is more representative of a 'native' implementation when you are trying to get maximum speed without dropping all the way down to intrinsics like AVX.
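For readers skimming the diff, the core idea is the classic multi-accumulator unroll: a single f32 accumulator forms one long dependency chain that IEEE ordering rules forbid the compiler from reassociating, which blocks vectorization of the reduction. Splitting the sum across independent partial accumulators removes that chain. A minimal sketch of the trick (my illustration, not the exact code from this PR):

```rust
// Hypothetical helper illustrating the multi-accumulator unroll.
// Each of the 8 partial sums has its own dependency chain, so the
// compiler is free to map them onto SIMD lanes.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 8]; // 8 independent partial sums
    let chunks = a.len() / 8;
    for i in 0..chunks {
        for lane in 0..8 {
            let j = i * 8 + lane;
            acc[lane] += a[j] * b[j];
        }
    }
    // Handle the tail that doesn't fill a full block of 8.
    let mut tail = 0.0f32;
    for j in (chunks * 8)..a.len() {
        tail += a[j] * b[j];
    }
    acc.iter().sum::<f32>() + tail
}
```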

ashvardanian commented 2 months ago

Hi @ChillFish8! Thanks for your contribution! Indeed, your loop-unrolled variant is much faster than the naive Rust approaches, both the functional and the procedural code.

     Running rust/benches/cosine.rs (target/release/deps/cosine-e0cccefbe212a606)
Gnuplot not found, using plotters backend
SIMD Cosine/SimSIMD/0   time:   [91.178 ns 91.296 ns 91.444 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/Rust Procedural/0
                        time:   [793.02 ns 796.96 ns 802.25 ns]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
SIMD Cosine/Rust Functional/0
                        time:   [794.70 ns 797.24 ns 801.14 ns]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
SIMD Cosine/Rust Unrolled/0
                        time:   [208.64 ns 209.64 ns 211.12 ns]
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

I mostly work on recent CPUs, and on Intel Sapphire Rapids SimSIMD currently wins thanks to AVX-512 support. I wouldn't expect much difference for f32 on AVX2-only machines; for other types, it may be noticeable. Maybe it makes sense to add benchmarks for i8, as the wins can be very noticeable there 🤗
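For reference, a hedged sketch of what such an i8 benchmark could look like with Criterion, the harness the cosine.rs run above already uses. `dot_i8` here is a placeholder kernel for illustration, not SimSIMD's API:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder i8 kernel: products and sums fit comfortably in i32.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

fn bench_i8(c: &mut Criterion) {
    // 1536 dimensions, a common embedding size; values are arbitrary.
    let a = vec![1i8; 1536];
    let b = vec![2i8; 1536];
    c.bench_function("SIMD Cosine/Rust i8", |bch| {
        bch.iter(|| dot_i8(black_box(&a), black_box(&b)))
    });
}

criterion_group!(benches, bench_i8);
criterion_main!(benches);
```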

ashvardanian commented 2 months ago

🎉 This PR is included in version 4.3.0 🎉

The release is available as a GitHub release.

Your semantic-release bot 📦🚀

ChillFish8 commented 2 months ago

@ashvardanian Do you have a rough idea of the performance difference between Intel Sapphire Rapids AVX-512 and something like an AMD 7700 or an Epyc chip? Just curious, since I develop mostly on AMD CPUs, which makes it a bit difficult to predict how performance goes on Intel chips.

ashvardanian commented 2 months ago

@ChillFish8 on Zen4 most of AVX-512 is available, except for the FP16 extensions. Everything other than that should work great.

If you are on Zen3 or older, SimSIMD will use the F16C conversion extensions together with FMA. They are quite slow, but still much better than serial code for half-precision, as modern compilers can't handle that type well. For single-precision you may not see any gains on older CPUs.
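To make that fallback order concrete, here is a minimal sketch of runtime selection on x86_64 using the standard `is_x86_feature_detected!` macro. This is assumed logic for illustration, not SimSIMD's actual dispatcher:

```rust
// Hypothetical path selection mirroring the comment above: prefer AVX-512,
// fall back to F16C conversion + FMA, else plain scalar code.
#[cfg(target_arch = "x86_64")]
fn pick_f16_path() -> &'static str {
    if is_x86_feature_detected!("avx512f") {
        "avx512" // Sapphire Rapids, Zen4: wide registers available
    } else if is_x86_feature_detected!("f16c") && is_x86_feature_detected!("fma") {
        "f16c+fma" // Zen3 and older: convert f16 -> f32, then fused multiply-add
    } else {
        "serial" // scalar fallback
    }
}
```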

For int8, SimSIMD should work great on both old and new CPUs. That type is often used in heavily-quantized embedding models.
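As an illustration of why i8 is so portable: the whole kernel fits in integer arithmetic, which has no IEEE ordering constraints, so compilers vectorize it well even for older CPUs. A minimal sketch (my example, not SimSIMD's kernel):

```rust
// Cosine distance over i8 vectors, accumulating in i32.
// Sketch only: assumes equal lengths and non-zero vectors.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    let (mut dot, mut a2, mut b2) = (0i32, 0i32, 0i32);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as i32, y as i32);
        dot += x * y; // integer ops can be freely reordered and vectorized
        a2 += x * x;
        b2 += y * y;
    }
    1.0 - dot as f32 / ((a2 as f32).sqrt() * (b2 as f32).sqrt())
}
```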