ChillFish8 closed this pull request 2 months ago
Hi @ChillFish8! Thanks for your contribution! Indeed, your loop-unrolled variant is much faster than the naive Rust approaches, both the procedural and the functional one.
```
Running rust/benches/cosine.rs (target/release/deps/cosine-e0cccefbe212a606)
Gnuplot not found, using plotters backend
SIMD Cosine/SimSIMD/0   time:   [91.178 ns 91.296 ns 91.444 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  4 (4.00%) high severe
SIMD Cosine/Rust Procedural/0
                        time:   [793.02 ns 796.96 ns 802.25 ns]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
SIMD Cosine/Rust Functional/0
                        time:   [794.70 ns 797.24 ns 801.14 ns]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
SIMD Cosine/Rust Unrolled/0
                        time:   [208.64 ns 209.64 ns 211.12 ns]
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe
```
I am mostly working on recent CPUs, and on Intel Sapphire Rapids SimSIMD currently wins thanks to AVX-512 support. I wouldn't expect much difference for f32 on AVX2-only machines. For other types, it may be noticeable. Maybe it makes sense to add benchmarks for i8; the wins can be very noticeable there 🤗
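For context, an i8 kernel has to widen to i32 before accumulating, since 127 × 127 summed over a typical embedding dimension overflows the narrow types. A minimal serial sketch of what such a benchmark would measure (illustrative only; SimSIMD's actual i8 kernels use SIMD intrinsics):

```rust
// Cosine similarity over i8 vectors, accumulating in i32 to avoid
// overflow: 127 * 127 * dims easily exceeds the i8/i16 range.
// Hypothetical sketch, not SimSIMD's implementation.
fn cosine_i8(a: &[i8], b: &[i8]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut norm_a, mut norm_b) = (0i32, 0i32, 0i32);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as i32, y as i32); // widen before multiplying
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot as f32 / ((norm_a as f32).sqrt() * (norm_b as f32).sqrt())
}
```

On AVX-512 VNNI hardware this inner loop maps onto instructions like `VPDPBUSD`, which is where the large i8 speedups come from.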
:tada: This PR is included in version 4.3.0 :tada:
The release is available on GitHub release
Your semantic-release bot :package::rocket:
@ashvardanian Do you have a rough idea of the performance difference between Intel Sapphire Rapids with AVX-512 and something like an AMD 7700 or an Epyc chip? Just curious, since I develop mostly on AMD CPUs, which makes it a bit difficult to predict how performance will go on Intel chips.
@ChillFish8 on Zen4 most of AVX-512 is available, except for the FP16 extensions. Everything other than that should work great.
If you are on Zen3 or older, SimSIMD will use F16C extensions for FMA. They are quite slow, but still much better than serial code for half-precision, as modern compilers can't handle that type well. For single-precision you may not get any gains on older CPUs.
For int8, SimSIMD should work great on both old and new CPUs. That type is often used in heavily-quantized embedding models.
Related to #107
Optimizes the native implementation so that the compiler can actually vectorize it despite IEEE 754 floating-point rules.
Although not the simplest, it is a more realistic 'native' baseline to compare against if you are trying to get maximum speed by going down to intrinsic instructions like AVX.
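The trick is to use several independent accumulators: IEEE 754 addition is not associative, so with a single accumulator the compiler must preserve the serial dependency chain and cannot auto-vectorize the reduction; splitting the sum into independent partial sums removes that constraint. A minimal sketch of the idea (assumptions: four lanes, f32 inputs; not the exact code from this PR):

```rust
// Loop-unrolled cosine similarity with four independent accumulators per
// reduction. The independent partial sums break the floating-point
// dependency chain, letting the compiler auto-vectorize. Illustrative
// sketch, not the PR's exact implementation.
fn cosine_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut dot = [0.0f32; 4];
    let mut norm_a = [0.0f32; 4];
    let mut norm_b = [0.0f32; 4];

    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let x = a[i * 4 + lane];
            let y = b[i * 4 + lane];
            dot[lane] += x * y;
            norm_a[lane] += x * x;
            norm_b[lane] += y * y;
        }
    }
    // Fold the remainder (length not divisible by 4) into lane 0.
    for i in chunks * 4..a.len() {
        dot[0] += a[i] * b[i];
        norm_a[0] += a[i] * a[i];
        norm_b[0] += b[i] * b[i];
    }

    let dot: f32 = dot.iter().sum();
    let norm_a: f32 = norm_a.iter().sum();
    let norm_b: f32 = norm_b.iter().sum();
    dot / (norm_a.sqrt() * norm_b.sqrt())
}
```

Note that this changes the summation order, so results can differ from the serial version by a few ULPs; for similarity search that is normally acceptable, which is why the compiler needs the programmer to opt in like this.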