ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Faster binary distances on Arm NEON & SVE #212

Closed by ashvardanian 1 month ago

ashvardanian commented 1 month ago

The new Jaccard distance implementation went from 22 to 30 GB/s on Graviton 4, and Hamming now reaches 31 GB/s. The same trick was also applied to the SVE kernels, but with the 128-bit registers on Graviton 4 they turn out to be slower than the NEON ones.
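
For readers unfamiliar with the two metrics, here is a minimal reference sketch of Hamming and Jaccard distances over bit-packed vectors. It is not the NEON or SVE kernel from this PR, only a plain definition in Python. The function names are hypothetical, and returning 0.0 for two all-zero vectors is an assumption that the library may handle differently.

```python
def popcount_bytes(data: bytes) -> int:
    """Count the set bits across a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def hamming_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions in which the two packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 exactly where the inputs disagree.
    return popcount_bytes(bytes(x ^ y for x, y in zip(a, b)))


def jaccard_bits(a: bytes, b: bytes) -> float:
    """Jaccard distance: 1 - |a AND b| / |a OR b| over the set bits."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    intersection = popcount_bytes(bytes(x & y for x, y in zip(a, b)))
    union = popcount_bytes(bytes(x | y for x, y in zip(a, b)))
    # Assumption: two all-zero vectors count as identical (distance 0.0).
    return 1.0 - intersection / union if union else 0.0
```

Both metrics reduce to bitwise operations followed by population counts. Jaccard needs two population counts per word where Hamming needs one, which fits the serial Jaccard baseline being the slower of the two.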

hamming_b8_neon<1536d>/min_time:10.000/threads:1          100 ns          100 ns    139088302 abs_delta=0 bytes=30.6314G/s pairs=9.97116M/s relative_error=0
jaccard_b8_neon<1536d>/min_time:10.000/threads:1         93.5 ns         93.5 ns    146740193 abs_delta=0 bytes=32.8661G/s pairs=10.6986M/s relative_error=0
hamming_b8_sve<1536d>/min_time:10.000/threads:1           117 ns          117 ns    119884598 abs_delta=0 bytes=26.3256G/s pairs=8.56953M/s relative_error=0
jaccard_b8_sve<1536d>/min_time:10.000/threads:1           134 ns          134 ns    104982386 abs_delta=0 bytes=22.9451G/s pairs=7.46911M/s relative_error=0
hamming_b8_serial<1536d>/min_time:10.000/threads:1        465 ns          465 ns     30104101 abs_delta=0 bytes=6.59952G/s pairs=2.14828M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1        824 ns          824 ns     16925041 abs_delta=0 bytes=3.72663G/s pairs=1.2131M/s relative_error=0
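
The counters above can be cross-checked with a little arithmetic. The sketch below assumes each vector occupies 1536 bytes and that `bytes` counts both inputs of a pair. That assumption is inferred from the reported numbers, not stated in the log, and the helper names are hypothetical.

```python
BYTES_PER_VECTOR = 1536  # assumption inferred from the bytes and pairs counters


def bytes_per_second(pairs_per_second: float,
                     bytes_per_vector: int = BYTES_PER_VECTOR) -> float:
    """Throughput in bytes/s when every pair reads two input vectors."""
    return pairs_per_second * 2 * bytes_per_vector


def speedup(baseline_ns: float, optimized_ns: float) -> float:
    """How many times faster the optimized kernel is than the baseline."""
    return baseline_ns / optimized_ns


# Values copied from the benchmark output above.
print(bytes_per_second(9.97116e6) / 1e9)  # hamming_b8_neon, reported as 30.6314G/s
print(speedup(465, 100))                  # hamming: serial vs NEON
print(speedup(824, 93.5))                 # jaccard: serial vs NEON
```

Under that assumption the reported throughput matches the pair rate, and the NEON kernels come out roughly 4.7x (Hamming) and 8.8x (Jaccard) faster than the serial baselines.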