Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both x86 (AVX2, AVX-512) and Arm (NEON, SVE, SVE2) 📐
SimSIMD is expanding and growing closer to a fully-fledged BLAS library. It covers BLAS level 1 for now, but it's a start! SimSIMD will prioritize mixed- and low-precision vector math, favoring modern AI workloads. For image & media processing workloads, the new `fma` and `wsum` kernels approach 65 GB/s per core on Intel Sapphire Rapids. That's 100x faster than serial code for `u8` inputs with `f32` scaling and accumulation.
Contains the following element-wise operations:
In NumPy terms:
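A minimal NumPy sketch of those semantics, assuming `f32` accumulation with a cast back to the input type; the helper names and signatures here are illustrative, not the SimSIMD API:

```py
import numpy as np

def wsum(a: np.ndarray, b: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    # Weighted sum: alpha * a + beta * b, accumulated in f32, cast back to the input dtype
    acc = alpha * a.astype(np.float32) + beta * b.astype(np.float32)
    return acc.astype(a.dtype)

def fma(a: np.ndarray, b: np.ndarray, c: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    # Fused multiply-add: alpha * a * b + beta * c, same f32 accumulation and cast
    acc = alpha * a.astype(np.float32) * b.astype(np.float32) + beta * c.astype(np.float32)
    return acc.astype(a.dtype)
```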
This tiny set of operations is enough to implement a wide range of algorithms, like vector scaling, linear interpolation, and image alpha-blending; see the sketch below.
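For instance, reusing the hypothetical `wsum`/`fma` helpers above on `u8` image rows:

```py
a = np.random.randint(0, 256, 1_000, dtype=np.uint8)  # e.g. one image row
b = np.random.randint(0, 256, 1_000, dtype=np.uint8)  # another image row

scaled  = wsum(a, a, alpha=0.25, beta=0.25)      # scale a vector: 0.5 * a
blended = wsum(a, b, alpha=0.7, beta=0.3)        # alpha-blend two images
squared = fma(a, a, a, alpha=1 / 255, beta=0.0)  # element-wise square, rescaled to the u8 range
```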
Benchmarks
On Intel Sapphire Rapids: