Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for AVX2, AVX-512, NEON, SVE, & SVE2 📐
Old Kernels, New Types: `u8`
On Intel Sapphire Rapids, for `l2sq`, the throughput grows from 21 GB/s to 66 GB/s with the AVX2 kernels for Haswell, and to 94 GB/s with the AVX-512 kernels for Ice Lake and newer CPUs. The same kernels were also benchmarked on Apple M2 Pro.
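Regardless of the platform, the new `u8` kernels are reached through the same C interface as the other types. Below is a minimal sketch, assuming the type-specific dispatch helper `simsimd_l2sq_u8` and the `simsimd_u8_t` / `simsimd_distance_t` aliases exposed by `simsimd/simsimd.h`; exact entry-point names may differ between versions:

```c
#include <stdio.h>
#include <simsimd/simsimd.h>

int main(void) {
    // Two 256-dimensional unsigned 8-bit vectors, e.g. quantized embeddings.
    simsimd_u8_t a[256], b[256];
    for (int i = 0; i < 256; ++i) {
        a[i] = (simsimd_u8_t)i;
        b[i] = (simsimd_u8_t)(255 - i);
    }

    // Assumed dispatch helper: picks the fastest compiled kernel at runtime,
    // e.g. serial, AVX2 (Haswell), AVX-512 (Ice Lake+), or NEON on Arm.
    simsimd_distance_t distance;
    simsimd_l2sq_u8(a, b, 256, &distance);
    printf("l2sq(a, b) = %f\n", distance);
    return 0;
}
```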
L2 vs L2sq
How fast can we compute the Euclidean distance in $\mathbb{R}^3$?
Is it much slower than computing the squared Euclidean distance? The answer: easily 30% slower.
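For two points $a, b \in \mathbb{R}^3$, the two metrics differ only in the final square root:

$$
d_{L2}(a, b) = \sqrt{\sum_{i=1}^{3} (a_i - b_i)^2}, \qquad d_{L2sq}(a, b) = \sum_{i=1}^{3} (a_i - b_i)^2
$$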
The cost of the square root computation can be prohibitively high on low-dimensional vectors, so it's recommended to use L2sq wherever the exact distance isn't necessary. Below are the numbers for 3D vectors on Intel Sapphire Rapids. Even on such tiny vectors, for `bf16`, for example, the Genoa (AVX-512 BF16) kernels are over 4x faster than the serial code: 31 GB/s vs 7 GB/s.
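In code, the trade-off is just one call swapped for another. Below is a minimal sketch, assuming the paired `simsimd_l2_f32` / `simsimd_l2sq_f32` entry points (using `f32` for brevity; the `bf16` variants follow the same pattern):

```c
#include <math.h>
#include <stdio.h>
#include <simsimd/simsimd.h>

int main(void) {
    // A pair of 3D points, as in the benchmark above.
    simsimd_f32_t a[3] = {1.0f, 2.0f, 3.0f};
    simsimd_f32_t b[3] = {4.0f, 6.0f, 8.0f};

    simsimd_distance_t l2, l2sq;
    simsimd_l2_f32(a, b, 3, &l2);     // assumed name: Euclidean distance, with the square root
    simsimd_l2sq_f32(a, b, 3, &l2sq); // assumed name: squared Euclidean distance, without it

    printf("L2 = %.3f, sqrt(L2sq) = %.3f\n", l2, sqrt(l2sq));
    return 0;
}
```

Since the square root is monotonic, L2sq preserves the nearest-neighbor ordering, which is why it's a safe default for ranking.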