Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
The "brain-float-16" is a popular machine learning format. It's broadly supported in hardware and is very machine-friendly, but software support is still lagging behind - https://github.com/numpy/numpy/issues/19808. Most importantly, low-precision bf16 dot-products are supported by the most recent Zen4-based AMD Genoa CPUs. Those have up-to 96 cores, and just one of those cores is capable of computing 86 GB/s worth of such dot-products.
That's a steep 3x improvement over the single-precision FMA throughput we can obtain by simply shifting bf16 left by 16 bits and using the _mm256_fmadd_ps intrinsic (the vfmadd instruction), available since Intel Haswell.
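To illustrate that fallback path, here is a minimal sketch of the bit-shift trick, assuming bf16 values stored as raw uint16_t and a length divisible by 8; the helper and function names are hypothetical and not part of the released kernels:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Widen 8 consecutive bf16 values (raw uint16_t bit patterns) to f32 by
// placing each 16-bit pattern into the upper half of a 32-bit lane,
// i.e. a logical shift left by 16 bits.
static inline __m256 bf16x8_to_f32x8(uint16_t const *src) {
    __m128i raw = _mm_loadu_si128((__m128i const *)src);    // 8 x u16
    __m256i wide = _mm256_cvtepu16_epi32(raw);               // zero-extend to 8 x u32
    return _mm256_castsi256_ps(_mm256_slli_epi32(wide, 16)); // reinterpret as 8 x f32
}

// Hypothetical bf16 dot-product built on Haswell-era FMA, no AVX-512 needed.
float bf16_dot_haswell(uint16_t const *a, uint16_t const *b, size_t n) {
    __m256 sum = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        sum = _mm256_fmadd_ps(bf16x8_to_f32x8(a + i), bf16x8_to_f32x8(b + i), sum);
    // Horizontal reduction of the 8 partial sums.
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(sum), _mm256_extractf128_ps(sum, 1));
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```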
Faster i8 kernels
We can't directly use _mm512_dpbusd_epi32 every time we want to compute a low-precision integer dot-product, as it's asymmetric with respect to the sign of the input arguments: it treats the first operand as unsigned 8-bit integers and the second as signed 8-bit integers.
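A small self-contained illustration of that asymmetry (requires an AVX-512 VNNI machine; this demo is not taken from the library itself):

```c
#include <immintrin.h>
#include <stdio.h>

// VPDPBUSD treats the first vector operand as unsigned bytes and the
// second as signed bytes, so a signed-by-signed dot-product goes wrong
// whenever the first vector has negative components.
int main(void) {
    __m512i acc = _mm512_setzero_si512();
    __m512i a = _mm512_set1_epi8(-1); // intended as i8 -1, but read as u8 255
    __m512i b = _mm512_set1_epi8(+1); // read as i8 +1, as intended
    acc = _mm512_dpbusd_epi32(acc, a, b);
    // Each 32-bit lane accumulates 4 byte products: a signed-signed dot
    // product would give 4 * (-1) = -4, but VPDPBUSD yields 4 * 255 = 1020.
    printf("%d\n", _mm_cvtsi128_si32(_mm512_castsi512_si128(acc)));
    return 0;
}
```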
In the past we would just upcast to 16-bit integers and resort to _mm512_dpwssds_epi32. It is a much more costly multiplication circuit, and, assuming we avoid loop unrolling, it also implies 2x fewer scalars per loop. But for cosine distances there is something simple we can do: when we multiply a vector by itself, even if a certain component is negative, its square is always positive. So we can avoid the expensive 16-bit operation at least where we compute the vector norms, as sketched below:
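One way to exploit this, sketched here under stated assumptions rather than copied from the released kernel: since |a[i]| * |a[i]| equals a[i]^2, and |a[i]| is valid both as an unsigned and as a signed byte when the elements stay in [-127, 127], the asymmetric instruction can still square the vector. The function name is hypothetical, and the length is assumed to be divisible by 64.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical squared-norm kernel: |a[i]| * |a[i]| == a[i] * a[i], and
// |a[i]| fits both the unsigned and the signed byte operand of VPDPBUSD,
// so no 16-bit upcast is needed for the norm part of the cosine distance.
int32_t i8_squared_norm_vnni(int8_t const *a, size_t n) {
    __m512i sum = _mm512_setzero_si512();
    for (size_t i = 0; i != n; i += 64) {
        __m512i a_abs = _mm512_abs_epi8(_mm512_loadu_si512(a + i));
        sum = _mm512_dpbusd_epi32(sum, a_abs, a_abs); // 4-way u8 x i8 dot-products
    }
    return _mm512_reduce_add_epi32(sum);
}
```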
On Intel Sapphire Rapids this trick resulted in higher single-thread utilization, but didn't lead to improvements on other platforms.
New timings: