Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
This commit adds new capability levels for Arm, allowing us to differentiate the `f16`-, `bf16`-, and `i8`-supporting generations of CPUs, which are becoming increasingly popular in the datacenter. Similar to the speedups on AMD Genoa, the `bf16` kernels perform very well on Arm Graviton3, reaching 33 GB/s as opposed to 19 GB/s for `f16`.
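For context, here is a minimal sketch of how such capability levels can be detected at runtime, assuming a Linux host on AArch64 and querying the kernel's hardware-capability bits via `getauxval`; the `capability_t` enum and the `detect_arm_capabilities` helper are illustrative names, not the library's actual API:

```c
#include <stdio.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h> // getauxval, AT_HWCAP, AT_HWCAP2

// Bit values from the Linux arm64 uapi headers, redefined here only as a
// fallback for older toolchain headers:
#ifndef HWCAP_ASIMDHP
#define HWCAP_ASIMDHP (1 << 10) // `f16` SIMD arithmetic
#endif
#ifndef HWCAP2_I8MM
#define HWCAP2_I8MM (1 << 13) // `i8` matrix multiply (SMMLA, USDOT)
#endif
#ifndef HWCAP2_BF16
#define HWCAP2_BF16 (1 << 14) // `bf16` dot products (BFDOT, BFMMLA)
#endif
#endif

// Hypothetical capability levels, mirroring the idea of the commit but
// not its actual enum names:
typedef enum {
    capability_serial_k = 0,
    capability_neon_f16_k = 1 << 0,
    capability_neon_bf16_k = 1 << 1,
    capability_neon_i8_k = 1 << 2,
} capability_t;

capability_t detect_arm_capabilities(void) {
    int caps = capability_serial_k;
#if defined(__aarch64__) && defined(__linux__)
    unsigned long hwcap = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    if (hwcap & HWCAP_ASIMDHP) caps |= capability_neon_f16_k;
    if (hwcap2 & HWCAP2_BF16) caps |= capability_neon_bf16_k;
    if (hwcap2 & HWCAP2_I8MM) caps |= capability_neon_i8_k;
#endif
    return (capability_t)caps;
}

int main(void) {
    capability_t caps = detect_arm_capabilities();
    printf("f16: %d, bf16: %d, i8mm: %d\n",
           !!(caps & capability_neon_f16_k),
           !!(caps & capability_neon_bf16_k),
           !!(caps & capability_neon_i8_k));
    return 0;
}
```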
Research MMLA Extensions

Arm supports 2x2 matrix multiplications for `i8` and `bf16`. All of our initial attempts with @eknag to use them for faster cosine computations over vectors of different lengths have failed; a sketch of the underlying idea follows after the measurements below. Old measurements:
Attempts with `i8` for vectors of different dimensionality:
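For illustration, here is a minimal sketch of the idea behind those attempts for the `i8` case, assuming a CPU with the `i8mm` extension (compiled with something like `-march=armv8.6-a+i8mm`). The SMMLA instruction, exposed as the `vmmlaq_s32` intrinsic, multiplies a 2x8 matrix by an 8x2 matrix and accumulates a 2x2 result, so feeding the same pair of 8-byte chunks into both operands yields `dot(a, a)`, `dot(a, b)`, and `dot(b, b)` in a single instruction, which is exactly what cosine needs. The function name and the omitted tail handling are hypothetical; this is not the kernel from the commit:

```c
#include <arm_neon.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
// Packing the same two 8-byte chunks as the rows of both SMMLA operands
// turns the 2x2 matrix product into a Gram matrix:
//   lane 0: dot(a, a), lane 1: dot(a, b),
//   lane 2: dot(b, a), lane 3: dot(b, b).
float cosine_i8_mmla(int8_t const *a, int8_t const *b, size_t n) {
    int32x4_t gram = vdupq_n_s32(0);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        int8x16_t rows = vcombine_s8(vld1_s8(a + i), vld1_s8(b + i));
        gram = vmmlaq_s32(gram, rows, rows);
    }
    // Tail handling for `n % 8 != 0` is omitted in this sketch.
    int32_t aa = vgetq_lane_s32(gram, 0);
    int32_t ab = vgetq_lane_s32(gram, 1);
    int32_t bb = vgetq_lane_s32(gram, 3);
    float denominator = sqrtf((float)aa * (float)bb);
    return denominator != 0.f ? (float)ab / denominator : 0.f;
}
#endif
```

One plausible reason such kernels disappoint is that the fourth accumulator lane, `dot(b, a)`, merely duplicates `dot(a, b)`, so a quarter of the multiply throughput is wasted compared to maintaining three plain `SDOT` accumulators.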