Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
This commit adds new capability levels for Arm, allowing us to differentiate the `f16`-, `bf16`-, and `i8`-supporting generations of CPUs, which are becoming increasingly popular in the datacenter. Similar to the speedups on AMD Genoa, the `bf16` kernels perform very well on Arm Graviton3, reaching 33 GB/s as opposed to 19 GB/s for `f16`.
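For readers curious what "capability levels" means in practice: on Linux, the kernel reports these Arm features through hwcaps, which is one way to tell the CPU generations apart at runtime. The sketch below is illustrative only — the grouping and the names on the left are ours, not SimSIMD's — with bit positions taken from the kernel's AArch64 `asm/hwcap.h`:

```python
import ctypes
import platform

AT_HWCAP, AT_HWCAP2 = 16, 26  # getauxval(3) keys

# AArch64 hwcap bit positions, per the Linux kernel's asm/hwcap.h.
# Keys are illustrative labels, not SimSIMD capability names.
HWCAP_BITS = {
    "fp16_scalar": (AT_HWCAP, 1 << 9),    # FPHP: scalar f16 arithmetic
    "fp16_simd":   (AT_HWCAP, 1 << 10),   # ASIMDHP: NEON f16 arithmetic
    "i8_dotprod":  (AT_HWCAP, 1 << 20),   # ASIMDDP: i8 dot products (SDOT/UDOT)
    "sve":         (AT_HWCAP, 1 << 22),   # SVE
    "i8_matmul":   (AT_HWCAP2, 1 << 13),  # I8MM: i8 matrix multiply (SMMLA)
    "bf16":        (AT_HWCAP2, 1 << 14),  # BF16: bfloat16 (BFDOT/BFMMLA)
}

def arm_capabilities():
    """Probe AArch64 feature bits via glibc's getauxval; report all-False
    on non-Arm or non-Linux hosts instead of raising."""
    caps = {name: False for name in HWCAP_BITS}
    if platform.machine() != "aarch64":
        return caps
    try:
        getauxval = ctypes.CDLL(None).getauxval
    except (OSError, AttributeError):
        return caps  # no getauxval (e.g. non-glibc platform)
    getauxval.restype = ctypes.c_ulong
    getauxval.argtypes = [ctypes.c_ulong]
    for name, (vector, mask) in HWCAP_BITS.items():
        caps[name] = bool(getauxval(vector) & mask)
    return caps
```

On a Graviton3, all six flags should come back true; on older Graviton2-class cores, the `i8_matmul` and `bf16` bits are absent.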
Research MMLA Extensions
Arm supports 2x2 matrix multiplications for `i8` and `bf16`. All of our initial attempts with @eknag to use them for faster cosine computations for different-length vectors have failed. Old measurements:

Attempts with `i8` for different-dimensionality vectors:
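To see why the fit is awkward, recall what one `i8` MMLA step (SMMLA) computes: a 2x2 tile of `i32` accumulators from two 2x8 blocks of `i8` values, C += A·Bᵀ. The scalar model below, with a hypothetical batching scheme (the names and the scheme are ours, not SimSIMD's), shows that a single dot product leaves most of the tile idle — even batching one query against two vectors uses only one accumulator row:

```python
def smmla(acc, a, b):
    """Scalar model of one AArch64 SMMLA step: acc (2x2 int32) +=
    a (2x8 int8) times the transpose of b (also stored as 2x8 int8)."""
    for i in range(2):
        for j in range(2):
            acc[i][j] += sum(a[i][k] * b[j][k] for k in range(8))
    return acc

def two_dot_products(query, x, y):
    """Compute query.x and query.y with SMMLA steps: both rows of the
    'a' operand carry the same query chunk, rows of 'b' carry x and y.
    Rows 0 and 1 of the accumulator end up identical, so half the 2x2
    tile is wasted even when batching two vectors -- one reason plain
    dot-product kernels kept winning in our measurements."""
    assert len(query) == len(x) == len(y) and len(query) % 8 == 0
    acc = [[0, 0], [0, 0]]
    for off in range(0, len(query), 8):
        a = [query[off:off + 8], query[off:off + 8]]
        b = [x[off:off + 8], y[off:off + 8]]
        smmla(acc, a, b)
    return acc[0][0], acc[0][1]
```

The tile is only fully used when multiplying two independent queries against two independent vectors at once, which does not match the one-query-many-vectors shape of a typical cosine-similarity kernel.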