ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
988 stars 59 forks source link

New `bf16` capabilities on Arm 🦾 #147

Closed ashvardanian closed 3 months ago

ashvardanian commented 4 months ago

This commit add new capability levels for Arm allowing us to differentiate f16, bf16. and i8-supporting generations of CPUs, becoming increasingly popular in the datacenter. Similar to speedups on AMD Genoa, on Arm Graviton3 the bf16 kernels perform very well:

dot_bf16_neon_1536d/min_time:10.000/threads:1            183 ns          183 ns     76204478 abs_delta=0 bytes=33.5194G/s pairs=5.45563M/s relative_error=0
cos_bf16_neon_1536d/min_time:10.000/threads:1            239 ns          239 ns     58180403 abs_delta=0 bytes=25.7056G/s pairs=4.18386M/s relative_error=0
l2sq_bf16_neon_1536d/min_time:10.000/threads:1           312 ns          312 ns     43724273 abs_delta=0 bytes=19.7064G/s pairs=3.20742M/s relative_error=0

The bf16 kernels reach 33 GB/s as opposed to 19 GB/s for f16:

dot_f16_neon_1536d/min_time:10.000/threads:1             323 ns          323 ns     43311367 abs_delta=82.3015n bytes=19.0324G/s pairs=3.09772M/s relative_error=109.717n
cos_f16_neon_1536d/min_time:10.000/threads:1             367 ns          367 ns     38007895 abs_delta=1.5456m bytes=16.7349G/s pairs=2.72377M/s relative_error=6.19568m
l2sq_f16_neon_1536d/min_time:10.000/threads:1            341 ns          341 ns     41010555 abs_delta=66.7783n bytes=18.0436G/s pairs=2.93679M/s relative_error=133.449n

Research MMLA Extensions

Arm supports 2x2 matrix multiplications for i8 and bf16. All of our initial attempts with @eknag to use them for faster cosine computations for different length vectors have failed. Old measurements:

cos_i8_neon_16d/min_time:10.000/threads:1       5.41 ns         5.41 ns   1000000000 abs_delta=910.184u bytes=5.91441G/s pairs=184.825M/s relative_error=4.20295m
cos_i8_neon_64d/min_time:10.000/threads:1       7.63 ns         7.63 ns   1000000000 abs_delta=939.825u bytes=16.7729G/s pairs=131.039M/s relative_error=3.82144m
cos_i8_neon_1536d/min_time:10.000/threads:1        101 ns          101 ns    139085845 abs_delta=917.35u bytes=30.394G/s pairs=9.89387M/s relative_error=3.63925m

Attempts with i8 for different dimensionality vectors:

cos_i8_neon_16d/min_time:10.000/threads:1       5.72 ns         5.72 ns   1000000000 abs_delta=0.282084 bytes=5.59562G/s pairs=174.863M/s relative_error=1.15086
cos_i8_neon_64d/min_time:10.000/threads:1       8.40 ns         8.40 ns   1000000000 abs_delta=0.234385 bytes=15.2345G/s pairs=119.02M/s relative_error=0.923009
cos_i8_neon_1536d/min_time:10.000/threads:1        117 ns          117 ns    118998604 abs_delta=0.23264 bytes=26.2707G/s pairs=8.55167M/s relative_error=0.920099