Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
Binary representations are becoming increasingly popular in Machine Learning, and I'd love to explore the opportunity for faster Hamming and Jaccard distance calculations. I've looked into several benchmarks, most importantly the WojciechMula/sse-popcount library, which compares several optimizations for population counts — the most expensive part of the Hamming/Jaccard kernel.
Extensive benchmarks and the design itself suggest that the AVX-512 Harley-Seal variant should be the fastest on long inputs beyond 1 KB. Here is a sample of the most recent results, obtained on an Intel Core i3 Cannon Lake CPU:
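For context, the scalar baseline for these kernels is just a population count over the `xor` (Hamming) or `and`/`or` (Jaccard) of the packed words. A minimal sketch — function names are mine, not SimSIMD's:

```c
#include <stdint.h>
#include <stddef.h>

// Hamming distance between two bit-vectors packed into 64-bit words:
// xor exposes the differing bits, popcount tallies them.
uint64_t hamming_serial(uint64_t const *a, uint64_t const *b, size_t n_words) {
    uint64_t distance = 0;
    for (size_t i = 0; i != n_words; ++i)
        distance += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return distance;
}

// Jaccard distance follows the same pattern with two counters:
// popcounts of the intersection (and) and the union (or) of set bits.
double jaccard_serial(uint64_t const *a, uint64_t const *b, size_t n_words) {
    uint64_t intersection = 0, union_ = 0;
    for (size_t i = 0; i != n_words; ++i) {
        intersection += (uint64_t)__builtin_popcountll(a[i] & b[i]);
        union_ += (uint64_t)__builtin_popcountll(a[i] | b[i]);
    }
    return union_ ? 1.0 - (double)intersection / (double)union_ : 0.0;
}
```

Everything below is about making that inner popcount loop faster.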
| procedure               | 32 B    | 64 B    | 128 B   | 256 B   | 512 B   | 1024 B  | 2048 B  | 4096 B  |
|-------------------------|---------|---------|---------|---------|---------|---------|---------|---------|
| lookup-8                | 1.19464 | 1.09949 | 1.21245 | 1.11428 | 1.69827 | 1.65605 | 1.63299 | 1.62148 |
| lookup-64               | 1.16739 | 1.09284 | 1.19636 | 1.10018 | 1.69524 | 1.65319 | 1.63670 | 1.62359 |
| harley-seal             | 1.00883 | 0.82805 | 0.51017 | 0.39659 | 0.54067 | 0.49312 | 0.46917 | 0.45787 |
| avx2-lookup             | 0.45543 | 0.28456 | 0.20674 | 0.14150 | 0.18920 | 0.16951 | 0.15977 | 0.15527 |
| avx2-lookup-original    | 1.53184 | 0.90269 | 0.61849 | 0.41858 | 0.34503 | 0.32416 | 0.23073 | 0.25976 |
| avx2-harley-seal        | 1.03679 | 0.59198 | 0.37492 | 0.26418 | 0.20457 | 0.15556 | 0.13097 | 0.11904 |
| avx512-harley-seal      | 3.36585 | 0.71542 | 0.40990 | 0.26028 | 0.29072 | 0.10719 | 0.07310 | 0.05560 |
| avx512bw-shuf           | 2.56808 | 1.99008 | 1.04359 | 0.55736 | 0.48551 | 0.25119 | 0.20256 | 0.15851 |
| avx512vbmi-shuf         | 2.51702 | 1.99085 | 1.09241 | 0.54717 | 0.49385 | 0.25181 | 0.20032 | 0.15249 |
| builtin-popcnt          | 0.22182 | 0.28289 | 0.26755 | 0.31640 | 0.39424 | 0.38940 | 0.36062 | 0.33525 |
| builtin-popcnt32        | 0.46220 | 0.46701 | 0.51513 | 0.59160 | 0.89925 | 0.85613 | 0.84084 | 0.84065 |
| builtin-popcnt-unrolled | 0.25161 | 0.17290 | 0.14147 | 0.12966 | 0.20433 | 0.22086 | 0.20939 | 0.20628 |
| builtin-popcnt-movdq    | 0.21983 | 0.18868 | 0.17849 | 0.18037 | 0.34305 | 0.31526 | 0.29713 | 0.29047 |
I've tried copying the best solution into the SimSIMD benchmarking suite and sadly didn't achieve similar improvements on more recent CPUs. On Intel Sapphire Rapids CPUs:
On AMD Genoa:
The serial kernel relies on `_mm_popcnt_u64`, the AVX-512 kernel on `_mm512_popcnt_epi64`, and `icehs` is an adaptation of the Harley-Seal transform that "zip"-s two input streams with `xor`.
To reproduce the results:
Please let me know if there is a better way to accelerate this kernel 🤗