ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
794 stars 42 forks source link

Harley Seal AVX-512 implementations #138

Open ashvardanian opened 3 weeks ago

ashvardanian commented 3 weeks ago

Binary representations are becoming increasingly popular in Machine Learning and I'd love to explore the opportunity for faster Hamming and Jaccard distance calculations. I've looked into several benchmarks, most importantly the WojciechMula/sse-popcount library, that compares several optimizations for population-counts -the most expensive part of the Hamming/Jaccard kernel.


Extensive benchmarks and the design itself suggest that AVX-512 Harley Seal variant should be the fastest on long inputs beyond 1 KB. Here is a sample of the most recent results obtained on an i3 Cannonlake Intel CPU:

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.19464 1.09949 1.21245 1.11428 1.69827 1.65605 1.63299 1.62148
lookup-64 1.16739 1.09284 1.19636 1.10018 1.69524 1.65319 1.63670 1.62359
harley-seal 1.00883 0.82805 0.51017 0.39659 0.54067 0.49312 0.46917 0.45787
avx2-lookup 0.45543 0.28456 0.20674 0.14150 0.18920 0.16951 0.15977 0.15527
avx2-lookup-original 1.53184 0.90269 0.61849 0.41858 0.34503 0.32416 0.23073 0.25976
avx2-harley-seal 1.03679 0.59198 0.37492 0.26418 0.20457 0.15556 0.13097 0.11904
avx512-harley-seal 3.36585 0.71542 0.40990 0.26028 0.29072 0.10719 0.07310 0.05560
avx512bw-shuf 2.56808 1.99008 1.04359 0.55736 0.48551 0.25119 0.20256 0.15851
avx512vbmi-shuf 2.51702 1.99085 1.09241 0.54717 0.49385 0.25181 0.20032 0.15249
builtin-popcnt 0.22182 0.28289 0.26755 0.31640 0.39424 0.38940 0.36062 0.33525
builtin-popcnt32 0.46220 0.46701 0.51513 0.59160 0.89925 0.85613 0.84084 0.84065
builtin-popcnt-unrolled 0.25161 0.17290 0.14147 0.12966 0.20433 0.22086 0.20939 0.20628
builtin-popcnt-movdq 0.21983 0.18868 0.17849 0.18037 0.34305 0.31526 0.29713 0.29047

I've tried copying the best solution into SimSIMD benchmarking suite and sadly didn't achieve similar improvements on more recent CPUs. On Intel Sapphire Rapids CPUs:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       50.3 ns         50.3 ns    277340752 abs_delta=0 bytes=162.807G/s pairs=19.8739M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           34.8 ns         34.8 ns    402233197 abs_delta=0 bytes=235.632G/s pairs=28.7636M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         42.4 ns         42.4 ns    330077077 abs_delta=0 bytes=193.07G/s pairs=23.5681M/s relative_error=0

On AMD Genoa:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       40.5 ns         40.5 ns    346163289 abs_delta=0 bytes=202.502G/s pairs=24.7195M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           40.6 ns         40.6 ns    344646420 abs_delta=0 bytes=201.733G/s pairs=24.6257M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         59.8 ns         59.8 ns    234058579 abs_delta=0 bytes=136.96G/s pairs=16.7188M/s relative_error=0

To reproduce the results:

cmake -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_TESTS=1 -DSIMSIMD_BUILD_BENCHMARKS=1 -DSIMSIMD_BUILD_BENCHMARKS_WITH_CBLAS=1 -B build_release
cmake --build build_release --config Release && build_release/simsimd_bench --benchmark_filter="hamming(.*)4096b"

Please let me know if there is a better way to accelerate this kernel 🤗