Harley Seal AVX-512 implementations

Binary representations are becoming increasingly popular in Machine Learning, and I'd love to explore the opportunity for faster Hamming and Jaccard distance calculations. I've looked into several benchmarks, most importantly the WojciechMula/sse-popcount library, which compares several optimizations for population counts, the most expensive part of the Hamming/Jaccard kernel.
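For context, here is a minimal scalar sketch of how both distances reduce to population counts. It is not the SimSIMD code: the function names are hypothetical, `__builtin_popcount` assumes GCC or Clang, and returning 0 for two all-zero inputs is just one possible convention.

```c
#include <stddef.h>
#include <stdint.h>

/* Hamming distance: the number of differing bits, i.e. popcount(a XOR b). */
size_t hamming_bits(const uint8_t *a, const uint8_t *b, size_t n_bytes) {
    size_t differences = 0;
    for (size_t i = 0; i != n_bytes; ++i)
        differences += (size_t)__builtin_popcount((unsigned)(a[i] ^ b[i]));
    return differences;
}

/* Jaccard distance: 1 - popcount(a AND b) / popcount(a OR b).
 * Assumption: two all-zero inputs are treated as identical (distance 0). */
double jaccard_bits(const uint8_t *a, const uint8_t *b, size_t n_bytes) {
    size_t intersection = 0, union_size = 0;
    for (size_t i = 0; i != n_bytes; ++i) {
        intersection += (size_t)__builtin_popcount((unsigned)(a[i] & b[i]));
        union_size += (size_t)__builtin_popcount((unsigned)(a[i] | b[i]));
    }
    return union_size ? 1.0 - (double)intersection / (double)union_size : 0.0;
}
```

Either way, the inner loop is one bitwise operation followed by a popcount, so the popcount strategy dominates the cost.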

Extensive benchmarks and the design itself suggest that the AVX-512 Harley Seal variant should be the fastest on long inputs of 1 KB and more. Here is a sample of the most recent results obtained on an Intel Core i3 Cannon Lake CPU:

procedure	32 B	64 B	128 B	256 B	512 B	1024 B	2048 B	4096 B
lookup-8	1.19464	1.09949	1.21245	1.11428	1.69827	1.65605	1.63299	1.62148
lookup-64	1.16739	1.09284	1.19636	1.10018	1.69524	1.65319	1.63670	1.62359
harley-seal	1.00883	0.82805	0.51017	0.39659	0.54067	0.49312	0.46917	0.45787
avx2-lookup	0.45543	0.28456	0.20674	0.14150	0.18920	0.16951	0.15977	0.15527
avx2-lookup-original	1.53184	0.90269	0.61849	0.41858	0.34503	0.32416	0.23073	0.25976
avx2-harley-seal	1.03679	0.59198	0.37492	0.26418	0.20457	0.15556	0.13097	0.11904
avx512-harley-seal	3.36585	0.71542	0.40990	0.26028	0.29072	0.10719	0.07310	0.05560
avx512bw-shuf	2.56808	1.99008	1.04359	0.55736	0.48551	0.25119	0.20256	0.15851
avx512vbmi-shuf	2.51702	1.99085	1.09241	0.54717	0.49385	0.25181	0.20032	0.15249
builtin-popcnt	0.22182	0.28289	0.26755	0.31640	0.39424	0.38940	0.36062	0.33525
builtin-popcnt32	0.46220	0.46701	0.51513	0.59160	0.89925	0.85613	0.84084	0.84065
builtin-popcnt-unrolled	0.25161	0.17290	0.14147	0.12966	0.20433	0.22086	0.20939	0.20628
builtin-popcnt-movdq	0.21983	0.18868	0.17849	0.18037	0.34305	0.31526	0.29713	0.29047

I've tried copying the best solution into the SimSIMD benchmarking suite and sadly didn't achieve similar improvements on more recent CPUs. On Intel Sapphire Rapids CPUs:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       50.3 ns         50.3 ns    277340752 abs_delta=0 bytes=162.807G/s pairs=19.8739M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           34.8 ns         34.8 ns    402233197 abs_delta=0 bytes=235.632G/s pairs=28.7636M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         42.4 ns         42.4 ns    330077077 abs_delta=0 bytes=193.07G/s pairs=23.5681M/s relative_error=0

On AMD Genoa:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
hamming_b8_haswell_4096b/min_time:10.000/threads:1       40.5 ns         40.5 ns    346163289 abs_delta=0 bytes=202.502G/s pairs=24.7195M/s relative_error=0
hamming_b8_ice_4096b/min_time:10.000/threads:1           40.6 ns         40.6 ns    344646420 abs_delta=0 bytes=201.733G/s pairs=24.6257M/s relative_error=0
hamming_b8_icehs_4096b/min_time:10.000/threads:1         59.8 ns         59.8 ns    234058579 abs_delta=0 bytes=136.96G/s pairs=16.7188M/s relative_error=0

The kernel designed for Haswell simply uses _mm_popcnt_u64.
The kernel designed for Ice Lake uses _mm512_popcnt_epi64.
The icehs kernel is an adaptation of the Harley Seal transform that "zips" the two input streams with XOR.
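To make the "zip with XOR" idea concrete, here is a portable scalar sketch of a Harley Seal style Hamming distance over 64-bit words. It is not the icehs kernel itself: the function names are hypothetical, it handles 8 words per iteration rather than 512-bit registers, and `__builtin_popcountll` assumes GCC or Clang. A vectorized variant would presumably run the same carry-save adder network on wider registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Carry-save adder: per bit position, a + b + c == 2 * high + low. */
static inline void csa(uint64_t *high, uint64_t *low, uint64_t a, uint64_t b, uint64_t c) {
    uint64_t u = a ^ b;
    *high = (a & b) | (u & c);
    *low = u ^ c;
}

/* Reference: one popcount per XOR-ed word. */
size_t hamming_reference(const uint64_t *a, const uint64_t *b, size_t n_words) {
    size_t total = 0;
    for (size_t i = 0; i != n_words; ++i)
        total += (size_t)__builtin_popcountll(a[i] ^ b[i]);
    return total;
}

/* Harley Seal style: XOR the two streams, compress 8 words through a CSA
 * tree, and run a single popcount per 8 words on the "eights" output. */
size_t hamming_harley_seal(const uint64_t *a, const uint64_t *b, size_t n_words) {
    uint64_t ones = 0, twos = 0, fours = 0;
    uint64_t eights, twos_a, twos_b, fours_a, fours_b;
    uint64_t d[8];
    size_t eights_count = 0;
    size_t i = 0;
    for (; i + 8 <= n_words; i += 8) {
        /* "Zip" the two inputs with XOR before feeding the adder tree. */
        for (size_t k = 0; k != 8; ++k) d[k] = a[i + k] ^ b[i + k];
        csa(&twos_a, &ones, ones, d[0], d[1]);
        csa(&twos_b, &ones, ones, d[2], d[3]);
        csa(&fours_a, &twos, twos, twos_a, twos_b);
        csa(&twos_a, &ones, ones, d[4], d[5]);
        csa(&twos_b, &ones, ones, d[6], d[7]);
        csa(&fours_b, &twos, twos, twos_a, twos_b);
        csa(&eights, &fours, fours, fours_a, fours_b);
        eights_count += (size_t)__builtin_popcountll(eights);
    }
    /* Fold the partial counters back in with their bit weights. */
    size_t total = 8 * eights_count
                 + 4 * (size_t)__builtin_popcountll(fours)
                 + 2 * (size_t)__builtin_popcountll(twos)
                 + (size_t)__builtin_popcountll(ones);
    /* Tail: fewer than 8 words remain. */
    for (; i != n_words; ++i)
        total += (size_t)__builtin_popcountll(a[i] ^ b[i]);
    return total;
}
```

The trade-off is that the adder tree replaces seven of every eight popcounts with a handful of bitwise operations. That pays off when popcount is emulated, but it may not when the hardware already offers a fast native vector popcount such as _mm512_popcnt_epi64.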

To reproduce the results:

cmake -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_TESTS=1 -DSIMSIMD_BUILD_BENCHMARKS=1 -DSIMSIMD_BUILD_BENCHMARKS_WITH_CBLAS=1 -B build_release
cmake --build build_release --config Release && build_release/simsimd_bench --benchmark_filter="hamming(.*)4096b"

Please let me know if there is a better way to accelerate this kernel 🤗
