Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
Chances are - you need fast set intersections! It's one of the most common operations in programming, yet one of the hardest to accelerate with SIMD! This PR improves existing kernels and adds new ones for fast set intersections of sorted arrays of unique u16 and u32 values. Now, SimSIMD is not only practically the only production codebase to use Arm SVE, but also one of the first to use the new SVE2 instructions available on Graviton 4 AWS CPUs and coming to Nvidia's Grace Hopper, Microsoft Cobalt, and Google Axion! So upgrade to v5.2 and let's make databases & search systems go 5x faster!
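For context, here is a minimal usage sketch in C, assuming the dispatching entry point simsimd_intersect_u16 mirrors the signature of the per-ISA kernels shown later in this post:
#include <stdio.h>
#include <simsimd/simsimd.h>

int main(void) {
    // Both inputs must be sorted arrays of unique values.
    simsimd_u16_t a[6] = {1, 3, 5, 7, 9, 11};
    simsimd_u16_t b[5] = {2, 3, 5, 8, 11};
    simsimd_distance_t count = 0;
    // Hypothetical dispatching call, mirroring the `simsimd_intersect_u16_neon` signature below.
    simsimd_intersect_u16(a, b, 6, 5, &count);
    printf("intersection size: %.0f\n", count); // expecting 3, for {3, 5, 11}
    return 0;
}
The intersection size is reported through the same simsimd_distance_t output parameter as the other kernels.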
Implementation Details
Move-mask on Arm
On Arm, sadly, intrinsics like vcntq_u32 and vtstq_u32 were useless, but the trick already used in StringZilla to compute the analog of SSE's movemask came in very handy.
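Roughly, the idea is to keep only the top bit of each byte, shift every lane's bit into a distinct position with a per-lane negative shift, and add the lanes horizontally. Here is a hypothetical 8-lane sketch; the 16-lane variant actually used, _simsimd_u8_to_b1_neon, appears in full in the NEON kernel below:
#include <arm_neon.h>

// A minimal sketch of the movemask emulation for 8 byte-lanes; the name `u8x8_movemask_neon` is illustrative only.
static inline unsigned char u8x8_movemask_neon(uint8x8_t vec) {
    int8_t const shifts[8] = {-7, -6, -5, -4, -3, -2, -1, 0}; // lane `i` ends up at bit `i`
    uint8x8_t msb_vec = vand_u8(vec, vdup_n_u8(0x80));        // isolate the sign bit of every byte
    msb_vec = vshl_u8(msb_vec, vld1_s8(shifts));              // negative amounts shift right
    return (unsigned char)vaddv_u8(msb_vec);                  // the bits don't overlap, so addition acts as OR
}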
Matching vs Histograms on Arm
The svmatch_u16 instruction was used to accelerate u16 intersections when SVE2 is available, but it has no equivalent for 32-bit values. The naive approach is to use a combination of svcmpeq_u32 and svext_u32 in a loop to check all possible pairs:
// Alternatively, one can use histogram instructions, like `svhistcnt_u32_z`.
// They practically compute the prefix-matching count, which is equivalent to
// the lower triangle of the intersection matrix.
// To compute the upper triangle, we can reverse (with `svrev_u32`) the order of elements
// in both vectors and repeat the operation, accumulating the results for top and bottom.
svbool_t equal_mask = svpfalse_b();
for (simsimd_size_t i = 0; i < register_size; i++) {
equal_mask = svorr_z(svptrue_b32(), equal_mask, svcmpeq_u32(a_progress, a_vec, b_vec));
b_vec = svext_u32(b_vec, b_vec, 1);
}
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);
A better option is to use the new "histogram" instructions in conjunction with reversals.
They practically compute the prefix-matching count, which is equivalent to the lower triangle of the row-major intersection matrix.
To compute the upper triangle, we can reverse (with svrev_u32) the order of elements in both vectors and repeat the operation, accumulating the results for top and bottom:
// Given α = {A, B, C, D}, β = {X, Y, Z, W}:
//
// hist(α, β): hist(α_rev, β_rev):
//
// X Y Z W W Z Y X
// A 1 0 0 0 D 1 0 0 0
// B 1 1 0 0 C 1 1 0 0
// C 1 1 1 0 B 1 1 1 0
// D 1 1 1 1 A 1 1 1 1
//
svuint32_t hist_lower = svhistcnt_u32_z(a_progress, a_vec, b_vec);
svuint32_t a_rev_vec = svrev_u32(a_vec);
svuint32_t b_rev_vec = svrev_u32(b_vec);
svuint32_t hist_upper = svrev_u32(svhistcnt_u32_z(svptrue_b32(), a_rev_vec, b_rev_vec));
svuint32_t hist = svorr_u32_x(a_progress, hist_lower, hist_upper);
svbool_t equal_mask = svcmpne_n_u32(a_progress, hist, 0);
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);
Portable Population Counts
Arm has no good way of computing the population count or the leading-zero count of a scalar word, akin to _mm_popcnt_u32 and _lzcnt_u32 on x86. One approach is to use compiler intrinsics, like __builtin_popcountll and __builtin_clzll, but those are specific to GCC and Clang. Adding variants for MSVC doesn't help much, as on some platforms the MSVC intrinsics are not available. StringZilla uses the following wrappers:
/* Intrinsics aliases for MSVC, GCC, Clang, and Clang-Cl.
* The following section of compiler intrinsics comes in 2 flavors.
*/
#if defined(_MSC_VER) && !defined(__clang__) // MSVC proper, not Clang-CL
#include <intrin.h>
// Sadly, when building Win32 images, we can't use the `_tzcnt_u64`, `_lzcnt_u64`,
// `_BitScanForward64`, or `_BitScanReverse64` intrinsics. For now it's a simple `for`-loop.
#if (defined(_WIN32) && !defined(_WIN64)) || defined(_M_ARM) || defined(_M_ARM64)
SZ_INTERNAL int sz_u64_ctz(sz_u64_t x) {
sz_assert(x != 0);
int n = 0;
while ((x & 1) == 0) { n++, x >>= 1; }
return n;
}
SZ_INTERNAL int sz_u64_clz(sz_u64_t x) {
sz_assert(x != 0);
int n = 0;
while ((x & 0x8000000000000000ull) == 0) { n++, x <<= 1; }
return n;
}
SZ_INTERNAL int sz_u64_popcount(sz_u64_t x) {
x = x - ((x >> 1) & 0x5555555555555555ull);
x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
return (((x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full) * 0x0101010101010101ull) >> 56;
}
SZ_INTERNAL int sz_u32_ctz(sz_u32_t x) {
sz_assert(x != 0);
int n = 0;
while ((x & 1) == 0) { n++, x >>= 1; }
return n;
}
SZ_INTERNAL int sz_u32_clz(sz_u32_t x) {
sz_assert(x != 0);
int n = 0;
while ((x & 0x80000000u) == 0) { n++, x <<= 1; }
return n;
}
SZ_INTERNAL int sz_u32_popcount(sz_u32_t x) {
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
return (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
#else
SZ_INTERNAL int sz_u64_ctz(sz_u64_t x) { return (int)_tzcnt_u64(x); }
SZ_INTERNAL int sz_u64_clz(sz_u64_t x) { return (int)_lzcnt_u64(x); }
SZ_INTERNAL int sz_u64_popcount(sz_u64_t x) { return (int)__popcnt64(x); }
SZ_INTERNAL int sz_u32_ctz(sz_u32_t x) { return (int)_tzcnt_u32(x); }
SZ_INTERNAL int sz_u32_clz(sz_u32_t x) { return (int)_lzcnt_u32(x); }
SZ_INTERNAL int sz_u32_popcount(sz_u32_t x) { return (int)__popcnt(x); }
#endif
#else
SZ_INTERNAL int sz_u64_popcount(sz_u64_t x) { return __builtin_popcountll(x); }
SZ_INTERNAL int sz_u32_popcount(sz_u32_t x) { return __builtin_popcount(x); }
SZ_INTERNAL int sz_u64_ctz(sz_u64_t x) { return __builtin_ctzll(x); }
SZ_INTERNAL int sz_u64_clz(sz_u64_t x) { return __builtin_clzll(x); }
SZ_INTERNAL int sz_u32_ctz(sz_u32_t x) { return __builtin_ctz(x); } // ! Undefined if `x == 0`
SZ_INTERNAL int sz_u32_clz(sz_u32_t x) { return __builtin_clz(x); } // ! Undefined if `x == 0`
#endif
We can use inline Assembly to invoke the instruction directly, at the cost of a tiny 2% performance reduction.
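For the leading-zero count, for instance, such a wrapper (it also appears as a comment inside the NEON kernel below) looks roughly like this on AArch64 with GCC or Clang:
SIMSIMD_INTERNAL int _simsimd_clz_u64(simsimd_u64_t value) {
    simsimd_u64_t result;
    __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
    return (int)result;
}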
But the problem is that MSVC doesn't support inline Assembly 🤦:
D:\a\SimSIMD\SimSIMD\include\simsimd\sparse.h(408): warning C4013: '__asm__' undefined; assuming extern returning int
D:\a\SimSIMD\SimSIMD\include\simsimd\sparse.h(408): error C2143: syntax error: missing ')' before ':'
We can also avoid population counts on every cycle and only aggregate them at the end:
// Now we are likely to have some overlap, so we can intersect the registers.
// We can do it by performing a population count at every cycle, but it's not the cheapest in terms of cycles.
//
// simsimd_u64_t a_matches = __builtin_popcountll(
// _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(
// _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8))));
// c += a_matches / 8;
//
// Alternatively, we can transform match-masks into "ones", accumulate them between the cycles,
// and merge them all together in the end.
uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));
One more idea I tried was using CLZ on Arm to compute two values at a time:
SIMSIMD_INTERNAL simsimd_u32_t _simsimd_u8_to_b1_neon(uint8x16_t vec) {
simsimd_i8_t const
__attribute__((aligned(16))) shift_table[16] = {-7, -6, -5, -4, -3, -2, -1, 0, -7, -6, -5, -4, -3, -2, -1, 0};
int8x16_t shift_vec = vld1q_s8(shift_table);
uint8x16_t mask_vec = vandq_u8(vec, vdupq_n_u8(0x80));
mask_vec = vshlq_u8(mask_vec, shift_vec);
simsimd_u32_t out = vaddv_u8(vget_low_u8(mask_vec));
out += (vaddv_u8(vget_high_u8(mask_vec)) << 8);
return out;
}
SIMSIMD_INTERNAL uint16x8_t _simsimd_intersect_u16x8_neon(uint16x8_t a, uint16x8_t b) {
uint16x8_t b1 = vextq_u16(b, b, 1);
uint16x8_t b2 = vextq_u16(b, b, 2);
uint16x8_t b3 = vextq_u16(b, b, 3);
uint16x8_t b4 = vextq_u16(b, b, 4);
uint16x8_t b5 = vextq_u16(b, b, 5);
uint16x8_t b6 = vextq_u16(b, b, 6);
uint16x8_t b7 = vextq_u16(b, b, 7);
uint16x8_t nm00 = vceqq_u16(a, b);
uint16x8_t nm01 = vceqq_u16(a, b1);
uint16x8_t nm02 = vceqq_u16(a, b2);
uint16x8_t nm03 = vceqq_u16(a, b3);
uint16x8_t nm04 = vceqq_u16(a, b4);
uint16x8_t nm05 = vceqq_u16(a, b5);
uint16x8_t nm06 = vceqq_u16(a, b6);
uint16x8_t nm07 = vceqq_u16(a, b7);
uint16x8_t nm = vorrq_u16(vorrq_u16(vorrq_u16(nm00, nm01), vorrq_u16(nm02, nm03)),
vorrq_u16(vorrq_u16(nm04, nm05), vorrq_u16(nm06, nm07)));
return nm;
}
SIMSIMD_PUBLIC void simsimd_intersect_u16_neon(simsimd_u16_t const* a, simsimd_u16_t const* b, simsimd_size_t a_length,
simsimd_size_t b_length, simsimd_distance_t* results) {
// The baseline implementation for very small arrays (under 32 elements per side) can be quite simple:
if (a_length < 32 && b_length < 32) {
simsimd_intersect_u16_serial(a, b, a_length, b_length, results);
return;
}
simsimd_u16_t const* const a_end = a + a_length;
simsimd_u16_t const* const b_end = b + b_length;
union vec_t {
uint16x8_t u16x8;
simsimd_u16_t u16[8];
simsimd_u8_t u8[16];
} a_vec, b_vec, c_counts_vec;
c_counts_vec.u16x8 = vdupq_n_u16(0);
while (a + 8 < a_end && b + 8 < b_end) {
a_vec.u16x8 = vld1q_u16(a);
b_vec.u16x8 = vld1q_u16(b);
// Intersecting registers with `_simsimd_intersect_u16x8_neon` involves a lot of shuffling
// and comparisons, so we want to avoid it if the slices don't overlap at all.
simsimd_u16_t a_min;
simsimd_u16_t a_max = a_vec.u16[7];
simsimd_u16_t b_min = b_vec.u16[0];
simsimd_u16_t b_max = b_vec.u16[7];
// If the slices don't overlap, advance the appropriate pointer
while (a_max < b_min && a + 16 < a_end) {
a += 8;
a_vec.u16x8 = vld1q_u16(a);
a_max = a_vec.u16[7];
}
a_min = a_vec.u16[0];
while (b_max < a_min && b + 16 < b_end) {
b += 8;
b_vec.u16x8 = vld1q_u16(b);
b_max = b_vec.u16[7];
}
b_min = b_vec.u16[0];
// Now we are likely to have some overlap, so we can intersect the registers.
// We can do it by performing a population count at every cycle, but it's not the cheapest in terms of cycles.
//
// simsimd_u64_t a_matches = __builtin_popcountll(
// _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(
// _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8))));
// c += a_matches / 8;
//
// Alternatively, we can transform match-masks into "ones", accumulate them between the cycles,
// and merge them all together in the end.
uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));
// Counting leading zeros is tricky. On Arm we can use inline Assembly to get the result,
// but MSVC doesn't support that:
//
// SIMSIMD_INTERNAL int _simsimd_clz_u64(simsimd_u64_t value) {
// simsimd_u64_t result;
// __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
// return (int)result;
// }
//
// Alternatively, we can use the `vclz_u32` NEON intrinsic.
// It computes the number of leading zeros for both the `a` and `b` steps in parallel.
uint16x8_t a_last_broadcasted = vdupq_n_u16(a_max);
uint16x8_t b_last_broadcasted = vdupq_n_u16(b_max);
union {
uint32x2_t u32x2;
simsimd_u32_t u32[2];
} a_and_b_step;
a_and_b_step.u32[0] = _simsimd_u8_to_b1_neon(vreinterpretq_u8_u16(vcleq_u16(a_vec.u16x8, b_last_broadcasted)));
a_and_b_step.u32[1] = _simsimd_u8_to_b1_neon(vreinterpretq_u8_u16(vcleq_u16(b_vec.u16x8, a_last_broadcasted)));
a_and_b_step.u32x2 = vclz_u32(a_and_b_step.u32x2);
a += (32 - a_and_b_step.u32[0]) / 2;
b += (32 - a_and_b_step.u32[1]) / 2;
}
simsimd_intersect_u16_serial(a, b, a_end - a, b_end - b, results);
*results += vaddvq_u16(c_counts_vec.u16x8);
}
This performed very poorly and lost 50% of performance.
Speedups on x86
The new AVX-512 variant shows significant improvements in pairs/s across all benchmarks:
For |A|=128, |B|=128, |A∩B|=1, pairs/s increased from 1.14M/s in the old implementation to 7.73M/s in the new one, a 6.7x improvement.
At |A∩B|=64, pairs/s rose from 1.13M/s in the old implementation to 8.19M/s in the new one, a 7.2x gain.
For larger sets, like |A|=1024 and |B|=8192 with |A∩B|=10, pairs/s increased from 130.18k/s in the old implementation to 194.50k/s in the new one, a 49% gain.
However, in cases like |A|=128, |B|=8192 with |A∩B|=64, pairs/s decreased from 369.7k/s to 222.9k/s. Overall, the new implementation outperforms the previous one, and no case is worse than the serial version.
Speedups on Arm
On the Arm architecture, similar performance gains were achieved using the NEON and SVE2 instruction sets:
The optimized NEON implementation showed a 3.2x improvement in pairs/s for |A|=128, |B|=128, |A∩B|=1, going from 1.62M/s to 5.12M/s.
For |A∩B|=64 in the same configuration, performance improved from 1.60M/s to 5.51M/s, showing a 3.4x gain.
The SVE2 implementation also outperformed the previous SVE setup, reaching 5.6M/s versus the old SVE kernel's 1.27M/s for |A|=128, |B|=128, |A∩B|=1, a 4.4x improvement.
On larger datasets, such as |A|=1024 and |B|=8192, pairs/s increased from 49.3k/s to 110.03k/s with NEON and to 109.47k/s with SVE2, more than doubling the performance.
x86 Benchmarking Setup
The benchmarking was conducted on r7iz AWS instances with Intel Sapphire Rapids CPUs.
Running build_release/simsimd_bench
Run on (16 X 3900.51 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 2048 KiB (x8)
L3 Unified 61440 KiB (x1)
Old Serial Baselines
Existing AVX-512 Implementation
New AVX-512 Implementation
Arm Benchmarking Setup
The benchmarking was conducted on r8g AWS instances with Graviton 4 CPUs.
Old Serial Baselines
Old SVE Implementation
New NEON Implementation
New SVE2 Implementation