5x Faster Set Intersections in AVX-512 & SVE2 #174

Closed: ashvardanian closed this 2 months ago

ashvardanian commented 2 months ago

(Image: set-intersections)

Chances are, you need fast set intersections! It's one of the most common operations in programming, yet one of the hardest to accelerate with SIMD! This PR improves the existing kernels and adds new ones for fast set intersections over sorted arrays of unique u16 and u32 values. SimSIMD is now not only practically the only production codebase to use Arm SVE, but also one of the first to use the new SVE2 instructions available on Graviton 4 AWS CPUs and coming to Nvidia's Grace Hopper, Microsoft Cobalt, and Google Axion! So upgrade to v5.2 and let's make databases & search systems go 5x faster!

Implementation Details

Move-mask on Arm

On Arm, sadly, intrinsics like vcntq_u32 and vtstq_u32 turned out to be useless here, but the trick already used in StringZilla to compute an analog of SSE's movemask came in very handy:

// Compresses a 16-byte comparison mask into a 64-bit integer with 4 bits per input byte -
// a NEON analog of SSE's `movemask`.
SIMSIMD_INTERNAL simsimd_u64_t _simsimd_u8_to_u4_neon(uint8x16_t vec) {
    return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(vec), 4)), 0);
}
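
For illustration only (not part of the PR), here is a minimal sketch of how such a compressed mask can be consumed, assuming GCC or Clang for __builtin_popcountll: every 0xFF byte of a comparison mask contributes exactly 4 set bits, so a population count divided by 4 yields the number of matching byte lanes. The _example_count_equal_bytes helper is hypothetical.

// Hypothetical helper, not in SimSIMD: counts equal byte lanes between two NEON registers.
SIMSIMD_INTERNAL int _example_count_equal_bytes(uint8x16_t a_bytes, uint8x16_t b_bytes) {
    uint8x16_t eq_mask = vceqq_u8(a_bytes, b_bytes);            // 0xFF where the bytes match
    simsimd_u64_t compressed = _simsimd_u8_to_u4_neon(eq_mask); // 4 bits per input byte
    return __builtin_popcountll(compressed) / 4;                // number of equal byte lanes
}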

Matching vs Histograms on Arm

The svmatch_u16 intrinsic is used to accelerate intersections when SVE2 is available, but it has no equivalent for 32-bit values. The naive workaround is a combination of svcmpeq_u32 and svext_u32 in a loop, checking all possible pairs of elements:

// Alternatively, one can use histogram instructions, like `svhistcnt_u32_z`.
// They practically compute the prefix-matching count, which is equivalent to
// the lower triangle of the row-major intersection matrix.
// To compute the upper triangle, we can reverse (with `svrev_u32`) the order of elements
// in both vectors and repeat the operation, accumulating the results for top and bottom.
svbool_t equal_mask = svpfalse_b();
for (simsimd_size_t i = 0; i < register_size; i++) {
    equal_mask = svorr_z(svptrue_b32(), equal_mask, svcmpeq_u32(a_progress, a_vec, b_vec));
    b_vec = svext_u32(b_vec, b_vec, 1);
}
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);

A better option is to use the new "histogram" instructions in conjunction with reversals. They practically compute the prefix-matching count, which is equivalent to the lower triangle of the row-major intersection matrix. To compute the upper triangle, we can reverse (with svrev_u32) the order of elements in both vectors and repeat the operation, accumulating the results for top and bottom:

//      Given α = {A, B, C, D}, β = {X, Y, Z, W}:
//
//      hist(α, β):           hist(α_rev, β_rev):
//
//        X Y Z W               W Z Y X
//      A 1 0 0 0             D 1 0 0 0
//      B 1 1 0 0             C 1 1 0 0
//      C 1 1 1 0             B 1 1 1 0
//      D 1 1 1 1             A 1 1 1 1
//
svuint32_t hist_lower = svhistcnt_u32_z(a_progress, a_vec, b_vec);
svuint32_t a_rev_vec = svrev_u32(a_vec);
svuint32_t b_rev_vec = svrev_u32(b_vec);
svuint32_t hist_upper = svrev_u32(svhistcnt_u32_z(svptrue_b32(), a_rev_vec, b_rev_vec));
svuint32_t hist = svorr_u32_x(a_progress, hist_lower, hist_upper);
svbool_t equal_mask = svcmpne_n_u32(a_progress, hist, 0);
simsimd_size_t equal_count = svcntp_b32(a_progress, equal_mask);

Portable Population Counts

Arm has no good way of computing the population count or the leading-zero count of a scalar bitset, similar to _mm_popcnt_u32 and _lzcnt_u32 on x86. One approach is to use compiler built-ins, like __builtin_popcountll and __builtin_clzll, but those are specific to GCC and Clang. Adding variants for MSVC doesn't help much, as on some platforms the MSVC intrinsics are not available either. In StringZilla the following wrappers are used:

/*  Intrinsics aliases for MSVC, GCC, Clang, and Clang-Cl.
 *  The following section of compiler intrinsics comes in 2 flavors.
 */
#if defined(_MSC_VER) && !defined(__clang__) // MSVC proper, not Clang-CL
#include <intrin.h>
// Sadly, when building Win32 images, we can't use the `_tzcnt_u64`, `_lzcnt_u64`,
// `_BitScanForward64`, or `_BitScanReverse64` intrinsics. For now it's a simple `for`-loop.
#if (defined(_WIN32) && !defined(_WIN64)) || defined(_M_ARM) || defined(_M_ARM64)
SZ_INTERNAL int sz_u64_ctz(sz_u64_t x) {
    sz_assert(x != 0);
    int n = 0;
    while ((x & 1) == 0) { n++, x >>= 1; }
    return n;
}
SZ_INTERNAL int sz_u64_clz(sz_u64_t x) {
    sz_assert(x != 0);
    int n = 0;
    while ((x & 0x8000000000000000ull) == 0) { n++, x <<= 1; }
    return n;
}
SZ_INTERNAL int sz_u64_popcount(sz_u64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    return (((x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full) * 0x0101010101010101ull) >> 56;
}
SZ_INTERNAL int sz_u32_ctz(sz_u32_t x) {
    sz_assert(x != 0);
    int n = 0;
    while ((x & 1) == 0) { n++, x >>= 1; }
    return n;
}
SZ_INTERNAL int sz_u32_clz(sz_u32_t x) {
    sz_assert(x != 0);
    int n = 0;
    while ((x & 0x80000000u) == 0) { n++, x <<= 1; }
    return n;
}
SZ_INTERNAL int sz_u32_popcount(sz_u32_t x) {
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    return (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
#else
SZ_INTERNAL int sz_u64_ctz(sz_u64_t x) { return (int)_tzcnt_u64(x); }
SZ_INTERNAL int sz_u64_clz(sz_u64_t x) { return (int)_lzcnt_u64(x); }
SZ_INTERNAL int sz_u64_popcount(sz_u64_t x) { return (int)__popcnt64(x); }
SZ_INTERNAL int sz_u32_ctz(sz_u32_t x) { return (int)_tzcnt_u32(x); }
SZ_INTERNAL int sz_u32_clz(sz_u32_t x) { return (int)_lzcnt_u32(x); }
SZ_INTERNAL int sz_u32_popcount(sz_u32_t x) { return (int)__popcnt(x); }
#endif
#else
SZ_INTERNAL int sz_u64_popcount(sz_u64_t x) { return __builtin_popcountll(x); }
SZ_INTERNAL int sz_u32_popcount(sz_u32_t x) { return __builtin_popcount(x); }
SZ_INTERNAL int sz_u64_ctz(sz_u64_t x) { return __builtin_ctzll(x); }
SZ_INTERNAL int sz_u64_clz(sz_u64_t x) { return __builtin_clzll(x); }
SZ_INTERNAL int sz_u32_ctz(sz_u32_t x) { return __builtin_ctz(x); } // ! Undefined if `x == 0`
SZ_INTERNAL int sz_u32_clz(sz_u32_t x) { return __builtin_clz(x); } // ! Undefined if `x == 0`
#endif
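
As an illustration only (not from SimSIMD or StringZilla, and mixing their naming purely for the example), here is how such a wrapper can be combined with the NEON mask-compression helper from earlier: each matching u16 lane occupies a full byte (8 bits) of the compressed mask, so the trailing-zero count divided by 8 gives the index of the first matching lane.

// Hypothetical helper: index of the first equal u16 lane between two NEON registers, or -1.
SIMSIMD_INTERNAL int _example_first_equal_lane(uint16x8_t a_vec, uint16x8_t b_vec) {
    simsimd_u64_t mask = _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(vceqq_u16(a_vec, b_vec)));
    return mask ? sz_u64_ctz(mask) / 8 : -1; // 8 bits per u16 lane in the compressed mask
}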

On Arm, we can use inline assembly to invoke the CLZ instruction directly. This results in only a tiny ~2% performance reduction:

SIMSIMD_INTERNAL int _simsimd_clz_u64(simsimd_u64_t value) {
    simsimd_u64_t result;
    __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
    return (int)result;
}

But the problem is that MSVC doesn't support GCC-style inline assembly 🤦:

        D:\a\SimSIMD\SimSIMD\include\simsimd\sparse.h(408): warning C4013: '__asm__' undefined; assuming extern returning int
        D:\a\SimSIMD\SimSIMD\include\simsimd\sparse.h(408): error C2143: syntax error: missing ')' before ':'
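
One possible workaround (a sketch, not the code merged in this PR) is to guard the assembly behind a compiler check and fall back to a portable bit-twiddling loop like the StringZilla ones above; it assumes the translation unit is only compiled for AArch64 targets, as the surrounding kernels are.

#if defined(__GNUC__) || defined(__clang__)
SIMSIMD_INTERNAL int _simsimd_clz_u64(simsimd_u64_t value) {
    simsimd_u64_t result;
    __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
    return (int)result;
}
#else
SIMSIMD_INTERNAL int _simsimd_clz_u64(simsimd_u64_t value) {
    // Portable fallback for compilers without GCC-style inline assembly.
    // Assumes `value != 0`, like the StringZilla wrappers above.
    int n = 0;
    while ((value & 0x8000000000000000ull) == 0) { n++, value <<= 1; }
    return n;
}
#endif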

We can also avoid population counts on every iteration and only aggregate the matches at the end:

// Now we are likely to have some overlap, so we can intersect the registers.
// We could do that with a population count on every iteration, but it's not the cheapest option in terms of CPU cycles.
//
//      simsimd_u64_t a_matches = __builtin_popcountll(
//          _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(
//              _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8))));
//      c += a_matches / 8;
//
// Alternatively, we can transform the match-masks into "ones", accumulate them between iterations,
// and merge everything together at the end.
uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));
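
The per-lane u16 counters are then merged with a single horizontal addition, mirroring the last lines of the full kernel shown below:

simsimd_size_t vectorized_matches = vaddvq_u16(c_counts_vec.u16x8); // sum of all 8 lane counters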

One more idea I tried was using CLZ on Arm to compute two step sizes at a time:

// Converts a 16-byte comparison mask into a bitmask with 1 bit per input byte -
// another NEON analog of SSE's `_mm_movemask_epi8`, used below to derive the pointer steps.
SIMSIMD_INTERNAL simsimd_u32_t _simsimd_u8_to_b1_neon(uint8x16_t vec) {
    simsimd_i8_t const
        __attribute__((aligned(16))) shift_table[16] = {-7, -6, -5, -4, -3, -2, -1, 0, -7, -6, -5, -4, -3, -2, -1, 0};
    int8x16_t shift_vec = vld1q_s8(shift_table);
    uint8x16_t mask_vec = vandq_u8(vec, vdupq_n_u8(0x80));
    mask_vec = vshlq_u8(mask_vec, shift_vec);
    simsimd_u32_t out = vaddv_u8(vget_low_u8(mask_vec));
    out += (vaddv_u8(vget_high_u8(mask_vec)) << 8);
    return out;
}

SIMSIMD_INTERNAL uint16x8_t _simsimd_intersect_u16x8_neon(uint16x8_t a, uint16x8_t b) {
    uint16x8_t b1 = vextq_u16(b, b, 1);
    uint16x8_t b2 = vextq_u16(b, b, 2);
    uint16x8_t b3 = vextq_u16(b, b, 3);
    uint16x8_t b4 = vextq_u16(b, b, 4);
    uint16x8_t b5 = vextq_u16(b, b, 5);
    uint16x8_t b6 = vextq_u16(b, b, 6);
    uint16x8_t b7 = vextq_u16(b, b, 7);
    uint16x8_t nm00 = vceqq_u16(a, b);
    uint16x8_t nm01 = vceqq_u16(a, b1);
    uint16x8_t nm02 = vceqq_u16(a, b2);
    uint16x8_t nm03 = vceqq_u16(a, b3);
    uint16x8_t nm04 = vceqq_u16(a, b4);
    uint16x8_t nm05 = vceqq_u16(a, b5);
    uint16x8_t nm06 = vceqq_u16(a, b6);
    uint16x8_t nm07 = vceqq_u16(a, b7);
    uint16x8_t nm = vorrq_u16(vorrq_u16(vorrq_u16(nm00, nm01), vorrq_u16(nm02, nm03)),
                              vorrq_u16(vorrq_u16(nm04, nm05), vorrq_u16(nm06, nm07)));
    return nm;
}

SIMSIMD_PUBLIC void simsimd_intersect_u16_neon(simsimd_u16_t const* a, simsimd_u16_t const* b, simsimd_size_t a_length,
                                               simsimd_size_t b_length, simsimd_distance_t* results) {

    // The baseline serial implementation is good enough for very small arrays (under 32 elements per side):
    if (a_length < 32 && b_length < 32) {
        simsimd_intersect_u16_serial(a, b, a_length, b_length, results);
        return;
    }

    simsimd_u16_t const* const a_end = a + a_length;
    simsimd_u16_t const* const b_end = b + b_length;
    union vec_t {
        uint16x8_t u16x8;
        simsimd_u16_t u16[8];
        simsimd_u8_t u8[16];
    } a_vec, b_vec, c_counts_vec;
    c_counts_vec.u16x8 = vdupq_n_u16(0);

    while (a + 8 < a_end && b + 8 < b_end) {
        a_vec.u16x8 = vld1q_u16(a);
        b_vec.u16x8 = vld1q_u16(b);

        // Intersecting registers with `_simsimd_intersect_u16x8_neon` involves a lot of shuffling
        // and comparisons, so we want to avoid it if the slices don't overlap at all.
        simsimd_u16_t a_min;
        simsimd_u16_t a_max = a_vec.u16[7];
        simsimd_u16_t b_min = b_vec.u16[0];
        simsimd_u16_t b_max = b_vec.u16[7];

        // If the slices don't overlap, advance the appropriate pointer
        while (a_max < b_min && a + 16 < a_end) {
            a += 8;
            a_vec.u16x8 = vld1q_u16(a);
            a_max = a_vec.u16[7];
        }
        a_min = a_vec.u16[0];
        while (b_max < a_min && b + 16 < b_end) {
            b += 8;
            b_vec.u16x8 = vld1q_u16(b);
            b_max = b_vec.u16[7];
        }
        b_min = b_vec.u16[0];

        // Now we are likely to have some overlap, so we can intersect the registers.
        // We could do that with a population count on every iteration, but it's not the cheapest option in terms of CPU cycles.
        //
        //      simsimd_u64_t a_matches = __builtin_popcountll(
        //          _simsimd_u8_to_u4_neon(vreinterpretq_u8_u16(
        //              _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8))));
        //      c += a_matches / 8;
        //
        // Alternatively, we can transform the match-masks into "ones", accumulate them between iterations,
        // and merge everything together at the end.
        uint16x8_t a_matches = _simsimd_intersect_u16x8_neon(a_vec.u16x8, b_vec.u16x8);
        c_counts_vec.u16x8 = vaddq_u16(c_counts_vec.u16x8, vandq_u16(a_matches, vdupq_n_u16(1)));

        // Counting leading zeros is tricky. On Arm we can use inline Assembly to get the result,
        // but MSVC doesn't support that:
        //
        //      SIMSIMD_INTERNAL int _simsimd_clz_u64(simsimd_u64_t value) {
        //          simsimd_u64_t result;
        //          __asm__("clz %x0, %x1" : "=r"(result) : "r"(value));
        //          return (int)result;
        //      }
        //
        // Alternatively, we can use the `vclz_u32` NEON intrinsic.
        // It computes the number of leading zeros for both step masks in parallel.
        uint16x8_t a_last_broadcasted = vdupq_n_u16(a_max);
        uint16x8_t b_last_broadcasted = vdupq_n_u16(b_max);
        union {
            uint32x2_t u32x2;
            simsimd_u32_t u32[2];
        } a_and_b_step;
        a_and_b_step.u32[0] = _simsimd_u8_to_b1_neon(vreinterpretq_u8_u16(vcleq_u16(a_vec.u16x8, b_last_broadcasted)));
        a_and_b_step.u32[1] = _simsimd_u8_to_b1_neon(vreinterpretq_u8_u16(vcleq_u16(b_vec.u16x8, a_last_broadcasted)));
        a_and_b_step.u32x2 = vclz_u32(a_and_b_step.u32x2);
        a += (32 - a_and_b_step.u32[0]) / 2;
        b += (32 - a_and_b_step.u32[1]) / 2;
    }

    simsimd_intersect_u16_serial(a, b, a_end - a, b_end - b, results);
    *results += vaddvq_u16(c_counts_vec.u16x8);
}

This idea performed very poorly, losing about 50% of the throughput.
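
Regardless of which internal variant ends up being used, the public kernel is invoked the same way. A hypothetical usage sketch (the values are made up, and it assumes the kernel reports the intersection size via `results`, as the code above suggests):

simsimd_u16_t a[] = {3, 5, 10, 110, 1024};       // sorted, unique
simsimd_u16_t b[] = {5, 6, 10, 100, 1024, 1025}; // sorted, unique
simsimd_distance_t count = 0;
simsimd_intersect_u16_neon(a, b, 5, 6, &count);  // count == 3, for {5, 10, 1024}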

Speedups on x86

The new AVX-512 variant shows significant improvements in pairs/s across almost all benchmarks: for |A|=1024 and |B|=1024, for example, throughput grows from ~136k pairs/s with the previous AVX-512 kernel to over 1.2M pairs/s, more than 5x the serial baseline.

However, in cases like |A|=128, |B|=8192, with |A∩B|=64, pairs/s decreased from 369.7k/s to 222.9k/s. Overall, the new implementation outperforms the previous one, and no case is worse than the serial version.

Speedups on Arm

On the Arm architecture, similar performance gains were achieved using the NEON and SVE2 instruction sets: for |A|=128 and |B|=128, for example, throughput grows from ~1.6M pairs/s with the serial baseline to 5-6M pairs/s with NEON and up to ~7M pairs/s with SVE2.

x86 Benchmarking Setup

The benchmarking was conducted on r7iz AWS instances with Intel Sapphire Rapids CPUs.

Running build_release/simsimd_bench
Run on (16 X 3900.51 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 2048 KiB (x8)
  L3 Unified 61440 KiB (x1)

Old Serial Baselines

-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1             567 ns          567 ns     24785678 pairs=1.76263M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1             567 ns          567 ns     24598141 pairs=1.76286M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1            569 ns          569 ns     24741572 pairs=1.75684M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1           568 ns          568 ns     24871638 pairs=1.76073M/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1           2508 ns         2508 ns      5591748 pairs=398.803k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1           2509 ns         2509 ns      5589871 pairs=398.535k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1          2530 ns         2530 ns      5564535 pairs=395.33k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1         2522 ns         2522 ns      5532306 pairs=396.447k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1           4791 ns         4791 ns      2920833 pairs=208.737k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1           4800 ns         4800 ns      2923139 pairs=208.346k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1          4821 ns         4820 ns      2906942 pairs=207.448k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1         4843 ns         4843 ns      2897334 pairs=206.504k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1         4484 ns         4484 ns      3122873 pairs=223.023k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1         4479 ns         4479 ns      3124662 pairs=223.261k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1        4484 ns         4484 ns      3125584 pairs=223.034k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1        4500 ns         4500 ns      3104588 pairs=222.229k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1        20118 ns        20117 ns       696244 pairs=49.7084k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1        20134 ns        20134 ns       696160 pairs=49.6682k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1       20125 ns        20124 ns       695799 pairs=49.6911k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1       20102 ns        20102 ns       695762 pairs=49.7464k/s

Existing AVX-512 Implementation

-------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1              875 ns          875 ns     16248886 pairs=1.14342M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1              873 ns          873 ns     16081249 pairs=1.14555M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1             882 ns          882 ns     15851609 pairs=1.13354M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1            916 ns          916 ns     15282595 pairs=1091.32k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1             955 ns          955 ns     14660187 pairs=1047.53k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1             955 ns          955 ns     14663375 pairs=1047.57k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1            952 ns          952 ns     14702462 pairs=1050.17k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1           949 ns          949 ns     14743103 pairs=1053.59k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1            2718 ns         2718 ns      5168053 pairs=367.871k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1            2698 ns         2698 ns      5155819 pairs=370.664k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1           2705 ns         2705 ns      5203675 pairs=369.686k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1          2693 ns         2693 ns      5187007 pairs=371.377k/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1          7310 ns         7310 ns      1910292 pairs=136.8k/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1          7312 ns         7312 ns      1913190 pairs=136.759k/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1         7365 ns         7365 ns      1900946 pairs=135.781k/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1         7439 ns         7439 ns      1882319 pairs=134.43k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1          7682 ns         7681 ns      1821784 pairs=130.183k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1          7695 ns         7695 ns      1821861 pairs=129.955k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1         7643 ns         7643 ns      1829955 pairs=130.842k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1         7617 ns         7617 ns      1838612 pairs=131.279k/s

New AVX-512 Implementation

--------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1             129 ns          129 ns    101989513 pairs=7.72559M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1             134 ns          134 ns    107140278 pairs=7.46949M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1            122 ns          122 ns    113134485 pairs=8.18634M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1           114 ns          114 ns    122765163 pairs=8.75268M/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1           1042 ns         1042 ns     13412933 pairs=959.711k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1           1035 ns         1035 ns     13423867 pairs=966.278k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1          1038 ns         1038 ns     13401265 pairs=963.267k/s
intersect_u16_ice<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1         1055 ns         1055 ns     13170438 pairs=948.193k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1           4315 ns         4315 ns      3024069 pairs=231.776k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1           3999 ns         3999 ns      3371134 pairs=250.088k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1          4486 ns         4486 ns      3278143 pairs=222.9k/s
intersect_u16_ice<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1         4525 ns         4525 ns      3170802 pairs=220.991k/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1          817 ns          817 ns     17102654 pairs=1.22419M/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1          820 ns          820 ns     17168886 pairs=1.22003M/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1         793 ns          793 ns     17756237 pairs=1.26107M/s
intersect_u16_ice<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1         747 ns          747 ns     18261381 pairs=1.33794M/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1         5142 ns         5142 ns      2728465 pairs=194.496k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1         5114 ns         5114 ns      2727670 pairs=195.56k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1        5142 ns         5142 ns      2716714 pairs=194.491k/s
intersect_u16_ice<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1        5151 ns         5151 ns      2721708 pairs=194.148k/s

Arm Benchmarking Setup

The benchmarking was conducted on r8g AWS instances with Graviton 4 CPUs.

Running build_release/simsimd_bench
Run on (2 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x2)
  L1 Instruction 64 KiB (x2)
  L2 Unified 2048 KiB (x2)
  L3 Unified 36864 KiB (x1)

Old Serial Baselines

----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                            Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1            615 ns          614 ns     22780083 pairs=1.62833M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1            610 ns          608 ns     22727971 pairs=1.64341M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1           622 ns          622 ns     22356453 pairs=1.60786M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1          679 ns          679 ns     20641056 pairs=1.47332M/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1          2542 ns         2542 ns      5511491 pairs=393.332k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1          2539 ns         2539 ns      5512132 pairs=393.822k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1         2535 ns         2535 ns      5511950 pairs=394.436k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1        2546 ns         2546 ns      5504586 pairs=392.843k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1          4122 ns         4122 ns      3374465 pairs=242.586k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1          4117 ns         4117 ns      3372418 pairs=242.884k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1         4138 ns         4138 ns      3374977 pairs=241.657k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1        4142 ns         4142 ns      3361656 pairs=241.412k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1        4569 ns         4564 ns      3072148 pairs=219.129k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1        4557 ns         4557 ns      3075313 pairs=219.419k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1       4577 ns         4577 ns      3052064 pairs=218.472k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1       4728 ns         4728 ns      2980530 pairs=211.504k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1       20278 ns        20273 ns       690191 pairs=49.3276k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1       21192 ns        20272 ns       691680 pairs=49.3302k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1      21438 ns        20268 ns       689617 pairs=49.3384k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1      22010 ns        20317 ns       692675 pairs=49.2207k/s

Old SVE Implementation

-------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                         Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1            794 ns          788 ns     17715501 pairs=1.26918M/s
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1            809 ns          785 ns     17579527 pairs=1.27438M/s
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1           819 ns          810 ns     17229391 pairs=1.23482M/s
intersect_u16_sve<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1          878 ns          856 ns     16347952 pairs=1.16827M/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1          1475 ns         1380 ns     10129190 pairs=724.869k/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1          1400 ns         1361 ns     10312201 pairs=734.514k/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1         1353 ns         1344 ns     10427410 pairs=743.793k/s
intersect_u16_sve<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1        1369 ns         1350 ns     10516190 pairs=740.815k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1          7156 ns         7009 ns      1991602 pairs=142.677k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1          7095 ns         6982 ns      2006057 pairs=143.232k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1         7328 ns         6967 ns      2004803 pairs=143.537k/s
intersect_u16_sve<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1        6966 ns         6963 ns      2013422 pairs=143.624k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1        7119 ns         6517 ns      2143784 pairs=153.437k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1        6978 ns         6522 ns      2146331 pairs=153.331k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1       6721 ns         6533 ns      2141325 pairs=153.067k/s
intersect_u16_sve<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1       7046 ns         6675 ns      2095016 pairs=149.823k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1       10819 ns        10722 ns      1307796 pairs=93.2695k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1       11295 ns        10729 ns      1305575 pairs=93.2031k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1      10596 ns        10596 ns      1317798 pairs=94.3769k/s
intersect_u16_sve<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1      10527 ns        10486 ns      1337148 pairs=95.3626k/s

New NEON Implementation

--------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1            195 ns          195 ns     72473251 pairs=5.12346M/s
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1            193 ns          193 ns     71826322 pairs=5.17983M/s
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1           181 ns          181 ns     76859132 pairs=5.51211M/s
intersect_u16_neon<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1          161 ns          161 ns     86301671 pairs=6.22906M/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1          1199 ns         1027 ns     13866808 pairs=973.295k/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1          1171 ns         1034 ns     13729254 pairs=966.886k/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1         1120 ns         1038 ns     13671085 pairs=963.804k/s
intersect_u16_neon<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1        1150 ns         1051 ns     13070692 pairs=951.238k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1          2587 ns         2446 ns      5685615 pairs=408.885k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1          2595 ns         2490 ns      5538880 pairs=401.615k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1         2482 ns         2460 ns      5704185 pairs=406.459k/s
intersect_u16_neon<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1        2512 ns         2512 ns      5592948 pairs=398.064k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1        1599 ns         1573 ns      8893290 pairs=635.781k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1        1570 ns         1570 ns      8950291 pairs=637.098k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1       1488 ns         1488 ns      9449103 pairs=672.121k/s
intersect_u16_neon<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1       1332 ns         1332 ns     10582682 pairs=751.007k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1        8997 ns         8997 ns      1556944 pairs=111.144k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1        8999 ns         8999 ns      1554324 pairs=111.128k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1       9126 ns         9070 ns      1543769 pairs=110.257k/s
intersect_u16_neon<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1       9089 ns         9089 ns      1536462 pairs=110.029k/s

New SVE2 Implementation

--------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1            179 ns          178 ns     77997900 pairs=5.60245M/s
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1            179 ns          179 ns     77959137 pairs=5.59776M/s
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1           170 ns          170 ns     82829421 pairs=5.88598M/s
intersect_u16_sve2<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1          143 ns          143 ns     97771708 pairs=6.9995M/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1           900 ns          900 ns     15430306 pairs=1.11111M/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1           909 ns          909 ns     15374525 pairs=1099.58k/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1          922 ns          922 ns     15025863 pairs=1085.12k/s
intersect_u16_sve2<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1         932 ns          932 ns     15083373 pairs=1072.6k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1          2135 ns         2135 ns      6460842 pairs=468.333k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1          2118 ns         2118 ns      6509484 pairs=472.238k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1         2138 ns         2138 ns      6468742 pairs=467.706k/s
intersect_u16_sve2<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1        2136 ns         2136 ns      6419653 pairs=468.097k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1        1502 ns         1502 ns      9329372 pairs=665.698k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1        1492 ns         1492 ns      9375601 pairs=670.246k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1       1416 ns         1416 ns      9859829 pairs=706.16k/s
intersect_u16_sve2<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1       1274 ns         1274 ns     11052636 pairs=785.05k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1        9148 ns         9148 ns      1528714 pairs=109.319k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1        9150 ns         9150 ns      1529679 pairs=109.287k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1       9148 ns         9147 ns      1527762 pairs=109.32k/s
intersect_u16_sve2<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1       9135 ns         9135 ns      1529316 pairs=109.473k/s