ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
988 stars 59 forks source link

Using svhistcnt_u32_z for intersect_u32_sve2 #183

Closed MarkReedZ closed 2 months ago

MarkReedZ commented 2 months ago

Describe what you are looking for

To see if intersect_u32_sve2 can be improved with svhistcnt

Can you contribute to the implementation?

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

Code of Conduct

MarkReedZ commented 2 months ago

On main-dev-intersect we proposed that svhistcnt on a,b and a,reversed(b) + accumulation may work. Will it? Not sure. I believe hist counts matches at a[i] from b[0:i]

vec1::     [3, 3, 3, 6]
vec2::     [1, 2, 3, 5]
rvec2::    [5, 3, 2, 1]
hist::     [0, 0, 1, 0]
hist_rev:: [0, 1, 1, 0]

vec1:     [1, 2, 3, 4]
vec2:     [0, 3, 3, 4]
rvec2:    [4, 3, 3, 0]
hist:     [0, 0, 2, 1]
hist_rev: [0, 0, 2, 1]

Initial code try. Its 15% faster.

        svuint32_t histcnt_upper = svhistcnt_u32_z(a_progress, a_vec, b_vec);
        svuint32_t histcnt_lower = svhistcnt_u32_z(a_progress, a_vec, svrev_u32(b_vec));
        svuint32_t hist = svorr_u32_x(a_progress, histcnt_upper, histcnt_lower);
        svbool_t not_zero = svcmpne_n_u32(a_progress, hist, 0);
        simsimd_size_t equal_count = svcntp_b32(a_progress, not_zero);
ashvardanian commented 2 months ago

@MarkReedZ, there are 2 more reversal needed to make it work.

Sadly HISTCNT solution isn't faster than NEON. Moreover, the u32_neon<|A|=128,|B|=8192 variants are surprisingly 50% faster than u32_sve2<|A|=128,|B|=8192 variants. I'm merging the new version and hope we see wider SVE registers in future Arm CPU.

Running build_release/simsimd_bench
Run on (4 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x4)
  L1 Instruction 64 KiB (x4)
  L2 Unified 2048 KiB (x4)
  L3 Unified 36864 KiB (x1)
Load Average: 0.99, 0.83, 0.53
------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1                347 ns          347 ns     40697958 bytes=2.95198G/s error=0 matches=1 pairs=2.8828M/s
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1                349 ns          349 ns     40236329 bytes=2.9333G/s error=0 matches=6 pairs=2.86455M/s
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1               342 ns          342 ns     40748411 bytes=2.9898G/s error=0 matches=64 pairs=2.91972M/s
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1              298 ns          298 ns     46791238 bytes=3.43996G/s error=0 matches=121 pairs=3.35933M/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1              1362 ns         1362 ns     10136110 bytes=3.38332G/s error=0 matches=1 pairs=734.227k/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1              1362 ns         1362 ns      9905936 bytes=3.38259G/s error=0 matches=6 pairs=734.07k/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1             1438 ns         1438 ns      9581767 bytes=3.20387G/s error=0 matches=64 pairs=695.283k/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1            1506 ns         1506 ns      9142785 bytes=3.06057G/s error=0 matches=121 pairs=664.186k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1              2620 ns         2620 ns      5243539 bytes=12.7033G/s error=0 matches=1 pairs=381.711k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1              2588 ns         2588 ns      5254416 bytes=12.8586G/s error=0 matches=6 pairs=386.375k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1             2649 ns         2649 ns      5279233 bytes=12.5632G/s error=0 matches=64 pairs=377.5k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1            2662 ns         2662 ns      5302739 bytes=12.5016G/s error=0 matches=121 pairs=375.649k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1            3001 ns         3001 ns      4635267 bytes=2.72975G/s error=0 matches=10 pairs=333.221k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1            2994 ns         2994 ns      4653077 bytes=2.73625G/s error=0 matches=51 pairs=334.014k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1           2885 ns         2885 ns      4857437 bytes=2.83903G/s error=0 matches=512 pairs=346.561k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1           2445 ns         2445 ns      5727964 bytes=3.35091G/s error=0 matches=972 pairs=409.047k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1           12045 ns        12045 ns      1154950 bytes=3.06049G/s error=0 matches=10 pairs=83.021k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1           12080 ns        12080 ns      1146339 bytes=3.05164G/s error=0 matches=51 pairs=82.7811k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1          12567 ns        12567 ns      1110004 bytes=2.93331G/s error=0 matches=512 pairs=79.571k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1          13068 ns        13068 ns      1067483 bytes=2.82091G/s error=0 matches=972 pairs=76.5222k/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1                347 ns          347 ns     40622319 bytes=2.95225G/s error=0 matches=1 pairs=2.88305M/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1                343 ns          343 ns     40350774 bytes=2.9851G/s error=0 matches=6 pairs=2.91514M/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1               340 ns          340 ns     40523956 bytes=3.01281G/s error=0 matches=64 pairs=2.94219M/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1              291 ns          291 ns     48223598 bytes=3.51391G/s error=0 matches=121 pairs=3.43156M/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1              1312 ns         1312 ns     10433130 bytes=3.51351G/s error=0 matches=1 pairs=762.48k/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1              1319 ns         1319 ns     10470548 bytes=3.49244G/s error=0 matches=6 pairs=757.908k/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1             1363 ns         1363 ns      9973908 bytes=3.38024G/s error=0 matches=64 pairs=733.56k/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1            1421 ns         1421 ns      9429782 bytes=3.24226G/s error=0 matches=121 pairs=703.616k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1              3972 ns         3972 ns      3517901 bytes=8.37959G/s error=0 matches=1 pairs=251.791k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1              3967 ns         3967 ns      3512859 bytes=8.39006G/s error=0 matches=6 pairs=252.105k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1             3982 ns         3982 ns      3495200 bytes=8.35755G/s error=0 matches=64 pairs=251.128k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1            4011 ns         4011 ns      3491284 bytes=8.29701G/s error=0 matches=121 pairs=249.309k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1            2907 ns         2907 ns      4744735 bytes=2.81786G/s error=0 matches=10 pairs=343.977k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1            2930 ns         2930 ns      4768146 bytes=2.79595G/s error=0 matches=51 pairs=341.303k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1           2770 ns         2770 ns      5049316 bytes=2.95749G/s error=0 matches=512 pairs=361.022k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1           2361 ns         2361 ns      5933107 bytes=3.46948G/s error=0 matches=972 pairs=423.521k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1           12742 ns        12742 ns      1095276 bytes=2.89313G/s error=0 matches=10 pairs=78.4813k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1           12838 ns        12838 ns      1094396 bytes=2.87149G/s error=0 matches=51 pairs=77.894k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1          13336 ns        13336 ns      1052936 bytes=2.76422G/s error=0 matches=512 pairs=74.9842k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1          13818 ns        13818 ns      1012086 bytes=2.66786G/s error=0 matches=972 pairs=72.3702k/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1              607 ns          607 ns     23280255 bytes=1.68717G/s error=0 matches=1 pairs=1.64763M/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1              604 ns          604 ns     23213860 bytes=1.69627G/s error=0 matches=6 pairs=1.65651M/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1             615 ns          615 ns     22758218 bytes=1.66488G/s error=0 matches=64 pairs=1.62586M/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1            664 ns          664 ns     21163701 bytes=1.54276G/s error=0 matches=121 pairs=1.5066M/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1            2586 ns         2586 ns      5406793 bytes=1.78181G/s error=0 matches=1 pairs=386.678k/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1            2576 ns         2576 ns      5413131 bytes=1.78908G/s error=0 matches=6 pairs=388.256k/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1           2614 ns         2572 ns      5430697 bytes=1.79153G/s error=0 matches=64 pairs=388.787k/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1          2589 ns         2578 ns      5420325 bytes=1.78724G/s error=0 matches=121 pairs=387.856k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1            4370 ns         4370 ns      3210747 bytes=7.6164G/s error=0 matches=1 pairs=228.858k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1            4344 ns         4344 ns      3184249 bytes=7.66047G/s error=0 matches=6 pairs=230.182k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1           4361 ns         4361 ns      3211661 bytes=7.63195G/s error=0 matches=64 pairs=229.326k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1          4370 ns         4370 ns      3174111 bytes=7.61502G/s error=0 matches=121 pairs=228.817k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1          4562 ns         4562 ns      3067578 bytes=1.79582G/s error=0 matches=10 pairs=219.217k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1          4550 ns         4550 ns      3068627 bytes=1.80043G/s error=0 matches=51 pairs=219.78k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1         4558 ns         4558 ns      3076772 bytes=1.79728G/s error=0 matches=512 pairs=219.395k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1         4490 ns         4491 ns      3117760 bytes=1.82428G/s error=0 matches=972 pairs=222.69k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1         20310 ns        20311 ns       689896 bytes=1.81501G/s error=0 matches=10 pairs=49.2352k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1         20276 ns        20276 ns       689898 bytes=1.81808G/s error=0 matches=51 pairs=49.3186k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1        20289 ns        20289 ns       689888 bytes=1.81692G/s error=0 matches=512 pairs=49.2872k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1        20311 ns        20312 ns       689865 bytes=1.81493G/s error=0 matches=972 pairs=49.233k/s