Closed MarkReedZ closed 2 months ago
On main-dev-intersect we proposed that svhistcnt on a,b and a,reversed(b) + accumulation may work. Will it? Not sure. I believe hist counts matches at a[i] from b[0:i]
vec1:: [3, 3, 3, 6]
vec2:: [1, 2, 3, 5]
rvec2:: [5, 3, 2, 1]
hist:: [0, 0, 1, 0]
hist_rev:: [0, 1, 1, 0]
vec1: [1, 2, 3, 4]
vec2: [0, 3, 3, 4]
rvec2: [4, 3, 3, 0]
hist: [0, 0, 2, 1]
hist_rev: [0, 0, 2, 1]
Initial code try. Its 15% faster.
svuint32_t histcnt_upper = svhistcnt_u32_z(a_progress, a_vec, b_vec);
svuint32_t histcnt_lower = svhistcnt_u32_z(a_progress, a_vec, svrev_u32(b_vec));
svuint32_t hist = svorr_u32_x(a_progress, histcnt_upper, histcnt_lower);
svbool_t not_zero = svcmpne_n_u32(a_progress, hist, 0);
simsimd_size_t equal_count = svcntp_b32(a_progress, not_zero);
@MarkReedZ, there are 2 more reversal needed to make it work.
Sadly HISTCNT
solution isn't faster than NEON. Moreover, the u32_neon<|A|=128,|B|=8192
variants are surprisingly 50% faster than u32_sve2<|A|=128,|B|=8192
variants. I'm merging the new version and hope we see wider SVE registers in future Arm CPU.
Running build_release/simsimd_bench
Run on (4 X 2000 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x4)
L1 Instruction 64 KiB (x4)
L2 Unified 2048 KiB (x4)
L3 Unified 36864 KiB (x1)
Load Average: 0.99, 0.83, 0.53
------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 347 ns 347 ns 40697958 bytes=2.95198G/s error=0 matches=1 pairs=2.8828M/s
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 349 ns 349 ns 40236329 bytes=2.9333G/s error=0 matches=6 pairs=2.86455M/s
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 342 ns 342 ns 40748411 bytes=2.9898G/s error=0 matches=64 pairs=2.91972M/s
intersect_u32_neon<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 298 ns 298 ns 46791238 bytes=3.43996G/s error=0 matches=121 pairs=3.35933M/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 1362 ns 1362 ns 10136110 bytes=3.38332G/s error=0 matches=1 pairs=734.227k/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 1362 ns 1362 ns 9905936 bytes=3.38259G/s error=0 matches=6 pairs=734.07k/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 1438 ns 1438 ns 9581767 bytes=3.20387G/s error=0 matches=64 pairs=695.283k/s
intersect_u32_neon<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 1506 ns 1506 ns 9142785 bytes=3.06057G/s error=0 matches=121 pairs=664.186k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 2620 ns 2620 ns 5243539 bytes=12.7033G/s error=0 matches=1 pairs=381.711k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 2588 ns 2588 ns 5254416 bytes=12.8586G/s error=0 matches=6 pairs=386.375k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 2649 ns 2649 ns 5279233 bytes=12.5632G/s error=0 matches=64 pairs=377.5k/s
intersect_u32_neon<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 2662 ns 2662 ns 5302739 bytes=12.5016G/s error=0 matches=121 pairs=375.649k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 3001 ns 3001 ns 4635267 bytes=2.72975G/s error=0 matches=10 pairs=333.221k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 2994 ns 2994 ns 4653077 bytes=2.73625G/s error=0 matches=51 pairs=334.014k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 2885 ns 2885 ns 4857437 bytes=2.83903G/s error=0 matches=512 pairs=346.561k/s
intersect_u32_neon<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 2445 ns 2445 ns 5727964 bytes=3.35091G/s error=0 matches=972 pairs=409.047k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 12045 ns 12045 ns 1154950 bytes=3.06049G/s error=0 matches=10 pairs=83.021k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 12080 ns 12080 ns 1146339 bytes=3.05164G/s error=0 matches=51 pairs=82.7811k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 12567 ns 12567 ns 1110004 bytes=2.93331G/s error=0 matches=512 pairs=79.571k/s
intersect_u32_neon<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 13068 ns 13068 ns 1067483 bytes=2.82091G/s error=0 matches=972 pairs=76.5222k/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 347 ns 347 ns 40622319 bytes=2.95225G/s error=0 matches=1 pairs=2.88305M/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 343 ns 343 ns 40350774 bytes=2.9851G/s error=0 matches=6 pairs=2.91514M/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 340 ns 340 ns 40523956 bytes=3.01281G/s error=0 matches=64 pairs=2.94219M/s
intersect_u32_sve2<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 291 ns 291 ns 48223598 bytes=3.51391G/s error=0 matches=121 pairs=3.43156M/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 1312 ns 1312 ns 10433130 bytes=3.51351G/s error=0 matches=1 pairs=762.48k/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 1319 ns 1319 ns 10470548 bytes=3.49244G/s error=0 matches=6 pairs=757.908k/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 1363 ns 1363 ns 9973908 bytes=3.38024G/s error=0 matches=64 pairs=733.56k/s
intersect_u32_sve2<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 1421 ns 1421 ns 9429782 bytes=3.24226G/s error=0 matches=121 pairs=703.616k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 3972 ns 3972 ns 3517901 bytes=8.37959G/s error=0 matches=1 pairs=251.791k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 3967 ns 3967 ns 3512859 bytes=8.39006G/s error=0 matches=6 pairs=252.105k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 3982 ns 3982 ns 3495200 bytes=8.35755G/s error=0 matches=64 pairs=251.128k/s
intersect_u32_sve2<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 4011 ns 4011 ns 3491284 bytes=8.29701G/s error=0 matches=121 pairs=249.309k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 2907 ns 2907 ns 4744735 bytes=2.81786G/s error=0 matches=10 pairs=343.977k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 2930 ns 2930 ns 4768146 bytes=2.79595G/s error=0 matches=51 pairs=341.303k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 2770 ns 2770 ns 5049316 bytes=2.95749G/s error=0 matches=512 pairs=361.022k/s
intersect_u32_sve2<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 2361 ns 2361 ns 5933107 bytes=3.46948G/s error=0 matches=972 pairs=423.521k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 12742 ns 12742 ns 1095276 bytes=2.89313G/s error=0 matches=10 pairs=78.4813k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 12838 ns 12838 ns 1094396 bytes=2.87149G/s error=0 matches=51 pairs=77.894k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 13336 ns 13336 ns 1052936 bytes=2.76422G/s error=0 matches=512 pairs=74.9842k/s
intersect_u32_sve2<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 13818 ns 13818 ns 1012086 bytes=2.66786G/s error=0 matches=972 pairs=72.3702k/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 607 ns 607 ns 23280255 bytes=1.68717G/s error=0 matches=1 pairs=1.64763M/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 604 ns 604 ns 23213860 bytes=1.69627G/s error=0 matches=6 pairs=1.65651M/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 615 ns 615 ns 22758218 bytes=1.66488G/s error=0 matches=64 pairs=1.62586M/s
intersect_u32_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 664 ns 664 ns 21163701 bytes=1.54276G/s error=0 matches=121 pairs=1.5066M/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 2586 ns 2586 ns 5406793 bytes=1.78181G/s error=0 matches=1 pairs=386.678k/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 2576 ns 2576 ns 5413131 bytes=1.78908G/s error=0 matches=6 pairs=388.256k/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 2614 ns 2572 ns 5430697 bytes=1.79153G/s error=0 matches=64 pairs=388.787k/s
intersect_u32_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 2589 ns 2578 ns 5420325 bytes=1.78724G/s error=0 matches=121 pairs=387.856k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 4370 ns 4370 ns 3210747 bytes=7.6164G/s error=0 matches=1 pairs=228.858k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 4344 ns 4344 ns 3184249 bytes=7.66047G/s error=0 matches=6 pairs=230.182k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 4361 ns 4361 ns 3211661 bytes=7.63195G/s error=0 matches=64 pairs=229.326k/s
intersect_u32_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 4370 ns 4370 ns 3174111 bytes=7.61502G/s error=0 matches=121 pairs=228.817k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 4562 ns 4562 ns 3067578 bytes=1.79582G/s error=0 matches=10 pairs=219.217k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 4550 ns 4550 ns 3068627 bytes=1.80043G/s error=0 matches=51 pairs=219.78k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 4558 ns 4558 ns 3076772 bytes=1.79728G/s error=0 matches=512 pairs=219.395k/s
intersect_u32_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 4490 ns 4491 ns 3117760 bytes=1.82428G/s error=0 matches=972 pairs=222.69k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 20310 ns 20311 ns 689896 bytes=1.81501G/s error=0 matches=10 pairs=49.2352k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 20276 ns 20276 ns 689898 bytes=1.81808G/s error=0 matches=51 pairs=49.3186k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 20289 ns 20289 ns 689888 bytes=1.81692G/s error=0 matches=512 pairs=49.2872k/s
intersect_u32_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 20311 ns 20312 ns 689865 bytes=1.81493G/s error=0 matches=972 pairs=49.233k/s
Describe what you are looking for
To see if intersect_u32_sve2 can be improved with svhistcnt
Can you contribute to the implementation?
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
Code of Conduct