ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Faster `substract_bf16x32_genoa` mixed-precision subtraction #161

Closed ashvardanian closed 2 months ago

ashvardanian commented 2 months ago

Relates to #160

ashvardanian commented 2 months ago

@MarkReedZ, what kind of timing are you getting with this? The permutex extensions are quite expensive. I just got a roughly 2x improvement on Genoa (16.3 ns down to 8.45 ns) by avoiding the inserts and casts at the end.

Old Implementation on main-dev

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_haswell_128d/min_time:10.000/threads:1       15.2 ns         15.2 ns    890417569 abs_delta=41.5895n bytes=33.6296G/s pairs=65.6828M/s relative_error=20.6195n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1         16.3 ns         16.3 ns    867745590 abs_delta=7.74925m bytes=31.3522G/s pairs=61.2348M/s relative_error=3.87658m
l2sq_bf16_serial_128d/min_time:10.000/threads:1         599 ns          599 ns     23382373 abs_delta=489.092n bytes=855.039M/s pairs=1.67M/s relative_error=244.952n

New Implementation

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_haswell_128d/min_time:10.000/threads:1       14.7 ns         14.7 ns    952634662 abs_delta=37.4399n bytes=34.7926G/s pairs=67.9544M/s relative_error=18.9709n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1         8.45 ns         8.45 ns   1000000000 abs_delta=7.70743m bytes=60.5856G/s pairs=118.331M/s relative_error=3.85599m
l2sq_bf16_serial_128d/min_time:10.000/threads:1         599 ns          599 ns     23376884 abs_delta=471.586n bytes=854.885M/s pairs=1.6697M/s relative_error=236.592n

MarkReedZ commented 2 months ago

The permutex version ran at 9.2 ns. LOL, I spent half an hour trying to do it your way and completely borked the blend, missing how simple this whole thing was.

    d_f32_even.ivec = _mm512_srli_epi32(d_f32_even.ivec, 16);
    d.ivec = _mm512_mask_blend_epi16(0x55555555, d_f32_odd.ivec, d_f32_even.ivec);
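
For readers less familiar with the trick, here is a rough scalar sketch of what the shift and blend above appear to do within a single 32-bit lane. It assumes each lane packs two bf16 values (even element in the low 16 bits, odd element in the high 16 bits) and that narrowing back to bf16 is done by plain truncation, as the snippet suggests. The helper names (`bits_to_f32`, `f32_to_bits`, `sub_bf16x2_lane`) are hypothetical and not part of SimSIMD.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret 32 raw bits as an f32, and back, without aliasing issues. */
static float bits_to_f32(uint32_t bits) { float f; memcpy(&f, &bits, sizeof f); return f; }
static uint32_t f32_to_bits(float f) { uint32_t bits; memcpy(&bits, &f, sizeof bits); return bits; }

/* Hypothetical helper: subtracts two packed pairs of bf16 values.
 * Assumed layout: even element in bits 0..15, odd element in bits 16..31. */
static uint32_t sub_bf16x2_lane(uint32_t a, uint32_t b) {
    /* Widen: a bf16 is the top half of an f32, so even elements are
     * shifted up by 16 bits and odd elements are simply masked. */
    float a_even = bits_to_f32(a << 16), b_even = bits_to_f32(b << 16);
    float a_odd = bits_to_f32(a & 0xFFFF0000u), b_odd = bits_to_f32(b & 0xFFFF0000u);

    /* Subtract in full f32 precision. */
    uint32_t d_even = f32_to_bits(a_even - b_even);
    uint32_t d_odd = f32_to_bits(a_odd - b_odd);

    /* Narrow by truncation: move the even result into the low half
     * (the role of the 32-bit right shift by 16 in the snippet). */
    d_even >>= 16;

    /* Blend: low 16 bits from the even result, high 16 bits from the odd
     * result (the role of the 16-bit blend with mask 0x55555555). */
    return (d_odd & 0xFFFF0000u) | (d_even & 0x0000FFFFu);
}
```

With `_mm512_mask_blend_epi16(k, a, b)`, a set mask bit picks the 16-bit element from `b`, so `0x55555555` takes every even 16-bit element from the shifted even result and leaves the odd ones from `d_f32_odd`. No permute is needed because the odd results already sit in the right half of each lane.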

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_genoa_128d/min_time:10.000/threads:1       9.20 ns         9.19 ns   1000000000 abs_delta=7.80224m