Faster `substract_bf16x32_genoa` mixed-precision subtraction

ashvardanian commented 2 months ago

Relates to #160

ashvardanian commented 2 months ago

@MarkReedZ, what kind of timings are you getting with this? The permutex instructions are quite expensive. I've just got a roughly 2x improvement on `l2sq_bf16_genoa_128d` by avoiding the inserts and casts at the end.
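For readers new to the kernel, here is a scalar sketch of what a mixed-precision bf16 subtraction does per element: widen both inputs to f32, subtract in f32, then narrow the result back to bf16 by truncation. It is only an illustration under that assumption, not the library's code; the names `bf16_bits_t`, `bf16_to_f32`, `f32_to_bf16_truncate`, and `subtract_bf16_scalar` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_bits_t; /* hypothetical alias: raw bf16 bit pattern */

/* Widen bf16 to f32: a bf16 value is the upper 16 bits of an IEEE-754 binary32. */
static float bf16_to_f32(bf16_bits_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow f32 to bf16 by truncation: drop the low 16 mantissa bits, no rounding. */
static bf16_bits_t f32_to_bf16_truncate(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16_bits_t)(bits >> 16);
}

/* Mixed-precision subtract: bf16 inputs, f32 arithmetic, bf16 result. */
static bf16_bits_t subtract_bf16_scalar(bf16_bits_t a, bf16_bits_t b) {
    return f32_to_bf16_truncate(bf16_to_f32(a) - bf16_to_f32(b));
}
```

For example, `subtract_bf16_scalar(0x3F80, 0x3F00)` (1.0 minus 0.5) yields `0x3F00` (0.5). The vectorized kernel does the same work for 32 elements at once, so the cost of repacking the f32 results into bf16 lanes matters.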

Old Implementation on `main-dev`

    -----------------------------------------------------------------------------------------------------------
    Benchmark                                                 Time             CPU   Iterations UserCounters...
    -----------------------------------------------------------------------------------------------------------
    l2sq_bf16_haswell_128d/min_time:10.000/threads:1       15.2 ns         15.2 ns    890417569 abs_delta=41.5895n bytes=33.6296G/s pairs=65.6828M/s relative_error=20.6195n
    l2sq_bf16_genoa_128d/min_time:10.000/threads:1         16.3 ns         16.3 ns    867745590 abs_delta=7.74925m bytes=31.3522G/s pairs=61.2348M/s relative_error=3.87658m
    l2sq_bf16_serial_128d/min_time:10.000/threads:1         599 ns          599 ns     23382373 abs_delta=489.092n bytes=855.039M/s pairs=1.67M/s relative_error=244.952n

New Implementation

    -----------------------------------------------------------------------------------------------------------
    Benchmark                                                 Time             CPU   Iterations UserCounters...
    -----------------------------------------------------------------------------------------------------------
    l2sq_bf16_haswell_128d/min_time:10.000/threads:1       14.7 ns         14.7 ns    952634662 abs_delta=37.4399n bytes=34.7926G/s pairs=67.9544M/s relative_error=18.9709n
    l2sq_bf16_genoa_128d/min_time:10.000/threads:1         8.45 ns         8.45 ns   1000000000 abs_delta=7.70743m bytes=60.5856G/s pairs=118.331M/s relative_error=3.85599m
    l2sq_bf16_serial_128d/min_time:10.000/threads:1         599 ns          599 ns     23376884 abs_delta=471.586n bytes=854.885M/s pairs=1.6697M/s relative_error=236.592n
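
The arithmetic behind the "2x" can be checked from the two `l2sq_bf16_genoa_128d` rows. The sketch below assumes each pair reads two 128-dimensional bf16 vectors at 2 bytes per value, which is consistent with the reported `bytes` and `pairs` counters; the helper names are hypothetical.

```c
#include <assert.h>

/* Speedup of a new timing over an old one, both in the same unit. */
static double speedup(double old_ns, double new_ns) {
    assert(old_ns > 0.0 && new_ns > 0.0);
    return old_ns / new_ns;
}

/* Bytes read per pair: two vectors of `dims` bf16 values, 2 bytes each (assumption). */
static double bytes_per_pair(int dims) {
    return 2.0 * (double)dims * 2.0;
}

/* Bandwidth in bytes per second implied by a pair throughput. */
static double bandwidth(double pairs_per_s, int dims) {
    return pairs_per_s * bytes_per_pair(dims);
}
```

With the numbers above, `speedup(16.3, 8.45)` is about 1.93, and `bandwidth(118.331e6, 128)` is about 60.59e9 bytes/s, in line with the reported `bytes=60.5856G/s`.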

MarkReedZ commented 2 months ago

The permutex version was 9.2 ns. LOL, I spent half an hour trying to do it your way and completely borked the blend, missing how simple this whole thing was.

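    // Move the upper 16 bits of each "even" f32 difference into the lower half of its 32-bit lane.
    d_f32_even.ivec = _mm512_srli_epi32(d_f32_even.ivec, 16);
    // Even 16-bit lanes come from the shifted vector, odd 16-bit lanes from `d_f32_odd`.
    d.ivec = _mm512_mask_blend_epi16(0x55555555, d_f32_odd.ivec, d_f32_even.ivec);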
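
To make the lane bookkeeping explicit, here is a portable scalar model of the shift and blend above, written lane by lane. It assumes `d_f32_even` and `d_f32_odd` each hold 16 f32 bit patterns and that mask bit `k` of `_mm512_mask_blend_epi16` selects 16-bit lane `k` from the second source operand; `lane16` and `pack_bf16x32_model` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit lane k (0..31) from the 32-bit lane that contains it. */
static uint16_t lane16(uint32_t lane32, int k) {
    return (uint16_t)((k & 1) ? (lane32 >> 16) : (lane32 & 0xFFFFu));
}

/* Scalar model of: srli_epi32(even, 16), then mask_blend_epi16(0x55555555, odd, even). */
static void pack_bf16x32_model(const uint32_t even_f32[16], const uint32_t odd_f32[16],
                               uint16_t out[32]) {
    const uint32_t mask = 0x55555555u; /* bits set at even 16-bit lane indices */
    for (int k = 0; k < 32; ++k) {
        uint32_t even_shifted = even_f32[k / 2] >> 16; /* models _mm512_srli_epi32 */
        uint16_t from_even = lane16(even_shifted, k);
        uint16_t from_odd = lane16(odd_f32[k / 2], k);
        /* Mask bit set: take the lane from the shifted "even" vector, else from "odd". */
        out[k] = ((mask >> k) & 1u) ? from_even : from_odd;
    }
}
```

In this model `out[2 * i]` is the upper half of `even_f32[i]` and `out[2 * i + 1]` is the upper half of `odd_f32[i]`, so all 32 truncated bf16 results land in place with one shift and one blend, without a permute.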

    -----------------------------------------------------------------------------------------------------------
    Benchmark                                                 Time             CPU   Iterations UserCounters...
    -----------------------------------------------------------------------------------------------------------
    l2sq_bf16_genoa_128d/min_time:10.000/threads:1       9.20 ns         9.19 ns   1000000000 abs_delta=7.80224m

ashvardanian / SimSIMD

Faster `substract_bf16x32_genoa` mixed-precision subtraction #161
