ashvardanian closed this 2 months ago
@MarkReedZ, what kind of timing are you getting on this? The permutex
extensions are quite expensive. I've just got a 2x improvement by avoiding the inserts and casts at the end.
main-dev
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_haswell_128d/min_time:10.000/threads:1 15.2 ns 15.2 ns 890417569 abs_delta=41.5895n bytes=33.6296G/s pairs=65.6828M/s relative_error=20.6195n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1 16.3 ns 16.3 ns 867745590 abs_delta=7.74925m bytes=31.3522G/s pairs=61.2348M/s relative_error=3.87658m
l2sq_bf16_serial_128d/min_time:10.000/threads:1 599 ns 599 ns 23382373 abs_delta=489.092n bytes=855.039M/s pairs=1.67M/s relative_error=244.952n
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_haswell_128d/min_time:10.000/threads:1 14.7 ns 14.7 ns 952634662 abs_delta=37.4399n bytes=34.7926G/s pairs=67.9544M/s relative_error=18.9709n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1 8.45 ns 8.45 ns 1000000000 abs_delta=7.70743m bytes=60.5856G/s pairs=118.331M/s relative_error=3.85599m
l2sq_bf16_serial_128d/min_time:10.000/threads:1 599 ns 599 ns 23376884 abs_delta=471.586n bytes=854.885M/s pairs=1.6697M/s relative_error=236.592n
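As a sanity check on the counters above, the `bytes` column appears to be `pairs` multiplied by the bytes read per distance computation: two 128-dimensional bf16 vectors, so 2 × 128 × 2 = 512 bytes. That relationship is inferred from the numbers, not from the benchmark source, and the helper names below are hypothetical. A minimal sketch:

```c
#include <stddef.h>

/* Bytes touched per distance computation, assuming both input vectors
 * are read exactly once (an assumption inferred from the counters). */
static size_t bytes_per_pair(size_t dims, size_t bytes_per_scalar) {
    return 2 * dims * bytes_per_scalar;
}

/* Implied memory throughput in bytes/s for a given pairs/s rate. */
static double bytes_per_second(double pairs_per_second, size_t dims,
                               size_t bytes_per_scalar) {
    return pairs_per_second * (double)bytes_per_pair(dims, bytes_per_scalar);
}
```

For the faster genoa run, `bytes_per_second(118.331e6, 128, 2)` gives roughly 60.59e9, in line with the reported `bytes=60.5856G/s`. The time ratio 16.3 ns / 8.45 ns is about 1.93, which matches the "2x improvement" mentioned at the top.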
The permutex version was 9.2 ns. LOL, I spent half an hour trying to do it your way and completely borked the blend, missing how simple this whole thing was.
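// Move the upper 16 bits of every even f32 into the lower half of its 32-bit lane.
d_f32_even.ivec = _mm512_srli_epi32(d_f32_even.ivec, 16);
// Mask 0x55555555: even 16-bit elements come from d_f32_even, odd ones from d_f32_odd.
d.ivec = _mm512_mask_blend_epi16(0x55555555, d_f32_odd.ivec, d_f32_even.ivec);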
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_genoa_128d/min_time:10.000/threads:1 9.20 ns 9.19 ns 1000000000 abs_delta=7.80224m
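For readers following the shift-and-blend above, here is a portable scalar model of what happens inside a single 32-bit lane. It is only a sketch: the function names are hypothetical, and it assumes the goal is to pack the truncated (not rounded) upper halves of two f32 values into one pair of bf16 values, which is what the two intrinsics do bit-wise.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret an f32 as its raw 32-bit pattern. */
static uint32_t f32_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Scalar model of one 32-bit lane of the shift-and-blend:
 * the even value's upper 16 bits land in the low half (the srli by 16),
 * the odd value's upper 16 bits stay in the high half (the blend). */
static uint32_t pack_bf16_pair(float even, float odd) {
    uint32_t even_shifted = f32_bits(even) >> 16; /* _mm512_srli_epi32(.., 16) */
    uint32_t odd_bits = f32_bits(odd);
    /* Mask 0x55555555 picks even 16-bit elements from the shifted register
     * and odd 16-bit elements from the unshifted one. */
    return (odd_bits & 0xFFFF0000u) | (even_shifted & 0x0000FFFFu);
}
```

For example, `pack_bf16_pair(1.0f, 2.0f)` yields `0x40003F80`: `0x3F80` is bf16 for 1.0 in the low half and `0x4000` is bf16 for 2.0 in the high half. Because the lower 16 mantissa bits are simply dropped, this is truncation rather than round-to-nearest.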
Relates to #160