ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
988 stars 59 forks source link

2x Faster `bf16` Euclidean Distance on AMD Genoa #180

Closed ashvardanian closed 2 months ago

ashvardanian commented 2 months ago

Performance Improvements

The older bf16 Euclidean Distance implementation had an inefficient implementation for mixed-precision vector subtractions. The new one is very similar, but avoids a couple of serial operations and doubles the throughput:

SIMSIMD_INTERNAL __m512i simsimd_substract_bf16x32_genoa(__m512i a_i16, __m512i b_i16) {
    union {
        __m512 fvec;
        __m512i ivec;
        simsimd_f32_t f32[16];
        simsimd_u16_t u16[32];
        simsimd_bf16_t bf16[32];
    } d_odd, d_even, d, a_f32_even, b_f32_even, d_f32_even, a_f32_odd, b_f32_odd, d_f32_odd, a, b;
    a.ivec = a_i16;
    b.ivec = b_i16;
    a_f32_odd.ivec = _mm512_and_si512(a_i16, _mm512_set1_epi32(0xFFFF0000));
    a_f32_even.ivec = _mm512_slli_epi32(a_i16, 16);
    b_f32_odd.ivec = _mm512_and_si512(b_i16, _mm512_set1_epi32(0xFFFF0000));
    b_f32_even.ivec = _mm512_slli_epi32(b_i16, 16);
    d_f32_odd.fvec = _mm512_sub_ps(a_f32_odd.fvec, b_f32_odd.fvec);
    d_f32_even.fvec = _mm512_sub_ps(a_f32_even.fvec, b_f32_even.fvec);
    d_f32_even.ivec = _mm512_srli_epi32(d_f32_even.ivec, 16);
    d.ivec = _mm512_mask_blend_epi16(0x55555555, d_f32_odd.ivec, d_f32_even.ivec);
    return d.ivec;
}

Old Implementation

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_haswell_128d/min_time:10.000/threads:1       15.2 ns         15.2 ns    890417569 abs_delta=41.5895n bytes=33.6296G/s pairs=65.6828M/s relative_error=20.6195n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1         16.3 ns         16.3 ns    867745590 abs_delta=7.74925m bytes=31.3522G/s pairs=61.2348M/s relative_error=3.87658m
l2sq_bf16_serial_128d/min_time:10.000/threads:1         599 ns          599 ns     23382373 abs_delta=489.092n bytes=855.039M/s pairs=1.67M/s relative_error=244.952n

New Implementation

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
l2sq_bf16_haswell_128d/min_time:10.000/threads:1       14.7 ns         14.7 ns    952634662 abs_delta=37.4399n bytes=34.7926G/s pairs=67.9544M/s relative_error=18.9709n
l2sq_bf16_genoa_128d/min_time:10.000/threads:1         8.45 ns         8.45 ns   1000000000 abs_delta=7.70743m bytes=60.5856G/s pairs=118.331M/s relative_error=3.85599m
l2sq_bf16_serial_128d/min_time:10.000/threads:1         599 ns          599 ns     23376884 abs_delta=471.586n bytes=854.885M/s pairs=1.6697M/s relative_error=236.592n

Other Changes