ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
913 stars, 51 forks

Optimizing `substract_bf16x32_genoa` #160

Closed. ashvardanian closed this issue 2 weeks ago

ashvardanian commented 3 weeks ago

Can this be reduced to 2x subtractions, 2x shuffles, & 1 blend?

[image]

ashvardanian commented 3 weeks ago

@MarkReedZ, in case you look into this, it's better to merge into the linked feature branch ;)

MarkReedZ commented 3 weeks ago

Was this what you were thinking? It's several instructions shorter, but only 0.5% faster. I like the `_mm512_permutex2var_epi16`, but your original unpacking was more readable.

Godbolt: https://godbolt.org/z/aPYf55s81

    //  The following code expands a packed bf16 __m512i into two f32 vectors for the subtraction,
    //  then packs the results back into bf16.
    __m512i zero = _mm512_setzero_si512();
    __m512i idx_bot = _mm512_set_epi8(
        31, 30,  0,  0, 29, 28,  0,  0, 27, 26,  0,  0, 25, 24,  0,  0,
        23, 22,  0,  0, 21, 20,  0,  0, 19, 18,  0,  0, 17, 16,  0,  0,
        15, 14,  0,  0, 13, 12,  0,  0, 11, 10,  0,  0,  9,  8,  0,  0,
         7,  6,  0,  0,  5,  4,  0,  0,  3,  2,  0,  0,  1,  0,  0,  0
    );
    __m512i idx_top = _mm512_set_epi8(
        63, 62, 0, 0, 61, 60, 0, 0, 59, 58, 0, 0, 57, 56, 0, 0,
        55, 54, 0, 0, 53, 52, 0, 0, 51, 50, 0, 0, 49, 48, 0, 0,
        47, 46, 0, 0, 45, 44, 0, 0, 43, 42, 0, 0, 41, 40, 0, 0,
        39, 38, 0, 0, 37, 36, 0, 0, 35, 34, 0, 0, 33, 32, 0, 0
    );

    //  Mask 0xCC..CC takes bytes 2 and 3 of every 32-bit lane from the permuted input and
    //  zeroes bytes 0 and 1, so each bf16 value lands in the upper half of an f32 lane.
    __m512i a_top = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_top, a_i16));
    __m512i a_bot = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_bot, a_i16));

    __m512i b_top = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_top, b_i16));
    __m512i b_bot = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_bot, b_i16));

    //  Subtract in f32 precision.
    __m512 d_top = _mm512_sub_ps(_mm512_castsi512_ps(a_top), _mm512_castsi512_ps(b_top));
    __m512 d_bot = _mm512_sub_ps(_mm512_castsi512_ps(a_bot), _mm512_castsi512_ps(b_bot));

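    //  Gather the odd 16-bit words (the upper half of every f32) from both differences.
    //  Indices 0..31 address d_top and 32..63 address d_bot, so d_bot fills the low 16 outputs
    //  and d_top the high 16. Taking the upper half truncates the f32 result rather than rounding it.
    __m512i indices2 = _mm512_set_epi16(
        31, 29, 27, 25, 23, 21, 19, 17,
        15, 13, 11,  9,  7,  5,  3,  1,
        63, 61, 59, 57, 55, 53, 51, 49,
        47, 45, 43, 41, 39, 37, 35, 33
    );
    return _mm512_permutex2var_epi16(_mm512_castps_si512(d_top), indices2, _mm512_castps_si512(d_bot));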
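
For checking a kernel like this against known values, a scalar model of the same three steps (widen, subtract, narrow) may be handy. The sketch below is only an illustration: the function names `bf16_to_f32`, `f32_to_bf16_truncate`, and `subtract_bf16_reference` are hypothetical and not part of SimSIMD. It assumes the narrowing step is plain truncation, which is what selecting the upper 16-bit word of each f32 in the snippet above amounts to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one bf16 bit pattern to f32: the 16 bits become the upper half of the float. */
float bf16_to_f32(uint16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit cast without violating aliasing rules */
    return f;
}

/* Narrow f32 to bf16 by truncation: keep the upper 16 bits, drop the rest (no rounding). */
uint16_t f32_to_bf16_truncate(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical scalar reference: d[i] = a[i] - b[i], computed in f32 and truncated to bf16. */
void subtract_bf16_reference(const uint16_t *a, const uint16_t *b, uint16_t *d, size_t n) {
    assert(a != NULL && b != NULL && d != NULL);
    for (size_t i = 0; i < n; ++i)
        d[i] = f32_to_bf16_truncate(bf16_to_f32(a[i]) - bf16_to_f32(b[i]));
}
```

Feeding the same 32 inputs to this loop and to the vectorized version, then comparing the outputs bit for bit, is one way to confirm that a shorter shuffle sequence still places every element in the right lane.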

ashvardanian commented 2 weeks ago

I was thinking about something different, @MarkReedZ. I'll try today.