Closed ashvardanian closed 2 months ago
@MarkReedZ, in case you will be looking into this, better to merge into the linked feature branch ;)
Was this what you were thinking? It is several instructions shorter, but only 0.5% faster. I like the _mm512_permutex2var_epi16, but your original unpacking was readable.
Godbolt: https://godbolt.org/z/aPYf55s81
// The following code expands a packed bf16 __m512i into two f32 vectors for the subtraction,
// then packs the differences back into bf16 again.
__m512i zero = _mm512_setzero_si512();
__m512i idx_bot = _mm512_set_epi8(
    31, 30, 0, 0, 29, 28, 0, 0, 27, 26, 0, 0, 25, 24, 0, 0,
    23, 22, 0, 0, 21, 20, 0, 0, 19, 18, 0, 0, 17, 16, 0, 0,
    15, 14, 0, 0, 13, 12, 0, 0, 11, 10, 0, 0, 9, 8, 0, 0,
    7, 6, 0, 0, 5, 4, 0, 0, 3, 2, 0, 0, 1, 0, 0, 0
);
__m512i idx_top = _mm512_set_epi8(
    63, 62, 0, 0, 61, 60, 0, 0, 59, 58, 0, 0, 57, 56, 0, 0,
    55, 54, 0, 0, 53, 52, 0, 0, 51, 50, 0, 0, 49, 48, 0, 0,
    47, 46, 0, 0, 45, 44, 0, 0, 43, 42, 0, 0, 41, 40, 0, 0,
    39, 38, 0, 0, 37, 36, 0, 0, 35, 34, 0, 0, 33, 32, 0, 0
);
// The byte permute (AVX512_VBMI) moves each bf16 into bytes 2 and 3 of a 32-bit lane.
// The mask 0xCC..CC keeps those two bytes and zeroes bytes 0 and 1, so every bf16
// ends up in the upper half of an f32.
__m512i a_top = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_top, a_i16));
__m512i a_bot = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_bot, a_i16));
__m512i b_top = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_top, b_i16));
__m512i b_bot = _mm512_mask_blend_epi8(0xCCCCCCCCCCCCCCCC, zero, _mm512_permutexvar_epi8(idx_bot, b_i16));
__m512 d_top = _mm512_sub_ps(_mm512_castsi512_ps(a_top), _mm512_castsi512_ps(b_top));
__m512 d_bot = _mm512_sub_ps(_mm512_castsi512_ps(a_bot), _mm512_castsi512_ps(b_bot));
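// Pack back to bf16 by picking the upper 16-bit word of every f32 difference
// (truncation, no rounding). Indices 0..31 address d_top, 32..63 address d_bot.
__m512i indices2 = _mm512_set_epi16(
    31, 29, 27, 25, 23, 21, 19, 17,
    15, 13, 11, 9, 7, 5, 3, 1,
    63, 61, 59, 57, 55, 53, 51, 49,
    47, 45, 43, 41, 39, 37, 35, 33
);
return _mm512_permutex2var_epi16(_mm512_castps_si512(d_top), indices2, _mm512_castps_si512(d_bot));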
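For checking a vectorized variant like the one above, a scalar reference can help. The following is only a minimal sketch in plain C: the helper names (bf16_to_f32, f32_to_bf16_trunc, bf16_sub_ref) are hypothetical and not taken from the library, and it assumes the result is narrowed by truncation, which is what picking the upper 16-bit word does in the snippet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen bf16 to f32: the 16 bf16 bits become the upper half of the f32. */
static float bf16_to_f32(uint16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow f32 to bf16 by truncation: keep the upper 16 bits, no rounding.
   Assumption: this mirrors the word-picking permute in the snippet. */
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical reference: out[i] = a[i] - b[i], computed in f32. */
static void bf16_sub_ref(const uint16_t *a, const uint16_t *b,
                         uint16_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = f32_to_bf16_trunc(bf16_to_f32(a[i]) - bf16_to_f32(b[i]));
}
```

Running both versions over the same 32 inputs and comparing the outputs bit for bit should catch index mistakes in the permutation tables.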
I was thinking about something different, @MarkReedZ. I'll try today.
Can this be reduced to 2x subtractions, 2x shuffles, & 1 blend?
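One possible reading of that question, which the thread does not confirm, is to stay inside each 32-bit lane: treat the low bf16 of every lane as the "even" element and the high bf16 as the "odd" one, subtract each set once, and blend the two differences. The scalar model below sketches that for a single lane. The function name bf16x2_sub_lane is hypothetical, and in AVX-512 the shifts, AND, and blend would presumably map to _mm512_slli_epi32, _mm512_and_si512, _mm512_srli_epi32, and _mm512_mask_blend_epi16.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static float f32_from_bits(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

static uint32_t f32_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* One 32-bit lane holds two bf16 values: low in bits 0..15, high in bits 16..31.
   Hypothetical model of an in-lane bf16 subtraction with truncation. */
static uint32_t bf16x2_sub_lane(uint32_t a, uint32_t b) {
    /* Even elements: shift the low bf16 into the upper half of an f32. */
    float d_even = f32_from_bits(a << 16) - f32_from_bits(b << 16);
    /* Odd elements: already in the upper half, so only clear the low half. */
    float d_odd = f32_from_bits(a & 0xFFFF0000u) - f32_from_bits(b & 0xFFFF0000u);
    /* Blend: upper word from the odd difference, lower word from the even one. */
    return (f32_to_bits(d_odd) & 0xFFFF0000u) | (f32_to_bits(d_even) >> 16);
}
```

This keeps elements in their original positions, so no cross-lane permutation table is needed, though the exact instruction count depends on how the shifts and masks are counted.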