aqrit closed this issue 7 years ago
Thanks, it is really clever. However, you're still stuck with a kind of range checking, albeit done with a saturated add. I'd be happy to compare the performance of your approach, though. Could you provide a C++ implementation?
BTW, I recently came up with a method which needs only a single bitwise AND to detect an invalid input char: https://github.com/WojciechMula/base64simd/blob/master/decode/lookup.sse.cpp (function lookup_pshufb_bitmask).
I will do a version with C++ intrinsics eventually.
The difference between the saturation method and the bitmask method will probably be insignificant when using only SSE3 instructions.
However, with SSE4.1, lookup_pshufb_bitmask() needs to be reworked to place the _mm_movemask_epi8() behind a _mm_testz_si128(). If that can't be done then AFAIK it will be noticeably slower than the saturation method.
According to IACA, the unrolled sse3 saturation method will 'lookup' 64 bytes in 22.0 cycles on Nehalem, Westmere, and Sandy Bridge.
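A rough sketch of what "placing the movemask behind a testz" could look like, assuming the movemask is only needed to locate the offending byte once an error is known to exist. The function name `first_invalid_index`, its array-based signature, and the `TARGET_SSE41` macro are hypothetical and not taken from base64simd; the sketch assumes GCC or Clang on x86-64.

```cpp
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang, where a per-function target attribute enables
// SSE4.1 intrinsics without requiring -msse4.1 for the whole file.
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// Hypothetical helper: lo and hi are the two 16-byte bitmap vectors produced
// by the pshufb lookups. Returns -1 when every byte is valid, otherwise the
// index of the first byte for which (lo & hi) != 0.
TARGET_SSE41
int first_invalid_index(const uint8_t lo[16], const uint8_t hi[16]) {
    const __m128i vlo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lo));
    const __m128i vhi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hi));

    // Hot path: testz returns 1 when (vlo & vhi) is zero in all 128 bits,
    // so valid input never reaches the movemask.
    if (_mm_testz_si128(vlo, vhi)) return -1;

    // Cold path: only now build a per-byte mask to locate the error.
    const __m128i clash = _mm_and_si128(vlo, vhi);
    const __m128i ok = _mm_cmpeq_epi8(clash, _mm_setzero_si128());
    const unsigned bad = ~static_cast<unsigned>(_mm_movemask_epi8(ok)) & 0xFFFFu;

    // bad is non-zero here because testz failed; find its lowest set bit.
    int index = 0;
    while (((bad >> index) & 1u) == 0) ++index;
    return index;
}
```

In a real decoder the vectors would stay in registers rather than being reloaded from memory; the array parameters here only keep the sketch easy to call and test.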
Thank you, that's really interesting. I'm curious whether your approach is faster on Skylake -- @lemire and I are now working on an AVX2-only library (https://github.com/lemire/fastbase64). If you prepare an intrinsics version we will be happy to test it.
+1
bitmap method w/testz
const __m128i lut_lo = _mm_setr_epi8(
0x15, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11,
0x11, 0x11, 0x13, 0x1A, 0x1B, 0x1B, 0x1B, 0x1A
);
const __m128i lut_hi = _mm_setr_epi8(
0x10, 0x10, 0x01, 0x02, 0x04, 0x08, 0x04, 0x08,
0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10
);
const __m128i lut_roll = _mm_setr_epi8(
0, 16, 19, 4, -65, -65, -71, -71,
0, 0, 0, 0, 0, 0, 0, 0
);
const __m128i mask_2F = _mm_set1_epi8( 0x2F );
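// lookup
__m128i hi_nibbles, lo_nibbles, lo, hi, roll, eq_2F;
hi_nibbles = _mm_srli_epi32( values, 4 );
// 0x2F doubles as the nibble mask: pshufb reads only index bits 0..3 and bit 7,
// so the stray bit 5 it lets through is harmless
lo_nibbles = _mm_and_si128( values, mask_2F );
lo = _mm_shuffle_epi8( lut_lo, lo_nibbles );
// 0xFF (i.e. -1) in every byte that holds '/'
eq_2F = _mm_cmpeq_epi8( values, mask_2F );
hi_nibbles = _mm_and_si128( hi_nibbles, mask_2F );
hi = _mm_shuffle_epi8( lut_hi, hi_nibbles );
// '/' shares high nibble 2 with '+'; adding -1 moves it to roll slot 1
roll = _mm_shuffle_epi8( lut_roll, _mm_add_epi8( eq_2F, hi_nibbles ) );
// a byte is invalid when its lo and hi bitmaps share a set bit
if( ! _mm_testz_si128( lo, hi ) ) goto invalid_char;
// add the per-class offset to turn ASCII into the 6-bit value
values = _mm_add_epi8( values, roll );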
the saturation method might save a cycle or so when unrolled... but other than that it seems pointless.
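For anyone who wants to sanity-check the tables in the snippet above without SIMD hardware, here is a per-byte scalar model. It is only a sketch: the tables are copied from the snippet, while the function names (`decode_char_model`, `decode_char_reference`, `count_mismatches`) are made up for illustration and the reference decoder assumes the standard base64 alphabet.

```cpp
#include <cstdint>

// Tables copied from the intrinsics snippet.
static const uint8_t LUT_LO[16] = {
    0x15, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11,
    0x11, 0x11, 0x13, 0x1A, 0x1B, 0x1B, 0x1B, 0x1A
};
static const uint8_t LUT_HI[16] = {
    0x10, 0x10, 0x01, 0x02, 0x04, 0x08, 0x04, 0x08,
    0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10
};
static const int8_t LUT_ROLL[16] = {
    0, 16, 19, 4, -65, -65, -71, -71,
    0, 0, 0, 0, 0, 0, 0, 0
};

// Scalar model of one byte lane: returns the 6-bit value, or -1 if invalid.
int decode_char_model(uint8_t c) {
    const unsigned lo_nibble = c & 0x0F;
    const unsigned hi_nibble = c >> 4;
    // Invalid when the two bitmaps share a set bit (the testz condition).
    if (LUT_LO[lo_nibble] & LUT_HI[hi_nibble]) return -1;
    // '/' (0x2F) shares high nibble 2 with '+', so it uses roll slot 1.
    const unsigned roll_index = (c == 0x2F) ? hi_nibble - 1 : hi_nibble;
    return static_cast<uint8_t>(c + LUT_ROLL[roll_index]);
}

// Straightforward reference decoder for the standard base64 alphabet.
int decode_char_reference(uint8_t c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Counts the byte values on which the model and the reference disagree.
int count_mismatches() {
    int mismatches = 0;
    for (int c = 0; c < 256; ++c) {
        if (decode_char_model(static_cast<uint8_t>(c)) !=
            decode_char_reference(static_cast<uint8_t>(c))) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling `count_mismatches()` compares all 256 byte values, which covers the whole input space of a single lane.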
@aqrit Thank you very much, will check it next week and ping you back.
@aqrit We are working on a write-up for this code and we would like to credit you for this clever idea. You appear to be using the Web anonymously, under a pseudonym, which is fine, but it makes formally giving credit difficult. Would you give us a name? Short of that, if you are willing to email your name, please do so at lemire@gmail.com or wojciech_mula@poczta.onet.pl.
the obvious extension of the saturation method to pshufb luts... it is a few instructions shorter than the current sse4 method and AFAIK should be faster, if unrolled.