sse3 lookup - Githubissues

aqrit commented 7 years ago

the obvious extension of the saturation method to pshufb luts... it is a few instructions shorter than the current sse4 method and AFAIK should be faster, if unrolled.

;;; sse3 lookup ;;;
; xmm0  = in/out
; xmm12 = 0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,0x2F,[...]  // mask_2F
; xmm13 = 0x01,0x50,0x54,0x46,0x25,0x25,0x05,0x05,[...]  // lut_roll_up
; xmm14 = 0xFF,0x7F,0x7F,0x76,0x66,0x66,0x66,0x66,[...]  // lut_roll_down
; xmm15 = 0x00,0x3F,0x3E,0x34,0x00,0x00,0x1A,0x1A,[junk] // lut_shift
;
;;; perfect hash for lut = ((src>>4)&0x2F)+((src==0x2F)?0xFF:0x00)
;;; 0000 = garbage
;;; 0001 = slash
;;; 0010 = plus
;;; 0011 = 0-9
;;; 0100 = A-Z
;;; 0101 = A-Z
;;; 0110 = a-z
;;; 0111 = a-z
;;; 1000 >= garbage

movdqa   xmm2, xmm12
movdqa   xmm3, xmm13
movdqa   xmm4, xmm14
movdqa   xmm5, xmm15
movdqa   xmm1, xmm0
psrld    xmm1, 4
pcmpeqb  xmm2, xmm0
pand     xmm1, xmm12
paddb    xmm1, xmm2
pshufb   xmm3, xmm1
pshufb   xmm4, xmm1
pshufb   xmm5, xmm1
paddb    xmm0, xmm3
psubsb   xmm0, xmm4
paddusb  xmm0, xmm5
pmovmskb  eax, xmm0

WojciechMula commented 7 years ago

Thanks, it is really clever. However, you're still stuck with a kind of range checking, although done with saturated add. I'll be happy to compare performance of your approach, though. Would you provide a C++ implementation?

BTW recently I came up with method with involve a single bit-and to find out invalid input char: https://github.com/WojciechMula/base64simd/blob/master/decode/lookup.sse.cpp (function lookup_pshufb_bitmask).

aqrit commented 7 years ago

I will do a version with C++ intrinsics eventually.

The difference between the saturation method and the bitmask method will probably be insignificant when using only SSE3 instructions.

However with SSE4.1, lookup_pshufb_bitmask() needs to be reworked to place the _mm_movemask_epi8() behind a _mm_testz_si128(). If that can't be done then AFAIK it will be noticeably slower than the saturation method.

According to IACA, the unrolled sse3 saturation method will 'lookup' 64 bytes in 22.0 cycles on Nehalem, Westmere, and Sandy Bridge.

WojciechMula commented 7 years ago

Thank you, that's really interesting. I'm curious if your approach is faster on Skylake -- @lemire and I are working now on AVX2-only library (https://github.com/lemire/fastbase64). If you prepare an intrinsics version we will be happy to test it.

lemire commented 7 years ago

+1

aqrit commented 7 years ago

bitmap method w/testz

const __m128i lut_lo = _mm_setr_epi8(
    0x15, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 
    0x11, 0x11, 0x13, 0x1A, 0x1B, 0x1B, 0x1B, 0x1A 
);
const __m128i lut_hi = _mm_setr_epi8(
    0x10, 0x10, 0x01, 0x02, 0x04, 0x08, 0x04, 0x08, 
    0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10
);
const __m128i lut_roll = _mm_setr_epi8(
    0,   16,  19,   4, -65, -65, -71, -71,
    0,   0,   0,   0,   0,   0,   0,   0
);
const __m128i mask_2F = _mm_set1_epi8( 0x2F );

// lookup
__m128i hi_nibbles, lo_nibbles, lo, hi, roll, eq_2F;
hi_nibbles = _mm_srli_epi32( values, 4 );
lo_nibbles = _mm_and_si128( values, mask_2F );
lo = _mm_shuffle_epi8( lut_lo, lo_nibbles );
eq_2F = _mm_cmpeq_epi8( values, mask_2F );
hi_nibbles = _mm_and_si128( hi_nibbles, mask_2F );
hi = _mm_shuffle_epi8( lut_hi, hi_nibbles );
roll = _mm_shuffle_epi8( lut_roll, _mm_add_epi8( eq_2F, hi_nibbles ) );
if( ! _mm_testz_si128( lo, hi ) ) goto invalid_char;
values = _mm_add_epi8( values, roll );

the saturation method might save a cycle or so when unrolled... but other than that it seems pointless.

WojciechMula commented 7 years ago

@aqrit Thank you very much, will check it next week and ping you back.

lemire commented 7 years ago

@aqrit We are working on a write-up for this code and we would like to credit you for this clever idea. You appear to be using the Web anonymously, under a pseudonym, which is fine, but it makes formally giving credit difficult. Would you give us a name? Short of that, if you are willing to email your name, please do so at lemire@gmail.com or wojciech_mula@poczta.onet.pl.

WojciechMula / base64simd

sse3 lookup #3