aqrit closed this issue 5 years ago
@aqrit Thank you. Taking into account how limited SSE2 is, it's really impressive. I need some time to get a grasp of all the tricks you use. Good job. :)
@aqrit Will translate it to AVX512BW and see how it behaves. It's really, really neat. Love it. Thanks again for sharing.
@aqrit OK, I finally managed to translate your code into AVX512BW. The SSE2 variant is faster, but the AVX512BW translation is unfortunately slower:
AVX512BW (lookup: N/A, pack: multiply-add) : 0.140 cycle/op (best) 0.144 cycle/op (avg)
AVX512BW (lookup: aqrit, pack: multiply-add) : 0.165 cycle/op (best) 0.170 cycle/op (avg)
I must say that it took me way too much time. Sorry for that, it's a shame.
IIRC, in the AVX2/SSE4.1 methods the lower_nibble doesn't have to be isolated. If the hi-bit is set then pshufb outputs a zero... so if the table for the lower nibble is inverted, then at the end one could just `testc` instead of `testz`. I never submitted a patch for this as I assume the extra `AND` doesn't cost much.
The SSSE3 rewrite was a win because it simplifies the error checking (as `testz` is not available).

*edit: and I think `_mm_movemask_epi8, cmp, jz` saves 1 cycle on most CPUs compared to `ptest, jz`.
After I did the "avg/adds" SSSE3 method, I realized that I had seen that technique before... it was explained in some blog post implementing an `isalnum()`-like method (which now I can't find... wtf google).
Rudimentary SSE2 decoder in NASM, if anyone is interested. Not benchmarked. https://gist.github.com/aqrit/4e33614c64b5be81d88b6a630eb77731
nerd sniped from /r/programming