aqrit closed this issue 5 years ago
@aqrit Thank you. Taking into account how limited SSE2 is, it's really impressive. I need some time to get a grasp of all the tricks you use. Good job. :)
@aqrit Will translate it to AVX512BW and see how it behaves. It's really, really neat. Love it. Thanks again for sharing.
@aqrit OK, I finally managed to translate your code into AVX512BW. The SSE2 variant is faster, but the AVX512BW translation is unfortunately slower:
AVX512BW (lookup: N/A, pack: multiply-add) : 0.140 cycle/op (best) 0.144 cycle/op (avg)
AVX512BW (lookup: aqrit, pack: multiply-add) : 0.165 cycle/op (best) 0.170 cycle/op (avg)
I must say that it took me way too much time. Sorry for that, it's a shame.
IIRC, in the AVX2/SSE4.1 methods the lower_nibble doesn't have to be isolated. If the hi-bit is set then pshufb outputs a zero... so if the table for the lower nibble is inverted, then at the end one could just `testc` instead of `testz`. I never submitted a patch for this as I assume the extra `AND` doesn't cost much.
The SSSE3 rewrite was a win because it simplifies the error checking (as `testz` is not available).

*edit: and I think `_mm_movemask_epi8, cmp, jz` saves 1 cycle on most CPUs compared to `ptest, jz`.
After I did the "avg/adds" SSSE3 method, I realized that I had seen that technique before... it was explained in some blog post implementing an `isalnum()`-like method (which now I can't find... wtf google).
Rudimentary SSE2 decoder in NASM, if anyone is interested. Not benchmarked. https://gist.github.com/aqrit/4e33614c64b5be81d88b6a630eb77731
nerd sniped from /r/programming