aklomp / base64

Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
BSD 2-Clause "Simplified" License

Add SSE4.2 code paths #18

mayeut closed 7 years ago

mayeut commented 7 years ago

SSE4.2 has been added for decoding.

Speed-up on Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz using Apple LLVM version 8.0.0 (clang-800.0.38)

SSE4.2 decoding: +29% compared to SSSE3 in ab7a48bcbc8066e2e200836dd9f40a2b44dda35b

Full results (this builds on top of #17):

Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 100 * 1
AVX2    encode  7876.40 MB/sec
AVX2    decode  7047.91 MB/sec
plain   encode  1499.65 MB/sec
plain   decode  1557.21 MB/sec
SSSE3   encode  5612.09 MB/sec
SSSE3   decode  3883.52 MB/sec
SSE42   encode  5600.26 MB/sec
SSE42   decode  5089.87 MB/sec
Testing with buffer size 1 MB, fastest of 100 * 10
AVX2    encode  8970.38 MB/sec
AVX2    decode  7093.07 MB/sec
plain   encode  1501.03 MB/sec
plain   decode  1562.29 MB/sec
SSSE3   encode  5681.83 MB/sec
SSSE3   decode  3902.85 MB/sec
SSE42   encode  5681.28 MB/sec
SSE42   decode  5108.25 MB/sec
Testing with buffer size 100 KB, fastest of 100 * 100
AVX2    encode  9010.17 MB/sec
AVX2    decode  7036.27 MB/sec
plain   encode  1500.42 MB/sec
plain   decode  1561.01 MB/sec
SSSE3   encode  5698.98 MB/sec
SSSE3   decode  3901.21 MB/sec
SSE42   encode  5698.52 MB/sec
SSE42   decode  5103.18 MB/sec
Testing with buffer size 10 KB, fastest of 1000 * 100
AVX2    encode  8928.00 MB/sec
AVX2    decode  6923.62 MB/sec
plain   encode  1504.46 MB/sec
plain   decode  1558.74 MB/sec
SSSE3   encode  5672.21 MB/sec
SSSE3   decode  3912.51 MB/sec
SSE42   encode  5674.36 MB/sec
SSE42   decode  5078.06 MB/sec
Testing with buffer size 1 KB, fastest of 1000 * 1000
AVX2    encode  6726.56 MB/sec
AVX2    decode  5761.74 MB/sec
plain   encode  1420.93 MB/sec
plain   decode  1519.19 MB/sec
SSSE3   encode  5090.29 MB/sec
SSSE3   decode  3604.62 MB/sec
SSE42   encode  5023.86 MB/sec
SSE42   decode  4496.14 MB/sec
mayeut commented 7 years ago

Seems it's wrong... It can decode invalid input. Please do not merge yet.

mayeut commented 7 years ago

OK, all seems good now. I added tests for invalid input so that it does not happen again.

I think either there is something wrong in the Intel Intrinsics Guide entry for pcmpistri or I misunderstood the documentation, but SSE4.2 now works properly.

aklomp commented 7 years ago

Correct me if I'm wrong, but I don't understand where this pull request adds AVX support. As far as I can tell, it only adds SSE42. The codec_avx.c file is virtually identical to codec_ssse3.c except for some cosmetic changes, and the codecs it uses are SSSE3 (encode) and SSE42 (decode). I don't see the point of adding that file...

aklomp commented 7 years ago

A minor point, personal preference even, but in the interest of keeping commits small and atomic, you could reconsider splitting the commit into:

mayeut commented 7 years ago

@aklomp, maybe my comment wasn't clear enough.

AVX is just a recompilation of SSSE3 for encoding and SSE4.2 for decoding.

It is recompiled using -mavx, so the compiler now uses the three-operand instructions that AVX makes available instead of the legacy two-operand instructions. This allows a small speed-up by removing some register copies (roughly 3/4 %).

If you feel it isn't worth it I can remove that.

mayeut commented 7 years ago

I can split the commits once I have the answer for the previous point (AVX). I will start another PR for the test harness.

aklomp commented 7 years ago

In regard to the AVX point, the thing that bothers me most is the code duplication. There must be a nicer way to fix it. Since the HAVE_AVX macro is available at compile time, we might be able to use it to compile the same piece of code in two different ways.

We could also solve it by putting something like this in the readme:

If your processor supports AVX instructions, you can speed up the SSE42 codec by using SSE42_CFLAGS=-mavx instead of SSE42_CFLAGS=-msse4.2.

Then we wouldn't have "official" AVX support in the sense that we have a separately compiled object file for that platform, but purists could get the speed benefits anyway.

mayeut commented 7 years ago

I opened #19 for invalid input fix & tests. I'll see what I come up with for the rest.

mayeut commented 7 years ago

OK I kept only SSE4.2 for now.

mayeut commented 7 years ago

Superseded by #22