Closed mayeut closed 7 years ago
Seems it's wrong... It can decode invalid input. Please do not merge yet.
OK, all seems good now. I added tests for invalid input so that it does not happen again.
I think there's something wrong in Intel Intrisincs Guide for pcmpistri
or I misunderstood the doc but now SSE4.2 works properly.
Correct me if I'm wrong, but I don't understand where this pull request adds AVX support. As far as I can tell, it only adds SSE42. The codec_avx.c
file is virtually identical to codec_ssse3.c
except for some cosmetic changes, and the codecs it uses are SSSE3 (encode) and SSE42 (decode). I don't see the point of adding that file...
A minor point, personal preference even, but in the interest of keeping commits small and atomic, you could reconsider splitting the commit into:
@aklomp, maybe my comment wasn't clear enough
AVX is just a recompilation of SSSE3 for encoding and SSE4.2 for decoding.
It is recompiled using -mavx so the compiler now uses the 3 operands instructions available instead of legacy 2 operands instructions. This allows for a small speed up by removing some registers copy (roughly 3/4 %).
If you feel it isn't worth it I can remove that.
I can split the commits once I have the answer for the previous point (AVX). I will start another PR for the test harness.
In regard to the AVX point, the thing that bothers me most is the code duplication. There must be a nicer way to fix it. Since the HAVE_AVX
macro is available at compile time, we might be able to use it to compile the same piece of code in two different ways.
We could also solve it by putting something like this in the readme:
If your processor supports AVX instructions, you can speed up the SSE42 codec by using
SSE42_CFLAGS=-mavx
instead ofSSE42_CFLAGS=-msse4.2
.
Then we wouldn't have "official" AVX support in the sense that we have a separately compiled object file for that platform, but purists could get the speed benefits anyway.
I opened #19 for invalid input fix & tests. I'll see what I come up with for the rest.
OK I kept only SSE4.2 for now.
Superseded by #22
SSE4.2 has been added for decoding.
Speed-up on
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
usingApple LLVM version 8.0.0 (clang-800.0.38)
SSE4.2 decoding: +29% compared to SSSE3 in ab7a48bcbc8066e2e200836dd9f40a2b44dda35b
Full results (this builds on top of #17):