Closed mayeut closed 7 years ago
As a side note, there's a bug in clang 3.4 leading to poor results with this implementation: https://bugs.llvm.org//show_bug.cgi?id=18478
This can be seen on the travis-ci clang build, which uses clang 3.4.
Another side note: speed-ups were measured with a 10 MB buffer, as reported in README.md.
With smaller buffers, which benefit from cache effects, the speed-ups are even higher. AVX2 deserves a special mention: it gains more than 30% in throughput.
Merged, thanks! A very clever trick, this bitshifting-by-multiplication, now that I finally understand it :) Sorry for the delay in merging.
Though this and other contributions have added valuable improvements, I think it might be time for a little cleanup round: fixing comments, style, documentation. I might push something along these lines soon, if I have time.
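For readers unfamiliar with the bitshifting-by-multiplication trick mentioned above: a 16-bit multiply by a power of two acts as a shift, and because SIMD multiplies take a separate multiplier per lane, each lane can be shifted by a different amount in one instruction. The following is a minimal scalar sketch in C of that idea only, not the code from this PR. The function names `shr_by_mulhi` and `shl_by_mullo` are made up for illustration; they model what instructions such as `_mm_mulhi_epu16` and `_mm_mullo_epi16` compute in each 16-bit lane.

```c
#include <stdint.h>

/* Hypothetical helper: logical right shift by n (1 <= n <= 16), expressed as
 * "multiply by 2^(16-n), keep the high 16 bits of the 32-bit product".
 * This is the per-lane behaviour of an unsigned mulhi-style multiply. */
uint16_t shr_by_mulhi(uint16_t x, unsigned n)
{
    uint32_t product = (uint32_t)x * (1u << (16 - n));
    return (uint16_t)(product >> 16);
}

/* Hypothetical helper: left shift by n (0 <= n <= 15), expressed as
 * "multiply by 2^n, keep the low 16 bits of the product".
 * This is the per-lane behaviour of a mullo-style multiply. */
uint16_t shl_by_mullo(uint16_t x, unsigned n)
{
    uint32_t product = (uint32_t)x * (1u << n);
    return (uint16_t)(product & 0xFFFFu);
}
```

In a vector version, the multipliers would live in a constant register, so lanes that need a shift by 10 and lanes that need a shift by 6 (for example) are handled by the same multiply.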
Use the updated AVX2 / SSSE3 encoding implementation by Wojciech Mula (@WojciechMula).
SSSE3 implementation is reused in SSE4.1, SSE4.2 and AVX dispatched encoding loops.
SSE4.1 implementation is now useless but kept to ease integration of future updates if needed.
Speed-up on i7-4870HQ @ 2.5 GHz (clang-800.0.42.1, x86_64):
- SSSE3 encoding: +20%
- SSE4.2 encoding: +8%
- AVX encoding: +7%
- AVX2 encoding: +3%
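The reuse of the SSSE3 implementation in the SSE4.1, SSE4.2 and AVX dispatched loops can be pictured roughly as a dispatch table in which several instruction-set levels point at the same encoder. This is only a sketch under assumptions: the enum, the function names and the table layout are hypothetical and are not the library's actual API or file structure.

```c
#include <stddef.h>

/* Hypothetical instruction-set levels a runtime dispatcher might detect. */
enum isa_level { ISA_PLAIN, ISA_SSSE3, ISA_SSE41, ISA_SSE42, ISA_AVX, ISA_AVX2, ISA_COUNT };

/* Hypothetical encoder signature; each function returns a label so the
 * selection can be observed without real SIMD code. */
typedef const char *(*enc_loop_fn)(void);

const char *enc_loop_plain(void) { return "plain"; }
const char *enc_loop_ssse3(void) { return "ssse3"; }
const char *enc_loop_avx2(void)  { return "avx2"; }

/* SSE4.1, SSE4.2 and AVX all reuse the SSSE3 encoding loop. */
static const enc_loop_fn enc_table[ISA_COUNT] = {
    [ISA_PLAIN] = enc_loop_plain,
    [ISA_SSSE3] = enc_loop_ssse3,
    [ISA_SSE41] = enc_loop_ssse3,
    [ISA_SSE42] = enc_loop_ssse3,
    [ISA_AVX]   = enc_loop_ssse3,
    [ISA_AVX2]  = enc_loop_avx2,
};

/* Return the encoder for a detected level, falling back to plain C
 * when the level is out of range. */
enc_loop_fn select_encoder(enum isa_level level)
{
    if ((int)level < 0 || level >= ISA_COUNT)
        return enc_loop_plain;
    return enc_table[level];
}
```

Keeping a separate table entry per level, even when several share one function, leaves room to plug a dedicated implementation back in later without touching the callers.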