Open htot opened 2 years ago
@aklomp are these in any way useful?
@htot Thanks for your work. It's interesting to see that not all "improvements" to the library have led to actual improvements in real-world benchmarks. That shows we need to be careful when introducing new tricks, because some users may end up worse off. That said, apart from SSSE3, the trend seems to be upward.
The SSSE3 thing could be due to register pressure. I think I saw the same degradation on my Atom N270 (a super weird processor, a 32-bit core with up to SSSE3 support), and when I tried to hand-optimize with inline assembler, I found out that that architecture has far fewer SSE registers available to it than big-boy x86's. That results in lots of register moves and slow code. I didn't bother much with it because I considered that use case so niche...
I think these benchmarks are cool and might be useful as a jumping-off point for analyzing performance degradations in past commits, but apart from that I don't see a major use for them. The idea of graphing out performance over time is very powerful though, and I'll try to remember it for my toolbox.
I think the Atom is, like Baytrail (and Edison), an x86_64 CPU; they support SSSE3 but not AVX. The core is Silvermont (SLM), which has a penalty for long 64-bit instructions (complicated story), and that might be the case here too (I have not tested in i686 mode). If so, Goldmont / Airmont may behave completely differently (but I don't have those here).
My i7-10700, by the way, appears to have 16 MB of L3 cache, so the benchmarks above are not really representative (typically nobody would ever encode the same string twice). I patched the benchmark to add a 100 MB string and found that in all cases except "plain" we are near the bandwidth limit of the DDR. Even "plain" reaches the bandwidth limit with OpenMP.
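To illustrate the cache effect described here, a minimal sketch of a throughput measurement at two buffer sizes follows. It uses Python's standard `base64` module as a stand-in, not the library discussed in this issue, and the function name `encode_throughput` is made up for the example. Absolute numbers from it say nothing about the C codecs; it only shows the method of comparing a cache-sized buffer against one far larger than L3.

```python
import base64
import os
import time


def encode_throughput(size_bytes, repeats=3):
    """Best-of-N base64 encode throughput in MB/s for one buffer size.

    Stand-in benchmark: uses the Python stdlib encoder, not the library
    under discussion. Only the methodology carries over.
    """
    data = os.urandom(size_bytes)  # random input, allocated once
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        base64.b64encode(data)
        best = min(best, time.perf_counter() - start)
    return size_bytes / best / 1e6
```

Comparing `encode_throughput(64 * 1024)` with `encode_throughput(100 * 1000 * 1000)` contrasts a buffer that fits in cache with one that has to stream from memory. Taking the best of several runs is one simple way to damp noise from frequency scaling.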
All these optimizations are useful in particular on the slow Atoms, but there we had a degradation. I'll add some improvements here, maybe you can label this "not a bug"?
The Intel Atom N270 really is a 32-bit Diamondville core with SSSE3 extensions, as you can see on Intel's site. It's a very low power (2.5W TDP), passively cooled mobile processor with this weird feature mix for some reason. I've been using it in my home server for the last 12 years. Bit slow at times but gets the job done. Anyway.
I created a "benchmarking" label and added it to this issue. I'll leave it open for the time being, then.
This is with a 100 MB buffer and OpenMP.
Here you see that the earlier i7 results were optimistic due to the L3 cache (Edison is not affected; it has no cache).
And something strange here with AVX2 decoding.
Nevertheless, looking at the history:
We know that the specialized encoding is much faster than plain, but as they are mostly DDR bandwidth limited, we don't see that. However, plain has seen some nice improvements over time.
After the early improvements in decoding, not much changed on i7. The decoding performance hit on Edison (Atom, or maybe even unique to Silvermont) is not present on i7. It might be worth reviving the old decoder for use when an Atom is detected.
> The Intel Atom N270 really is a 32-bit Diamondville core with SSSE3 extensions, as you can see on Intel's site. It's a very low power (2.5W TDP), passively cooled mobile processor with this weird feature mix for some reason. I've been using it in my home server for the last 12 years. Bit slow at times but gets the job done. Anyway.
> I created a "benchmarking" label and added it to this issue. I'll leave it open for the time being, then.
I see. That's confusing: there are also 64-bit Diamondville CPUs (Atom 230).
I did some automated benchmarking on my i7-10700 and Edison (Merrifield, a dual-core Silvermont Atom without cache memory, similar to Baytrail) that I want to share here. Strictly, this issue is for reference only. It might be useful for finding the commits that caused substantial performance increases or decreases. All data were taken without OpenMP (1 thread only) and in x86_64 mode. On the i7 you will see some deviation, probably caused by frequency scaling / turbo boost. Don't let that disturb you. The data can be found in benchmarks.ods if you want to play with it yourself.
Below I filter out the most interesting commits.
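One way such a filter could be sketched: walk the per-commit throughput series and flag every commit whose result moved by more than some relative threshold compared with the previous commit. The function name `flag_commits`, the 10% threshold, and the sample numbers are all assumptions made for illustration; they are not taken from benchmarks.ods.

```python
def flag_commits(results, threshold=0.10):
    """Return (commit, relative_change) for every commit whose throughput
    differs from the previous commit by more than `threshold`.

    `results` is a list of (commit_id, throughput) in history order.
    """
    flagged = []
    for (_, prev), (commit, cur) in zip(results, results[1:]):
        change = (cur - prev) / prev
        if abs(change) > threshold:
            flagged.append((commit, round(change, 3)))
    return flagged


# Synthetic placeholder values, NOT measurements from this issue.
history = [("c1", 100.0), ("c2", 102.0), ("c3", 150.0),
           ("c4", 148.0), ("c5", 120.0)]
print(flag_commits(history))  # [('c3', 0.471), ('c5', -0.189)]
```

A positive change marks an improvement and a negative one a regression. On a machine with turbo boost, the threshold probably needs to sit above the run-to-run noise.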
Encoding
Note that on Edison SSE3 encoding took a hit with 9a0d1b2.
Decoding
Especially for Edison it has been a bumpy ride, with great improvements (3f3f31c) and regressions (0a69845) on SSE3, but also for PLAIN (cfa8bf7 and f538baa).