aklomp / base64

Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
BSD 2-Clause "Simplified" License

Benchmarks #80

Open htot opened 2 years ago

htot commented 2 years ago

I did some automated benchmarking on my i7-10700 and on an Edison (Merrifield, a dual-core Silvermont Atom without cache memory, similar to Bay Trail) that I want to share here. Strictly speaking, this issue is for reference only. It might be useful for finding the commits that caused substantial performance increases or decreases. All data were taken without OpenMP (one thread only) and in x86_64 mode. On the i7 you will see some deviation, probably caused by frequency scaling / turbo boost. Don't let that disturb you. If you want to play with the data yourself, it can be found here: benchmarks.ods

Below I filter out the most interesting commits.
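For anyone who wants to repeat this filtering on their own numbers, here is a minimal sketch of one way to flag the commits where throughput jumps. It assumes the measurements have already been exported into an array with one entry per commit; the `struct sample` layout, the function names and the idea of a fixed relative threshold are all assumptions made for illustration, not something taken from benchmarks.ods.

```c
#include <stddef.h>
#include <stdio.h>

/* One benchmark sample: throughput measured at a given commit.
 * Hypothetical layout, not the format of benchmarks.ods. */
struct sample {
	const char *hash;   /* abbreviated commit hash */
	double mb_per_sec;  /* measured throughput */
};

/* Relative change from prev to cur, e.g. 0.25 for +25 %. */
double
rel_change(double prev, double cur)
{
	return (cur - prev) / prev;
}

/* Store the indices of all samples whose throughput differs from the
 * previous sample by more than `threshold` (a fraction, e.g. 0.10).
 * Returns the number of indices written to `out` (at most `outcap`). */
size_t
find_jumps(const struct sample *s, size_t n, double threshold,
           size_t *out, size_t outcap)
{
	size_t found = 0;

	for (size_t i = 1; i < n && found < outcap; i++) {
		double d = rel_change(s[i - 1].mb_per_sec, s[i].mb_per_sec);

		if (d > threshold || d < -threshold)
			out[found++] = i;
	}
	return found;
}

/* Print each flagged commit with its signed percentage change. */
void
print_jumps(const struct sample *s, const size_t *idx, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		size_t k = idx[i];

		printf("%s %+.1f%%\n", s[k].hash,
		       100.0 * rel_change(s[k - 1].mb_per_sec, s[k].mb_per_sec));
	}
}
```

On a machine with frequency scaling the threshold probably needs to be generous (10 % or more), otherwise run-to-run noise gets flagged as well.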

Encoding

Note that on Edison, SSSE3 encoding took a hit with 9a0d1b2.

(graph: encode)

#    Hash     Commit message
24   3f3f31c  Fix build under Xcode
30   67ee3fd  SSSE3->AVX2 encoding optimization
76   a5b6739  SSSE3: enc: factor encoding loop into inline function
79   99977db  Generic64: enc: factor encoding loop into inline function
92   e2c6687  AVX2: enc: unroll inner loop
93   9a0d1b2  SSSE3: enc: unroll inner loop
96   bf7341f  Generic64: enc: unroll inner loop
114  b8b3c58  Generic64: enc: use 12-bit lookup table

Decoding

Especially for Edison it has been a bumpy ride, with great improvements (3f3f31c) and regressions (0a69845) for SSSE3, but also for PLAIN (cfa8bf7 and f538baa).

(graph: decode)

#    Hash     Commit message
24   3f3f31c  Fix build under Xcode
29   cfa8bf7  Plain decoding optimization
35   0a69845  SSSE3->AVX2, NEON32 decoding optimization
85   6310c1f  SSSE3: dec: factor decoding loop into inline function
88   f538baa  Generic32: dec: factor decoding loop into inline function
100  495414b  AVX2: dec: unroll inner loop
101  5874921  SSSE3: dec: unroll inner loop

htot commented 2 years ago

@aklomp are these in any way useful?

aklomp commented 2 years ago

@htot Thanks for your work. It's interesting to see that not all "improvements" to the library have led to actual improvements in real-world benchmarks. Which proves that we need to be careful when introducing new tricks, because some users may be worse off. That said, apart from SSSE3, the trend seems to be upward.

The SSSE3 thing could be due to register pressure. I think I saw the same degradation happen on my Atom N270 (a super weird processor, a 32-bit core with up to SSSE3 support), and when I tried to hand-optimize with inline assembler, I found out that that architecture has far fewer SSE registers available to it than big-boy x86's. That results in lots of register moves and slow code. I didn't bother much with it because I considered that use case so niche...

I think these benchmarks are cool and might be useful as a jumping-off point for analyzing performance degradations in past commits, but apart from that I don't see a major use for them. The idea of graphing out performance over time is very powerful though, and I'll try to remember it for my toolbox.

htot commented 2 years ago

I think the Atom is, like Bay Trail (and Edison), an x86_64 CPU; they support SSSE3 but not AVX. The core is Silvermont (SLM), which has a penalty for long 64-bit instructions (complicated story); that might be the case here too (I have not tested in i686 mode). If so, Goldmont / Airmont may behave completely differently (but I don't have those here).

My i7-10700, by the way, appears to have 16 MB of L3 cache. So the above benchmarks are not really usable (typically nobody would ever encode the same string twice). I patched the benchmark to add a 100 MB string and found that in all cases except "plain" we are near the bandwidth limit of the DDR. And even "plain" reaches the bandwidth limit with OpenMP.
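To illustrate the cache effect, here is a rough sketch of timing a single pass over a buffer whose size is a parameter, so that a run with a buffer well above the L3 size can be compared against one that fits in cache. The encoder below is a deliberately simple stand-in written for this example, not the library's code, and `encode_mb_per_sec` is a hypothetical helper name. It uses `clock()`, which measures CPU time, so it is only meaningful for the single-threaded case.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

static const char b64_table[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Simple stand-in encoder (an assumption for this sketch, NOT the library's
 * implementation). `out` needs room for 4 * ((len + 2) / 3) bytes.
 * Returns the number of bytes written; no NUL terminator is added. */
size_t
plain_encode(const unsigned char *src, size_t len, char *out)
{
	size_t i = 0, o = 0;

	/* Full 3-byte groups become 4 output characters. */
	for (; i + 2 < len; i += 3) {
		unsigned v = (unsigned)src[i] << 16
		           | (unsigned)src[i + 1] << 8
		           | (unsigned)src[i + 2];

		out[o++] = b64_table[v >> 18];
		out[o++] = b64_table[(v >> 12) & 63];
		out[o++] = b64_table[(v >> 6) & 63];
		out[o++] = b64_table[v & 63];
	}

	/* One or two trailing bytes are padded with '='. */
	if (i < len) {
		int two = (i + 1 < len);
		unsigned v = (unsigned)src[i] << 16;

		if (two)
			v |= (unsigned)src[i + 1] << 8;
		out[o++] = b64_table[v >> 18];
		out[o++] = b64_table[(v >> 12) & 63];
		out[o++] = two ? b64_table[(v >> 6) & 63] : '=';
		out[o++] = '=';
	}
	return o;
}

/* Encode one buffer of `size` bytes once and return the throughput in MB/s.
 * Returns -1.0 if size is 0, on allocation failure, or if the run was too
 * short for clock() to resolve. */
double
encode_mb_per_sec(size_t size)
{
	double result = -1.0;
	unsigned char *src;
	char *dst;

	if (size == 0)
		return result;

	src = malloc(size);
	dst = malloc(4 * ((size + 2) / 3));

	if (src != NULL && dst != NULL) {
		/* Write every input byte so the pages are really mapped. */
		for (size_t i = 0; i < size; i++)
			src[i] = (unsigned char)(i * 31u);

		clock_t start = clock();
		size_t outlen = plain_encode(src, size, dst);
		clock_t stop = clock();

		/* Read the output so the encode call cannot be optimized away. */
		volatile char sink = dst[outlen - 1];
		(void)sink;

		double secs = (double)(stop - start) / CLOCKS_PER_SEC;
		if (secs > 0.0)
			result = (double)size / (1024.0 * 1024.0) / secs;
	}
	free(src);
	free(dst);
	return result;
}
```

Calling `encode_mb_per_sec((size_t)100 << 20)` gives the 100 MB case; repeatedly encoding a buffer of a few hundred kilobytes instead would show the cache-resident numbers, though a single pass that small is probably too short for `clock()` to time.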

All these optimizations are useful in particular on the slow Atoms, but there we had a degradation. I'll add some improvements here, maybe you can label this "not a bug"?

aklomp commented 2 years ago

The Intel Atom N270 really is a 32-bit Diamondville core with SSSE3 extensions, as you can see on Intel's site. It's a very low power (2.5W TDP), passively cooled mobile processor with this weird feature mix for some reason. I've been using it in my home server for the last 12 years. Bit slow at times but gets the job done. Anyway.

I created a "benchmarking" label and added it to this issue. I'll leave it open for the time being, then.

htot commented 2 years ago

This is with a 100 MB buffer and OpenMP.

(graph: decode-threaded)

Here you see that the earlier i7 results were optimistic due to the L3 cache (Edison is not affected; it has no cache).

(graph: encode-threaded)

And something strange here with AVX2 decoding.

Nevertheless, looking at the history:

(graph: i7-encode-large)

We know that the specialized encoding is much faster than plain, but as they are mostly DDR bandwidth limited, we don't see that. However, plain has seen some nice improvements over time.

(graph: i7-decode-large)

After the early improvements on decoding not much changed on i7. The decoding performance hit on Edison (Atom, or maybe even unique to Silvermont) is not present on i7. It would be worth reviving the old decoder for when Atom is detected.
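A minimal sketch of what such a detection step could look like, assuming the choice is made once at startup from the CPUID leaf 1 signature. The family/model decoding follows the standard CPUID layout. The list of Silvermont model numbers (family 6, models 0x37, 0x4A and 0x4D) is my assumption and should be checked against Intel's documentation, and `enum decoder` and `pick_ssse3_decoder` are hypothetical names, not part of the library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode the display family and model from the EAX value of CPUID leaf 1. */
void
cpuid_family_model(uint32_t eax, unsigned *family, unsigned *model)
{
	unsigned fam = (eax >> 8) & 0xF;
	unsigned mod = (eax >> 4) & 0xF;

	if (fam == 6 || fam == 15)
		mod |= ((eax >> 16) & 0xF) << 4;  /* extended model bits */
	if (fam == 15)
		fam += (eax >> 20) & 0xFF;        /* extended family bits */

	*family = fam;
	*model = mod;
}

/* ASSUMPTION: these family 6 model numbers identify Silvermont cores.
 * The list is illustrative and may be incomplete. */
bool
is_silvermont(uint32_t eax)
{
	unsigned family, model;

	cpuid_family_model(eax, &family, &model);
	if (family != 6)
		return false;
	return model == 0x37 || model == 0x4A || model == 0x4D;
}

/* Hypothetical selector: which SSSE3 decoder variant to use. */
enum decoder {
	DECODER_CURRENT,  /* the decoder as it is today */
	DECODER_LEGACY    /* the older decoder from before the regression */
};

enum decoder
pick_ssse3_decoder(uint32_t cpuid1_eax)
{
	return is_silvermont(cpuid1_eax) ? DECODER_LEGACY : DECODER_CURRENT;
}
```

With GCC or Clang on x86, the signature can be read with `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`. Whether a second decoder is worth the maintenance cost is a separate question.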

htot commented 2 years ago

> The Intel Atom N270 really is a 32-bit Diamondville core with SSSE3 extensions, as you can see on Intel's site. It's a very low power (2.5W TDP), passively cooled mobile processor with this weird feature mix for some reason. I've been using it in my home server for the last 12 years. Bit slow at times but gets the job done. Anyway.
>
> I created a "benchmarking" label and added it to this issue. I'll leave it open for the time being, then.

I see. That's confusing; there are also 64-bit Diamondville CPUs (Atom 230).