aklomp / base64

Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
BSD 2-Clause "Simplified" License

Benchmarks #95

Open htot opened 2 years ago

htot commented 2 years ago

@aklomp @mayeut Again a draft. Please ignore the Benchmarks patch; I was too far along to drop it and rebase against HEAD.

The interesting one is codec: add ssse3_atom.

My experience with CRC32C on Silvermont Atom (SLM) processors is that, in 64-bit mode, certain combinations of instructions incur a penalty (see the Intel manuals), which in some cases makes running in 64-bit mode a net loss. On later Atoms (Goldmont, Airmont) this penalty likely does not occur, but I don't have the hardware to test. Running base64 on SLM shows strange performance regressions, while a Core i7 shows improvement.

So, I revived the best SSSE3 codec as ssse3_atom and tested it on an Intel Edison (dual core, 500MHz) in 64b/32b mode (because that is easy to do) and on an Intel NUC with a Bay Trail Atom in 64b mode (to show the relevance for a mainstream CPU).


Min speed (MB/sec), decode and encode:

| Processor | decode plain | decode SSSE3 | decode SSSE3_ATOM | encode plain | encode SSSE3 | encode SSSE3_ATOM |
| --- | --- | --- | --- | --- | --- | --- |
| Atom E3815 @ 1.46GHz (64b) | 326 | 449 | **565** | 441 | 569 | 556 |
| Edison @ 500MHz (32b) | 40 | 102 | 103 | 67 | 111 | 111 |
| Edison @ 500MHz (64b) | 119 | 164 | **206** | 162 | 209 | 204 |
| i7-10700 CPU @ 2.90GHz | 3997 | 9356 | _4685_ | 4387 | 8823 | _7593_ |

Improvement by going back to the revived codec in bold, degradation in italic.

We see that on the i7 the latest version is indeed the fastest, and on SLM in 32-bit mode there is no difference. But on SLM in 64-bit mode SSSE3_ATOM is 25% faster. Having a fast algorithm has a much more noticeable effect on a slow Atom than on a fast i7... So what do you guys think, should we add a specialized SSSE3 codec for SLM?

htot commented 2 years ago

@aqrit?

aqrit commented 2 years ago

For dec_loop: #46 is probably faster, though it does trade readability for speed.

dec_reshuffle without _mm_madd_epi16 could look like this:

```c
// Pack 16 6-bit values into 12 bytes
// (wasm doesn't have pmaddubsw, but does have pmaddwd)
const v128_t shuf = wasm_i8x16_const(2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1);
v = wasm_v128_or(wasm_u16x8_shr(v, 6), wasm_i16x8_shl(v, 8));   // 00cccccc|dddddd00|00aaaaaa|bbbbbb00
v = wasm_v128_or(wasm_u32x4_shr(v, 18), wasm_i32x4_shl(v, 10)); // dddd0000|aaaaaabb|bbbbcccc|ccdddddd
v = wasm_i8x16_swizzle(v, shuf);                                //       ..|ccdddddd|bbbbcccc|aaaaaabb
```

I don't know if it has better latency, but it does have fewer instructions and constants ... edit: in comparison to dec_reshuffle in this PR.

htot commented 2 years ago

Yeah, this draft PR just revives an older version of the codec, which showed better performance than the current one (on SLM). I didn't try to create my own improvement. PR #46 is a bit older — did you benchmark it on Atom at the time?

htot commented 2 years ago

@aqrit would you rebase #46 on master? I'd like to run benchmarks on Edison/Atom.