WojciechMula / base64simd

Base64 coding and decoding with SIMD instructions (SSE/AVX2/AVX512F/AVX512BW/AVX512VBMI/ARM Neon)
http://0x80.pl/articles/index.html#base64-algorithm-new
BSD 2-Clause "Simplified" License

Benchmarking RVV implementation #9

Open camel-cdr opened 4 months ago

camel-cdr commented 4 months ago

I'm not sure if you have access to RVV hardware, so I thought I'd offer my help in that regard:

C908 encode:

input size: 12582912, output size: 16777216
number of iterations: 10
We report the time in cycles per input byte.
For reference, we present the time needed to copy 12582912 bytes.
rdtsc_overhead set to 7
memcpy                          :     0.515 cycle/op (best)    0.612 cycle/op (avg)
scalar (32 bit)                         ... 0.08162
scalar (64 bit)                         ... 0.07382 (speedup 1.11)
SWAR (64 bit)                           ... 0.07765 (speedup 1.05)
RISC-V RVV (LMUL=1)                     ... 0.07042 (speedup 1.16)
RISC-V RVV (LMUL=8)                     ... 0.05483 (speedup 1.49)
RISC-V Vector (LMUL=4, segmented load)  ... 0.02932 (speedup 2.78)

C908 decode:

scalar                                           ... 0.01204
improved scalar                                  ... 0.01204 (speed-up: 1.00)
RISC-V Vector (pack: gather)                     ... 0.01580 (speed-up: 0.76)
RISC-V Vector (pack: compress)                   ... 0.01399 (speed-up: 0.86)
RISC-V Vector (omit ws; pack: gather)            ... 0.01676 (speed-up: 0.72)
RISC-V Vector (omit ws, pack: compress)          ... 0.01733 (speed-up: 0.69)

C920 encode (without the segmented-load variant, because I can only compile intrinsics code for XTheadVector; I doubt it would perform well anyway, compare this and this. I'm hopeful that future hardware will perform more like the C908):

input size: 100663296, output size: 134217728
number of iterations: 10
We report the time in cycles per input byte.
For reference, we present the time needed to copy 100663296 bytes.
rdtsc_overhead set to 16
memcpy                          :     0.760 cycle/op (best)    0.905 cycle/op (avg)
scalar (32 bit)                         ... 0.56006
scalar (64 bit)                         ... 0.42620 (speedup 1.31)
SWAR (64 bit)                           ... 0.49197 (speedup 1.14)
RISC-V RVV (LMUL=1)                     ... 0.23546 (speedup 2.38)
RISC-V RVV (LMUL=8)                     ... 0.17674 (speedup 3.17)

C920 decode:

scalar                                           ... 0.05389
improved scalar                                  ... 0.05391 (speed-up: 1.00)
RISC-V Vector (pack: gather)                     ... 0.02384 (speed-up: 2.26)
RISC-V Vector (pack: compress)                   ... 0.02149 (speed-up: 2.51)
RISC-V Vector (omit ws; pack: gather)            ... 0.08527 (speed-up: 0.63)
RISC-V Vector (omit ws, pack: compress)          ... 0.08224 (speed-up: 0.66)

Btw, you should add the tail/mask policies to the inline-assembly vsetvlis; otherwise the code won't compile with clang.
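For example (illustrative; as far as I know binutils accepts the short form and defaults the policies, while clang's integrated assembler wants them spelled out):

```asm
# rejected by clang's integrated assembler (policy operands omitted):
vsetvli t0, a0, e8, m8
# accepted by both GNU as and clang (explicit tail/mask agnostic):
vsetvli t0, a0, e8, m8, ta, ma
```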

I'll be working on my own implementations and will share them here, so we can figure out what works best.

WojciechMula commented 4 months ago

Thanks! I have no access to any RISC-V hardware; I only test the validity of the code using the spike simulator. If you can, please run the microbenchmarks using the benchmark_rvv target - it is now built for both encode and decode.

camel-cdr commented 3 months ago

@WojciechMula Here you go:

C920:

$ ./decode
input size: 33554432
number of iterations: 100
We report the time in cycles per output byte.
For reference, we present the time needed to copy 25165824 bytes.
rdtsc_overhead set to 16
memcpy                          :     0.469 cycle/op (best)    0.497 cycle/op (avg)
memcpy (RVV)                    :     1.552 cycle/op (best)    1.566 cycle/op (avg)
scalar                          :     3.207 cycle/op (best)    3.216 cycle/op (avg)
improved scalar                 :     3.206 cycle/op (best)    3.216 cycle/op (avg)
RISC-V Vector (pack: gather)    :     1.321 cycle/op (best)    1.325 cycle/op (avg)
RISC-V Vector (pack: compress)  :     1.133 cycle/op (best)    1.135 cycle/op (avg)
RISC-V Vector (omit ws; pack: gather)   :     5.674 cycle/op (best)    5.700 cycle/op (avg)
RISC-V Vector (omit ws, pack: compress) :     5.541 cycle/op (best)    5.562 cycle/op (avg)
$ ./encode
input size: 3072, output size: 4096
number of iterations: 100
We report the time in cycles per input byte.
For reference, we present the time needed to copy 3072 bytes.
rdtsc_overhead set to 16
memcpy                          :     0.176 cycle/op (best)    0.196 cycle/op (avg)
memcpy (RVV)                    :     0.135 cycle/op (best)    0.137 cycle/op (avg)
scalar (32 bit)                 :    11.007 cycle/op (best)   11.278 cycle/op (avg)
scalar (64 bit)                 :     8.535 cycle/op (best)    8.551 cycle/op (avg)
SWAR (64 bit)                   :    10.191 cycle/op (best)   10.195 cycle/op (avg)
RISC-V RVV (LMUL=1)             :     4.425 cycle/op (best)    4.586 cycle/op (avg)
RISC-V RVV (LMUL=8)             :     1.438 cycle/op (best)    1.442 cycle/op (avg)

Note: the libc memcpy already uses RVV, but the C920 has a problem with LMUL=8 reads/writes when the data isn't in cache; see my memcpy benchmark. (No other implementation I know of has this problem.)

I also couldn't compile encode_strided_load_m8: GCC's XTheadVector support still has some bugs, and this specific code triggered one of them.

C908:

$ ./decode
input size: 4194304
number of iterations: 100
We report the time in cycles per output byte.
For reference, we present the time needed to copy 3145728 bytes.
rdtsc_overhead set to 7
memcpy                          :     0.517 cycle/op (best)    0.535 cycle/op (avg)
memcpy (RVV)                    :     0.516 cycle/op (best)    0.524 cycle/op (avg)
scalar                          :     5.447 cycle/op (best)    5.477 cycle/op (avg)
improved scalar                 :     5.455 cycle/op (best)    5.479 cycle/op (avg)
RISC-V Vector (pack: gather)    :     5.908 cycle/op (best)    5.914 cycle/op (avg)
RISC-V Vector (pack: compress)  :     4.888 cycle/op (best)    4.896 cycle/op (avg)
RISC-V Vector (omit ws; pack: gather)   :     6.208 cycle/op (best)    6.219 cycle/op (avg)
RISC-V Vector (omit ws, pack: compress) :     5.217 cycle/op (best)    5.229 cycle/op (avg)
$ ./encode
input size: 3072, output size: 4096
number of iterations: 100
We report the time in cycles per input byte.
For reference, we present the time needed to copy 3072 bytes.
rdtsc_overhead set to 7
memcpy                          :     0.260 cycle/op (best)    0.263 cycle/op (avg)
memcpy (RVV)                    :     0.268 cycle/op (best)    0.271 cycle/op (avg)
scalar (32 bit)                 :    12.069 cycle/op (best)   12.416 cycle/op (avg)
scalar (64 bit)                 :     9.474 cycle/op (best)   10.070 cycle/op (avg)
SWAR (64 bit)                   :    10.696 cycle/op (best)   10.960 cycle/op (avg)
RISC-V RVV (LMUL=1)             :     5.129 cycle/op (best)    5.405 cycle/op (avg)
RISC-V RVV (LMUL=8)             :     6.561 cycle/op (best)    6.793 cycle/op (avg)
RISC-V Vector (LMUL=4, segmented load)  :     3.603 cycle/op (best)    3.708 cycle/op (avg)
RISC-V Vector (LMUL=8, strided load, gather)    :     4.671 cycle/op (best)    4.775 cycle/op (avg)

Edit: I think I forgot to add the bitmanip extension to the compilation flags on the C908, causing the SWAR implementations to underperform. I'll include them next time.

camel-cdr commented 3 months ago

I've written another decode implementation using segmented load/stores and the three 4-bit LUT method: https://godbolt.org/z/7qc1xhMao

It's the fastest on the C908:

input size: 4194304
number of iterations: 100
We report the time in cycles per output byte.
For reference, we present the time needed to copy 3145728 bytes.
rdtsc_overhead set to 7
memcpy                          :     0.498 cycle/op (best)    0.516 cycle/op (avg)
memcpy (RVV)                    :     0.500 cycle/op (best)    0.506 cycle/op (avg)
scalar                          :     4.836 cycle/op (best)    4.853 cycle/op (avg)
improved scalar                 :     4.834 cycle/op (best)    4.854 cycle/op (avg)
RISC-V Vector (pack: gather)    :     5.795 cycle/op (best)    5.802 cycle/op (avg)
RISC-V Vector (pack: compress)  :     4.975 cycle/op (best)    4.980 cycle/op (avg)
RISC-V Vector (omit ws; pack: gather)   :     6.347 cycle/op (best)    6.360 cycle/op (avg)
RISC-V Vector (omit ws, pack: compress) :     5.181 cycle/op (best)    5.197 cycle/op (avg)
RISC-V Vector (camel; seg)      :     3.546 cycle/op (best)    3.563 cycle/op (avg)

Note that this could gain even more over the others on better hardware: as far as I can tell, the C908 executes LMUL=1 and LMUL=2 instructions at the same speed, and the last 12 instructions here are LMUL=1.

C920:

input size: 33554432
number of iterations: 100
We report the time in cycles per output byte.
For reference, we present the time needed to copy 25165824 bytes.
rdtsc_overhead set to 16
memcpy                          :     0.503 cycle/op (best)    0.523 cycle/op (avg)
memcpy (RVV)                    :     1.575 cycle/op (best)    1.582 cycle/op (avg)
scalar                          :     3.197 cycle/op (best)    3.212 cycle/op (avg)
improved scalar                 :     3.196 cycle/op (best)    3.212 cycle/op (avg)
RISC-V Vector (pack: gather)    :     1.321 cycle/op (best)    1.325 cycle/op (avg)
RISC-V Vector (pack: compress)  :     1.132 cycle/op (best)    1.138 cycle/op (avg)
RISC-V Vector (omit ws; pack: gather)   :     5.664 cycle/op (best)    5.688 cycle/op (avg)
RISC-V Vector (omit ws, pack: compress) :     5.520 cycle/op (best)    5.545 cycle/op (avg)
RISC-V Vector (camel; seg)      :     9.893 cycle/op (best)    9.938 cycle/op (avg)

As expected, the segmented loads/stores are very slow on the C920. But I also realized that we almost can't use the C920 measurements at all, because the codegen for XTheadVector is horrible: https://godbolt.org/z/d458vzvqc

It generates an insane amount of redundant moves and redundant vsetvlis.

GCC in general doesn't handle register allocation with vcreate and vget very well; even when compiling for standard RVV it inserts a few redundant moves: https://godbolt.org/z/j57Y137xG