Making results easier to reason about + memcpy comparison

It was difficult to determine what the "operation unit" was, so I made it explicit. I also added a memcpy reference, which seems essential. The gist of the story seems to be that encoding/decoding runs at roughly half the speed of a memcpy (if you use a handcrafted naive AVX-512-aware memcpy).
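To make the unit concrete, here is a minimal sketch of the normalization I have in mind. The helper name `cycles_per_byte` is hypothetical and is not taken from benchmark.cpp; it also assumes that the fixed rdtsc overhead is subtracted before dividing, which may not match the benchmark exactly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from benchmark.cpp): turns the raw cycle count of
// one run into cycles per byte, so that runs over buffers of different sizes
// can be compared directly.
// Assumption: the fixed rdtsc overhead is subtracted before normalizing.
double cycles_per_byte(std::uint64_t measured_cycles,
                       std::uint64_t rdtsc_overhead,
                       std::size_t bytes) {
    // Clamp to zero if the measurement is smaller than the overhead.
    const std::uint64_t net = measured_cycles > rdtsc_overhead
                                  ? measured_cycles - rdtsc_overhead
                                  : 0;
    return static_cast<double>(net) / static_cast<double>(bytes);
}
```

For encoding, `bytes` would be the input size; for decoding, it would be the output size, so both directions are normalized by the same amount of binary data.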

I think we need to worry about alignment. With AVX-512, the registers are the size of a cache line (64 bytes). If the code were in C, it would be trivial to do an aligned malloc, but in C++ it is less obvious. Still, this should be investigated: there may be noticeable differences.
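One possible way to get cache-line-aligned buffers in C++14 is sketched below. It relies on POSIX `posix_memalign`, so it is not portable to every platform; the names `allocate_aligned`, `misalignment`, `AlignedBuffer` and `kCacheLine` are made up for illustration and do not exist in the repository.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)

#include <cstddef>
#include <cstdint>
#include <memory>

// Assumption: 64-byte cache lines, which matches the width of a 512-bit register.
constexpr std::size_t kCacheLine = 64;

// Memory obtained from posix_memalign must be released with free().
struct FreeDeleter {
    void operator()(void* p) const noexcept { free(p); }
};

using AlignedBuffer = std::unique_ptr<std::uint8_t[], FreeDeleter>;

// Returns a buffer whose address is a multiple of kCacheLine,
// or an empty pointer if the allocation fails.
AlignedBuffer allocate_aligned(std::size_t size) {
    void* raw = nullptr;
    if (posix_memalign(&raw, kCacheLine, size) != 0) {
        return AlignedBuffer();
    }
    return AlignedBuffer(static_cast<std::uint8_t*>(raw));
}

// Offset of a pointer within its cache line; 0 means the pointer is aligned.
std::size_t misalignment(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine;
}
```

With buffers allocated this way, `misalignment(buf.get())` should be 0 for both source and destination, which would let us measure whether alignment changes the numbers.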

Numbers below are from the Cannon Lake microarchitecture.

Encoding speed...

$ make ./benchmark_avx512vl && ./benchmark_avx512vl
make: `benchmark_avx512vl' is up to date.
input size: 3072
number of iterations: 10000
We report the time in cycles per input byte.
For reference, we present the time needed to copy 3072 bytes.
rdtsc_overhead set to 20
memcpy                          :     0.045 cycle/op (best)    0.046 cycle/op (avg)
warning: your data pointers are unaligned: 16 32
memcpy (avx512)                 :     0.024 cycle/op (best)    0.027 cycle/op (avg)
AVX512VBMI                      :     0.051 cycle/op (best)    0.053 cycle/op (avg)
AVX512VL                        :     0.045 cycle/op (best)    0.046 cycle/op (avg)

Decoding...

$ make ./benchmark_avx512vbmi && ./benchmark_avx512vbmi
g++  -Wall -Wextra -pedantic -O3 -std=c++14 -mbmi2 -DHAVE_BMI2_INSTRUCTIONS -mavx512vbmi -mbmi2 -DHAVE_AVX512VBMI_INSTRUCTIONS -DHAVE_BMI2_INSTRUCTIONS  benchmark.cpp -o benchmark_avx512vbmi
input size: 4096
number of iterations: 10000
We report the time in cycles per output byte.
For reference, we present the time needed to copy 3072 bytes.
rdtsc_overhead set to 20
memcpy                          :     0.046 cycle/op (best)    0.047 cycle/op (avg)
warning: your data pointers are unaligned: 32 48
memcpy (avx512)                 :     0.025 cycle/op (best)    0.026 cycle/op (avg)
...
AVX512VBMI (lookup: N/A, pack: multiply-add)    :     0.060 cycle/op (best)    0.063 cycle/op (avg)

WojciechMula / base64simd