It was difficult to determine what the "operation unit" was, so I made it explicit: the benchmark now reports cycles per byte. I also added a memcpy reference, which seems essential. The gist of the story seems to be that encoding/decoding runs at roughly half the speed of a memcpy (if you use a handcrafted naive AVX-512-aware memcpy).
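To make the unit concrete, here is a minimal sketch of how a cycles-per-byte figure and the "roughly half the speed" ratio can be derived. The helper names `cycles_per_byte` and `slowdown_vs_reference` are hypothetical, and subtracting a fixed timer overhead from the raw cycle count is my assumption about the normalisation, not necessarily what benchmark.cpp does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: normalise a raw cycle count to cycles per byte.
// Assumption: the fixed timer overhead (e.g. the "rdtsc_overhead" value)
// is subtracted once from the measured cycle count.
inline double cycles_per_byte(std::uint64_t measured_cycles,
                              std::uint64_t timer_overhead,
                              std::size_t bytes) {
    if (bytes == 0 || measured_cycles <= timer_overhead) {
        return 0.0;  // nothing meaningful to report
    }
    return static_cast<double>(measured_cycles - timer_overhead) /
           static_cast<double>(bytes);
}

// Hypothetical helper: how many times slower a kernel is than a reference
// (e.g. the AVX-512 memcpy), both expressed in cycles per byte.
inline double slowdown_vs_reference(double kernel_cpb, double reference_cpb) {
    assert(reference_cpb > 0.0);
    return kernel_cpb / reference_cpb;
}
```

With the best-case numbers reported here, `slowdown_vs_reference(0.051, 0.024)` is about 2.1 for AVX512VBMI encoding and `slowdown_vs_reference(0.060, 0.025)` is 2.4 for decoding, both relative to the AVX-512 memcpy.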
I think we need to worry about alignment. With AVX-512, the registers are the size of a cache line (64 bytes), so an unaligned access can straddle two cache lines. The benchmark already warns that the data pointers are unaligned. If the code were in C, it would be trivial to do an aligned malloc, but in C++, it is less obvious. Still, this should be investigated. There may be noticeable differences.
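One possible way to get 64-byte-aligned buffers in C++14 is sketched below. It assumes a POSIX system and uses `posix_memalign`, since `std::aligned_alloc` only arrived in C++17 and the benchmark is built with -std=c++14. The names `AlignedBuffer`, `allocate_aligned` and `is_aligned` are hypothetical and not part of the existing code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdlib.h>  // posix_memalign, free (POSIX; assumed available)

// Deleter so that std::unique_ptr releases the buffer with free().
struct FreeDeleter {
    void operator()(void* p) const { free(p); }
};

// Hypothetical alias: an owning pointer to an aligned byte buffer.
using AlignedBuffer = std::unique_ptr<std::uint8_t[], FreeDeleter>;

// Hypothetical helper: allocate `size` bytes aligned to `alignment`.
// posix_memalign requires a power of two that is a multiple of sizeof(void*).
// Returns an empty pointer on failure.
inline AlignedBuffer allocate_aligned(std::size_t size,
                                      std::size_t alignment = 64) {
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);
    void* p = nullptr;
    if (posix_memalign(&p, alignment, size) != 0) {
        return AlignedBuffer(nullptr);
    }
    return AlignedBuffer(static_cast<std::uint8_t*>(p));
}

// Check whether a pointer sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment = 64) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Allocating both the input and the output with something like `allocate_aligned(3072)` would let us rerun the benchmark and see whether the unaligned-pointer warning, and the timings, change.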
The numbers below are from a Cannon Lake processor.
Encoding speed...
$ make ./benchmark_avx512vl && ./benchmark_avx512vl
make: `benchmark_avx512vl' is up to date.
input size: 3072
number of iterations: 10000
We report the time in cycles per input byte.
For reference, we present the time needed to copy 3072 bytes.
rdtsc_overhead set to 20
memcpy : 0.045 cycle/op (best) 0.046 cycle/op (avg)
warning: your data pointers are unaligned: 16 32
memcpy (avx512) : 0.024 cycle/op (best) 0.027 cycle/op (avg)
AVX512VBMI : 0.051 cycle/op (best) 0.053 cycle/op (avg)
AVX512VL : 0.045 cycle/op (best) 0.046 cycle/op (avg)
Decoding...
$ make ./benchmark_avx512vbmi && ./benchmark_avx512vbmi
g++ -Wall -Wextra -pedantic -O3 -std=c++14 -mbmi2 -DHAVE_BMI2_INSTRUCTIONS -mavx512vbmi -mbmi2 -DHAVE_AVX512VBMI_INSTRUCTIONS -DHAVE_BMI2_INSTRUCTIONS benchmark.cpp -o benchmark_avx512vbmi
input size: 4096
number of iterations: 10000
We report the time in cycles per output byte.
For reference, we present the time needed to copy 3072 bytes.
rdtsc_overhead set to 20
memcpy : 0.046 cycle/op (best) 0.047 cycle/op (avg)
warning: your data pointers are unaligned: 32 48
memcpy (avx512) : 0.025 cycle/op (best) 0.026 cycle/op (avg)
...
AVX512VBMI (lookup: N/A, pack: multiply-add) : 0.060 cycle/op (best) 0.063 cycle/op (avg)