bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

Performance fix for BP-128 kernels #131

Closed. bnprks closed this pull request 1 month ago.

bnprks commented 2 months ago

When re-running some benchmarks, I observed 30-40% performance regressions for in-memory BP-128 compression relative to the pre-highway code (e.g. commit 91ed30). This turned out to be caused by the shift from macro-based code to C++ lambda-based code. For the BP-128 kernels it is extremely important to avoid loops or function calls in the core pack/unpack function, but in the lambda-based code the compiler was not properly inlining some of the lambdas.
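For illustration, here is a minimal sketch of the failure mode, not the actual BPCells code; `pack_generic` and its transform parameter are hypothetical names:

```cpp
#include <cstdint>

// Hypothetical illustration: a generic kernel that applies a caller-supplied
// lambda per value. If the compiler does not inline `transform`, the hot
// loop pays an extra call per value.
template <typename Transform>
void pack_generic(const uint32_t* in, uint32_t* out, Transform transform) {
    // In the real kernels even this loop would be fully unrolled; any
    // residual call or loop here is a large throughput penalty.
    for (int i = 0; i < 128; i++)
        out[i] = transform(in[i]);
}

// Example usage with a lambda the optimizer is expected (but not
// guaranteed) to inline:
// pack_generic(in, out, [](uint32_t v) { return v; });
```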

This PR makes two important changes, plus one unimportant change that may be useful in the future.

Important change 1: Shift back to using pure C macros to construct the core BP-128 kernels. These are the BP128_*_DECL macros, which aim to isolate the core bitpacking logic in one place so that each variant only needs to specify its arguments and the transform logic applied before/after bitpacking. Because these are macros rather than C++ lambdas the code is very ugly, but it should rarely (if ever) need updating. The good news is that, thanks to the macros, there are no function calls or loops left for the compiler to improperly keep in the generated kernels.
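As a rough illustration of the pattern only (the real BP128_*_DECL macros are far more elaborate and fully unrolled over all 128 values), a toy macro might look like:

```cpp
#include <cstdint>

// Hedged, simplified sketch: the per-value TRANSFORM is substituted
// textually into a fully unrolled body, so the generated kernel contains
// no function calls or loops for the compiler to fail to inline.
#define BP128_PACK4_DECL(NAME, TRANSFORM)                               \
    static void NAME(const uint32_t *in, uint32_t *out) {               \
        /* Toy body: pack 4 values at 8 bits each, fully unrolled. */   \
        out[0] = TRANSFORM(in[0]) | (TRANSFORM(in[1]) << 8) |           \
                 (TRANSFORM(in[2]) << 16) | (TRANSFORM(in[3]) << 24);   \
    }

// A variant need only supply its transform:
#define IDENTITY(v) (v)
BP128_PACK4_DECL(pack4_vanilla, IDENTITY)
```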

Important change 2: Reduce unnecessary memory copies in the in-memory benchmarking code.
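A minimal before/after sketch of the kind of fix involved, with hypothetical names standing in for the actual benchmark code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a pack/unpack call; illustrative only.
void run_codec(const uint32_t* in, uint32_t* out, size_t n) {
    std::memcpy(out, in, n * sizeof(uint32_t));
}

// Before: a staging copy inside the timed region gets charged to the codec.
void bench_with_copy(const std::vector<uint32_t>& input, std::vector<uint32_t>& out) {
    std::vector<uint32_t> staging(input); // unnecessary copy in the hot path
    run_codec(staging.data(), out.data(), staging.size());
}

// After: the codec reads directly from the source buffer, so the benchmark
// measures only pack/unpack throughput.
void bench_direct(const std::vector<uint32_t>& input, std::vector<uint32_t>& out) {
    run_codec(input.data(), out.data(), input.size());
}
```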

As shown in the table below, switching back to a macro-based solution cut the performance hit from 30-40% down to 5-10%; reducing the memory copies in the in-memory benchmark then yielded substantial further improvements, in several cases past the pre-highway baseline.

| Code source | read val (GB/s) | read idx (GB/s) | write val (GB/s) | write idx (GB/s) |
| --- | --- | --- | --- | --- |
| pre-highway (91ed30) | 8.49 | 5.74 | 8.99 | 5.69 |
| github main (de87d45) | 5.35 | 3.29 | 5.75 | 4.13 |
| + lambda->macro change | 7.27 | 5.06 | 8.20 | 5.11 |
| + reduced benchmark copies | 12.18 | 7.65 | 9.46 | 5.72 |

Unimportant change: Add the Nx128 wrapper functions, which allow packing/unpacking multiple 128-integer chunks with a single function call. This reduces the number of indirect function calls required and can yield modest time savings by amortizing the highway library's dynamic dispatch (which selects the best version for the available hardware features). However, the remaining big bottleneck is extra memory copies caused by the existing UIntReader and UIntWriter classes, so it did not seem worth integrating the Nx128 functions into BP128UIntReader/Writer: doing so would be tricky and would just need to be redone when/if the existing interfaces are improved.
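A hedged sketch of the call-pattern difference, with illustrative names and signatures rather than the actual BPCells::simd::bp128 API:

```cpp
#include <cstddef>
#include <cstdint>

// Assume pack128() is reached via highway-style dynamic dispatch, i.e. an
// indirect call that selects the best kernel for the current hardware.
void pack128(const uint32_t* in, uint32_t* out, uint32_t bits);

// Looped calls: one dispatched (indirect) call per 128-integer chunk.
// 128 values at width `bits` occupy 4 * bits 32-bit output words.
void pack_looped(const uint32_t* in, uint32_t* out, size_t n_chunks, uint32_t bits) {
    for (size_t i = 0; i < n_chunks; i++)
        pack128(in + 128 * i, out + 4 * size_t(bits) * i, bits);
}

// Nx128-style wrapper: the chunk loop lives behind a single dispatched
// entry point, so the indirect-call overhead is paid once per call rather
// than once per chunk.
void pack_Nx128(const uint32_t* in, uint32_t* out, size_t n_chunks, uint32_t bits);
```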

To measure these overheads, I did a bit of benchmarking with just the vanilla BP-128 codec (no transforms), which should be the most affected by overhead. This compares the existing benchmark code against bypassing the BP128UIntReader/Writer classes and calling the BPCells::simd::bp128 bitpacking functions directly on the arrays' memory.

| Method | write (GB/s) | read (GB/s) |
| --- | --- | --- |
| BP128UIntReader/Writer | 8.60 | 9.81 |
| [un]pack_Nx128() | 14.1 | 11.9 |
| looped [un]pack() calls | 13.8 | 9.17 |
bnprks commented 2 months ago

Recommended resources to get up to speed on the BP-128 codecs:

  1. The original paper by Daniel Lemire (Lemire & Boytsov, "Decoding billions of integers per second through vectorization")
  2. The pack128 and unpack128 functions in test-bp128.cpp, which show the vanilla BP-128 algorithm in a very readable format (see also the scalar sketch after this list)
  3. The bitpacking repo from Lemire that was an important reference for the original implementation. It is rather tough to read, but the header files might be informative
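For orientation only, here is a hedged scalar sketch of the core bitpacking step, using a simplified sequential bit layout; the actual pack128/unpack128 in test-bp128.cpp (and the SIMD kernels, which interleave values across lanes) are the authoritative reference:

```cpp
#include <cstdint>

// Hedged scalar sketch: packs 128 uint32 values, each known to fit in
// `bits` bits, into exactly 4 * bits 32-bit output words. Uses a simple
// sequential bit layout, not the lane-interleaved layout of the SIMD code.
void pack128_scalar(const uint32_t* in, uint32_t* out, uint32_t bits) {
    uint64_t acc = 0;    // bit accumulator
    uint32_t filled = 0; // number of valid bits currently in `acc`
    for (int i = 0; i < 128; i++) {
        acc |= uint64_t(in[i]) << filled;
        filled += bits;
        while (filled >= 32) { // flush full 32-bit words
            *out++ = uint32_t(acc);
            acc >>= 32;
            filled -= 32;
        }
    }
    // 128 * bits is a multiple of 32, so the accumulator is empty here.
}
```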
bnprks commented 1 month ago

Thanks for taking a look over all this! I added a few more comments to help point to the reference materials from within the code.