Closed by bnprks 1 month ago
Recommended resources to get up-to-speed on the BP-128 codecs: the `pack128` and `unpack128` functions in `test-bp128.cpp`, which show the vanilla BP-128 algorithm in a very readable format.

Thanks for taking a look over all this! I added a few more comments to help point to the reference materials from within the code.
When re-running some benchmarks, I observed 30-40% performance regressions for in-memory BP-128 compression relative to the pre-Gentech code (e.g. commit 91ed30). This turned out to be caused by the shift from macro-based code to C++ lambda-based code: for the BP-128 kernels it is extremely important to avoid loops or function calls in the core pack/unpack functions, but the lambda-based code was not properly inlining some of the lambdas.
This PR makes two important changes and one unimportant change that may be useful for the future.
Important change 1: Shift back to using pure C macros to construct the core BP-128 kernels. These are the `BP128_*_DECL` macros, which try to isolate the core bitpacking logic in one place so that each variant need only specify its arguments and the transform logic applied pre/post bitpacking. Because they are macros rather than C++ lambdas the code is very ugly, but it should rarely (if ever) need updating. The good news is that with macros there are no function calls or loops for the compiler to improperly leave in.

Important change 2: Reduce unnecessary memory copies in the in-memory benchmarking code.
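A minimal sketch of the macro idiom (the real `BP128_*_DECL` macros are considerably more involved; the macro and function names here are invented): the transform expression is pasted directly into the generated function body, so the compiler sees straight-line code with no lambda or function call that could be left un-inlined.

```cpp
#include <cstdint>

// Hypothetical illustration of declaring kernel variants via a macro.
// TRANSFORM is substituted as text, so each variant compiles to a plain
// loop body with no call indirection for the optimizer to miss.
#define DECL_PACK_VARIANT(NAME, TRANSFORM)                              \
    inline void NAME(const uint32_t *in, uint32_t *out, unsigned n) {   \
        for (unsigned i = 0; i < n; i++) {                              \
            uint32_t v = in[i];                                         \
            out[i] = (TRANSFORM);                                       \
        }                                                               \
    }

// Variants only specify their transform logic:
DECL_PACK_VARIANT(pack_identity, v)              // no transform
DECL_PACK_VARIANT(pack_delta_ref, v - in[0])     // e.g. subtract a reference value
```

The lambda-based equivalent relies on the compiler inlining a callable at every call site; the macro version sidesteps that by never creating a callable in the first place.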
As shown in the table below, switching back to a macro-based solution took the performance hit from 30-40% down to 5-10%, and reducing the memory copies in the in-memory benchmark then allowed further substantial improvements.
Unimportant change: adding the `Nx128` wrapper functions, which allow packing/unpacking multiple 128-integer chunks with a single function call. This reduces the number of indirect function calls required and can yield modest time savings by amortizing the highway library's dynamic dispatch (which selects the best implementation for the available hardware features). However, the remaining big bottleneck is extra memory copies due to the existing `UIntReader` and `UIntWriter` classes. So it seemed not worth integrating the `Nx128` functions into `BP128UIntReader`/`BP128UIntWriter`, given that it would be tricky and would just need to be redone when/if the existing interfaces are improved.

To measure these overheads, I did a bit of benchmarking with just the vanilla BP-128 codec (no transforms), which should be the most impacted by overheads. This compared the existing benchmark code against bypassing the `BP128UIntReader`/`BP128UIntWriter` classes and calling the `BPCells::simd::bp128` bitpacking functions directly on the memory of the arrays.

(Benchmark table for the BP-128 codec; only the column headers survive here: `BP128UIntReader/Writer`, `[un]pack_Nx128()`, `[un]pack()`.)