Open burmanm opened 5 years ago
As an implementation note, the optimal selection of words vs left-greedy was not beneficial in inverted indexing when it came to the Simple-family as shown in the paper "Optimal Packing in Simple Family Codecs".
In the MILC-paper they arrive with a different conclusion, so it might be worth investigating the optimal solution in page-size vs. left-greedy approach. Latter one might not provide significantly worse compression ratio, but could provide large performance improvement.
Taking some inspiration from MILC: Inverted List Compression in Memory, we should implement their algorithm but add some changes for Java and some improvements for compression ratio:
No SIMD tricks (for now), but we should still ensure other features (such as direct access). With scalar processing and Java memory handling we might need to also use different memory layout for the "skip pointers"
Since dynamic programming is used to determine optimal block size, we should also choose the optimal offset which would use a minimal amount of bits in the block. Slows down compression but improves compression ratio. Original paper uses the first value always.
The original paper does not mention what happens when offset is negative, thus we will probably need to use ZigZag to encode the values to avoid issues/branches in decompression