bitwiseshiftleft / compressed_map

Frayed ribbon filter cascade
MIT License

how does the compression work? #5

Open Eh2406 opened 1 day ago

Eh2406 commented 1 day ago

After extensively reading the code, my best guess was:

I think the algorithm (compressed_map) is using "arithmetic coding" (or some homegrown alternative). But I can explain it better using Huffman encoding.

The size of a probabilistic map is (some constant C) × (the number of keys we want to guarantee we get the correct answer for) × (the size of each value), if we ignore taking advantage of "accidentally getting the correct answer". At first blush it seems like we're stuck: the size of each value is basically fixed, as is the number of keys we care about getting the correct answer for. We could deal with different bits of the value in different structures, but our formula is linear, so that doesn't gain us anything. It takes the same amount of space to say we have N structures that each encode 1 bit as it does to say we have 1 structure that encodes N bits.

That's where Huffman encoding comes in. We encode the values according to the Huffman algorithm, then process things one bit at a time. The first data structure encodes the first bit and needs to be correct for every key, so it takes quite a bit of space; it avoids being a disastrous amount because it's only one bit wide. We can do this again for the next bit, but this time we can skip any key whose Huffman encoding has already reached a leaf node. We don't care what value the structure returns for that bit of that key, because a reader processing that key will already have found a value. With every additional bit we have fewer keys to process and therefore a smaller data structure. (There's a toy sketch of this counting at the end of this comment.)

and you responded with:

My compressed_map crate can be thought of as more or less the same idea as a Huffman code plus a BinaryFuse-like matrix map, much as @Eh2406 suggests. It's not quite a Huffman code because the possibility of getting the correct answer "by accident" changes the cost model, but it's the same idea.

I would love to hear more about how you analyze the cost model, and how you took advantage of getting the correct answer by accident. But I don't want to sidetrack that issue any more than I already have, so I thought I would open a separate issue for you to expound on your insights, if you're willing to.
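To make the per-bit counting in my guess above concrete, here's a toy sketch (the key type, the prefix-free code, and the helper function are all invented for illustration; I'm not claiming this is how compressed_map itself is organized):

```rust
use std::collections::HashMap;

// Level i of the cascade only needs to be correct for keys whose codeword is longer
// than i; keys whose codeword has already ended are don't-cares at that level.
fn constraints_per_level<'a>(code: &HashMap<&'a str, &'a str>) -> Vec<Vec<(&'a str, bool)>> {
    let max_len = code.values().map(|c| c.len()).max().unwrap_or(0);
    let mut levels = Vec::new();
    for i in 0..max_len {
        let mut level = Vec::new();
        for (&key, &bits) in code {
            if let Some(b) = bits.as_bytes().get(i) {
                level.push((key, *b == b'1')); // this key is still "alive" at depth i
            }
        }
        levels.push(level);
    }
    levels
}

fn main() {
    // A prefix-free code for four values with skewed frequencies.
    let code: HashMap<&str, &str> = [("a", "0"), ("b", "10"), ("c", "110"), ("d", "111")].into();
    for (i, level) in constraints_per_level(&code).iter().enumerate() {
        println!("bit {}: {} constrained keys", i, level.len());
    }
    // Prints 4, then 3, then 2: each deeper one-bit structure handles fewer keys.
}
```

The point is just that each successive one-bit structure only has to be correct for the keys whose codeword hasn't ended yet, so the levels shrink.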

burdges commented 1 day ago

Just FYI, his 20-minute talk "Improved CRL compression with structured linear functions" at RWC 2022 discusses this some.

bitwiseshiftleft commented 15 hours ago

So you could use a Huffman code and you would get the usual cost: say you have up to 4 bits in the code, and foo has the encoding 11??. Then for each key that maps to foo you have to constrain the output value to exactly 11 for the first two maps (or the two-bit output from a single CompressedRandomMap), and the other two bits you don't care about, so you don't constrain them. Each constraint costs 1+epsilon bits, where epsilon is the overhead of the underlying CompressedRandomMap. So this example costs 2(1+epsilon) bits for each key that maps to foo.
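Spelling that out as a throwaway calculation (the epsilon value here is just something picked for illustration, not the crate's measured overhead):

```rust
fn main() {
    // Cost model from the paragraph above: each constrained output bit costs
    // (1 + epsilon) bits of storage, where epsilon is the per-bit overhead of the
    // underlying CompressedRandomMap; unconstrained bits cost nothing.
    let epsilon = 0.001_f64; // illustrative value only
    let constrained_bits = 2.0; // "11" must match exactly
    let dont_care_bits = 2.0; // the two '?' positions
    let cost_per_foo = constrained_bits * (1.0 + epsilon) + dont_care_bits * 0.0;
    println!("cost per key mapping to foo: {cost_per_foo:.3} bits"); // ~2.002
}
```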

However, you don't have to constrain the outputs of these linear maps; left unconstrained, they will give you (depending on the hashes and whatever) a pseudorandom answer, which is an option not present in most applications of Huffman codes. So you can also consider a non-power-of-two size: foo might instead have the encoding set {1011, 11??}. Encoding proceeds from the right, and you don't constrain the value of the foos in the rightmost two bits, because foo has valid encodings no matter what those two bits are (as far as I can tell this is optimal). Then some of the foo values will happen to land on 11, since the output is pseudorandom. For those values, the left two bits can be either 10 or 11 and they will decode correctly, so you encode them as 1?, costing one bit. For the other ones, you must set both of the first two bits to get a valid encoding, costing two bits.

So the cost in this case would be 1.75 bits per foo, because approximately 1/4 of them cost 1 bit and the other 3/4 cost 2 bits. You encode from right to left because the other direction would cost more: the first bit always costs you, the second doesn't, and then half the time you are in the 10 case and have to spend 2 more bits to finish out the 1011 branch, for a total cost of 2 bits per foo.
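Here is that expected-cost arithmetic written out, assuming (as above) that the two unconstrained bits come out uniformly pseudorandom:

```rust
fn main() {
    // Right to left, for the encoding set {1011, 11??}: leave the rightmost two bits
    // unconstrained. With probability 1/4 they happen to land on "11", so both 1011 and
    // 11?? are still reachable and only the leading bit must be pinned: 1 bit.
    // Otherwise only 11?? remains valid and both leading bits must be pinned: 2 bits.
    let p_accidental_11 = 0.25;
    let right_to_left = p_accidental_11 * 1.0 + (1.0 - p_accidental_11) * 2.0;

    // Left to right, for comparison: the first bit is always pinned (1 bit), the second
    // is free, and with probability 1/2 it lands on 0, forcing the 1011 branch (2 more bits).
    let left_to_right = 1.0 + 0.5 * 2.0;

    println!("right to left: {right_to_left} bits per foo"); // 1.75
    println!("left to right: {left_to_right} bits per foo"); // 2
}
```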

It turns out that it's always optimal for every encoding except for one to be a power-of-two size. So it's almost like a Huffman code. I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite where the ratios are like 1% revoked vs 99% not, and a Huffman code would be wasteful because it spends at least one bit to encode each value.
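For a rough sense of how much the one-bit floor gives up at CRLite-like ratios, you can compare it against the binary entropy of the value distribution, which is a lower bound that ignores any overhead of the structure itself:

```rust
// A two-value Huffman code spends a full bit per key no matter how skewed the split is.
fn binary_entropy(p: f64) -> f64 {
    -(p * p.log2() + (1.0 - p) * (1.0 - p).log2())
}

fn main() {
    let p_revoked = 0.01; // roughly the CRLite-style 1% revoked vs 99% not
    println!("Huffman floor: 1.000 bits per key");
    println!("entropy lower bound: {:.3} bits per key", binary_entropy(p_revoked)); // ~0.081
}
```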

Eh2406 commented 14 hours ago

That talk was fascinating. Thank you for sharing and presenting respectively. Here's some of what I got out of it.

The Huffman/arithmetic encoding discussed above reduces the arbitrary case down to a series of exact set-membership tests. CompressedRandomMap can store this in (some constant C) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value). We can ignore C, because everything I'm thinking about is linear in C, and it's about 1.001 for frayed ribbon according to the presentation. Since each membership test is a one-bit value, this takes about one bit per key.

If our values have a skewed enough distribution, then we can use an ApproxSet to do better. Without loss of generality, I'm going to assume that 1 is the less common value, and that 1 occurs in a portion P of the keys. We can use an ApproxSet first and then a full CompressedRandomMap only on the keys included in the ApproxSet. This two-level scheme will take (2^epsilon)*keys*P for the ApproxSet and P*keys + keys*(1-P)/(2^epsilon) for the CompressedRandomMap. In practice epsilon has to be an integer for the value == hash schemes, and should be a power of two to avoid unaligned reads. Playing with that formula in Excel, this strategy is worthwhile for P < 0.2 with epsilon = 1; epsilon = 2 beats that out for P < 0.1111; epsilon = 3 beats that out for P < 0.0154. This is a really cool insight. Thank you again.
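Here is that cost formula, exactly as written above, as a small sketch to play with instead of Excel (the function name and the sample values of P are mine):

```rust
// Bits per original key for the two-level ApproxSet + CompressedRandomMap scheme,
// using the formula exactly as written above (an illustrative helper, not crate API).
fn two_level_bits_per_key(p: f64, epsilon: i32) -> f64 {
    let approx_set = 2f64.powi(epsilon) * p; // ApproxSet term
    let survivors = p + (1.0 - p) / 2f64.powi(epsilon); // true 1s plus false positives
    approx_set + survivors // the second-level CompressedRandomMap costs ~1 bit per survivor
}

fn main() {
    // Baseline for comparison: a single CompressedRandomMap at ~1 bit per key.
    for &p in &[0.3, 0.2, 0.1, 0.05, 0.01] {
        for eps in 1..=3 {
            println!(
                "P = {:>4}: epsilon = {} costs {:.3} bits/key (baseline 1.000)",
                p,
                eps,
                two_level_bits_per_key(p, eps)
            );
        }
    }
}
```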

Going a little further, the second layer in that scheme is just another case of the exact set-membership tests, this time with P*keys + keys*(1-P)/(2^epsilon) keys and with a new P of (2^epsilon) * P / ((2^epsilon) * P - P + 1). This new P is generally > 0.2, so CompressedRandomMap is the best choice. But if the original P < 0.055, the new P ends up small enough to be worth adding a third layer!
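And the fraction of 1-values among the keys that survive the first level, using the expression above (again just an illustrative helper):

```rust
// Fraction of 1-values among the keys that pass the first ApproxSet, per the formula above.
fn surviving_fraction(p: f64, epsilon: i32) -> f64 {
    let e = 2f64.powi(epsilon);
    e * p / (e * p - p + 1.0)
}

fn main() {
    for &p in &[0.2, 0.1, 0.05, 0.01] {
        println!(
            "original P = {:>4}: P among the second-level keys = {:.3}",
            p,
            surviving_fraction(p, 1)
        );
    }
}
```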

And while I was writing this up, you replied. So next on my to-do list is to read your comment!

Eh2406 commented 12 hours ago

That is a very cool algorithm for reusing the accidental values. How do you construct the encoded keys such that you have a high probability of these accidental hits happening?

It turns out that it's always optimal for every encoding except for one to be a power-of-two size.

That's very interesting. Why does that end up happening? I would have guessed that you want, as much as possible, the number of symbols that map to the same foo to be proportional to the number of foos in the data set.

I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite where the ratios are like 1% revoked vs 99% not, and a Huffman code would be wasteful because it spends at least one bit to encode each value.

If I understand Huffman encodings correctly, a symbol that represents > 50% of the data is guaranteed to get a length-one code. So at that ratio the Huffman construction should automatically give us the most flexible encoding, 1?????. Even so, if that symbol completely dominates the distribution, then we end up in the skewed case where an ApproxSet + CompressedRandomMap fits well.