Eh2406 opened 1 day ago
Just FYI, his 20-minute talk "Improved CRL compression with structured linear functions" at RWC 2022 discusses this some.
So you could use a Huffman code and you would get the usual cost: like if you have up to 4 bits in the code, and `foo` has an encoding `11??`, then for each key that maps to `foo` you have to constrain the output value to exactly `11` for the first two maps (or two-bit outputs from a single `CompressedRandomMap`), and the other two you don't care about, so you don't constrain them. Each constraint costs 1+epsilon bits, where epsilon is the overhead of the underlying `CompressedRandomMap`. So this example costs 2(1+epsilon) bits for each key that maps to `foo`.
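For concreteness, here's that arithmetic as a tiny standalone sketch (my own code, not the crate's API): count the constrained bits in a codeword pattern and multiply by 1+epsilon.

```rust
// Estimating the per-key storage cost of the "constrain only the fixed bits"
// scheme described above. `pattern` uses '0'/'1' for constrained bits and '?'
// for don't-care bits; `epsilon` is the per-bit overhead of the underlying map.
fn cost_bits_per_key(pattern: &str, epsilon: f64) -> f64 {
    let constrained = pattern.chars().filter(|&c| c == '0' || c == '1').count();
    constrained as f64 * (1.0 + epsilon)
}

fn main() {
    // "11??": two constrained bits -> 2 * (1 + epsilon) bits per key mapping to foo.
    println!("{:.4}", cost_bits_per_key("11??", 0.001)); // ~2.002
}
```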
However, you don't have to constrain the outputs of these linear maps; instead they will give you (depending on the hashes and whatever) a pseudorandom answer, which is an option not present in most applications of Huffman codes. So you can also consider a non-power-of-two size: `foo` might instead have an encoding {`1011`, `11??`}. Encoding proceeds from the right, and you don't constrain the value of the `foo`s in the rightmost two bits, because `foo` has valid encodings no matter what those two bits are (as far as I can tell this is optimal). Then some of the `foo` values will happen to land on `11`, since it's pseudorandom. For those values, the left two bits can be either `10` or `11` and they will decode correctly, so you encode them as `1?`, costing one bit. For the other ones, you must set both of the first two bits to get the right encoding, costing two bits.

So the cost in this case would be 1.75 bits per `foo`, because approximately 1/4 of them cost 1 bit and the other 3/4 of them cost 2 bits. You encode from right to left because the other direction would cost more: the first bit always costs you, the second doesn't, and then half the time you are in the `10` case and have to spend 2 more bits to finish out the `1011` branch, for a total cost of 2 bits per `foo` on average.
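To make the 1.75 concrete, here's a toy calculation of the expected number of constrained bits per `foo` for {`1011`, `11??`} in both directions (again a standalone sketch, ignoring the per-constraint epsilon overhead):

```rust
fn main() {
    // Right-to-left: leave the rightmost two bits unconstrained.
    // With probability 1/4 they land on "11", and then constraining only the
    // leftmost bit to '1' yields either 1011 or 1111, both valid (1 bit).
    // Otherwise the left two bits must be forced to "11" (2 bits).
    let right_to_left = 0.25 * 1.0 + 0.75 * 2.0;

    // Left-to-right: the first bit must always be '1' (1 bit). The second bit is
    // left free; half the time it comes out '0' and we must finish the 1011
    // branch by constraining the last two bits (2 more bits), half the time it
    // comes out '1' and 11?? already matches with no further constraints.
    let left_to_right = 1.0 + 0.5 * 2.0;

    println!("right-to-left: {right_to_left} bits/foo"); // 1.75
    println!("left-to-right: {left_to_right} bits/foo"); // 2
}
```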
It turns out that it's always optimal for every encoding except for one to be a power-of-two size. So it's almost like a Huffman code. I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite where the ratios are like 1% revoked vs 99% not, and a Huffman code would be wasteful because it spends at least one bit to encode each value.
That talk was fascinating. Thank you for sharing and presenting respectively. Here's some of what I got out of it.
The Huffman/Arithmetic encoding discussed above reduces the arbitrary case down to a series of exact set membership tests. `CompressedRandomMap` can store this in (some constant `C`) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value). We can ignore `C`, because everything I'm thinking about is linear in `C`, and it's 1.001 for frayed ribbon according to the presentation. So for one-bit values this takes one bit per key.
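Spelled out as a trivial snippet (nothing crate-specific, just the size formula above):

```rust
// Size of a CompressedRandomMap per the formula above: C * keys * value_bits.
fn crm_cost_bits(c: f64, keys: u64, value_bits: u32) -> f64 {
    c * keys as f64 * value_bits as f64
}

fn main() {
    // ~1,001,000 bits for a million keys with 1-bit values, using C = 1.001.
    println!("{:.0}", crm_cost_bits(1.001, 1_000_000, 1));
}
```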
If our values have a skewed enough distribution, then we can use `ApproxSet` to do better. Without loss of generality, I'm going to assume that `1` is the less common value, and that `1` occurs a `P` portion of the time. We can use an `ApproxSet` first and then a full `CompressedRandomMap` only on the keys included in the `ApproxSet`. This two-level scheme will take `(2^epsilon)*keys*P` bits for the `ApproxSet` and `P*keys + keys*(1-P)/(2^epsilon)` bits for the `CompressedRandomMap`. In practice `epsilon` has to be an integer for the `value == hash` schemes, and should be a power of two to avoid unaligned reads. Playing with that formula in Excel, this strategy is worthwhile for `P < 0.2` with `epsilon = 1`; `epsilon = 2` beats that out for `P < 0.1111`; and `epsilon = 4` beats that out for `P < 0.0154`. This is a really cool insight. Thank you again.
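Here's that cost model written out so the crossovers are easy to reproduce (just my own sketch of the formulas as written above; costs in bits per original key, ignoring `C`):

```rust
/// Two-level scheme: an ApproxSet over the P fraction of 1-keys at 2^epsilon
/// bits per included key (the formula as written above), plus a
/// CompressedRandomMap over the true positives and the (1-P)/2^epsilon
/// false positives, at 1 bit each.
fn two_level(p: f64, epsilon: u32) -> f64 {
    let e = (1u64 << epsilon) as f64; // 2^epsilon
    e * p + (p + (1.0 - p) / e)
}

fn main() {
    // Baseline: one CompressedRandomMap over all keys with 1-bit values = 1 bit/key.
    for &p in &[0.3, 0.2, 0.1111, 0.0154, 0.01] {
        println!(
            "P = {p:<6}: eps=1 -> {:.3}, eps=2 -> {:.3}, eps=4 -> {:.3}",
            two_level(p, 1),
            two_level(p, 2),
            two_level(p, 4),
        );
    }
    // Crossovers: eps=1 drops below 1 bit/key at P = 0.2, eps=2 overtakes eps=1
    // at P = 1/9 ~ 0.1111, and eps=4 overtakes eps=2 at P ~ 0.0154.
}
```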
Going a little further, the second layer here is just another case of the exact set membership tests, this time with `P*keys + keys*(1-P)/(2^epsilon)` keys and with a new `P` of `(2^epsilon)*P / ((2^epsilon)*P - P + 1)`. This new `P` is generally > 0.2, so `CompressedRandomMap` is the best choice. But if the original `P < 0.055`, the new `P` ends up small enough to be worth adding a third layer!
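Continuing that sketch, the recursion is easy to check numerically (again just my own code; I'm assuming `epsilon = 2` for the first layer, which is the best choice in that range of `P` per the thresholds above):

```rust
// Fraction of 1-values among the keys that survive the first ApproxSet,
// using the new-P formula from the comment above.
fn next_p(p: f64, epsilon: u32) -> f64 {
    let e = (1u64 << epsilon) as f64; // 2^epsilon
    e * p / (e * p - p + 1.0)
}

fn main() {
    for &p in &[0.2, 0.1, 0.055, 0.01] {
        let p2 = next_p(p, 2);
        // Reusing the "worthwhile below P = 0.2" rule from the two-level analysis:
        // if the new P is still under 0.2, a further layer pays for itself.
        println!("P = {p}: new P = {p2:.4}, third layer worthwhile: {}", p2 < 0.2);
    }
}
```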
And while I was writing this up, you replied. So next on my to-do list is to read your comment!
That is a very cool algorithm for reusing the accidental values. How do you construct the encoded keys such that you have a high probability of these accidental matches happening?
> It turns out that it's always optimal for every encoding except for one to be a power-of-two size.
That's very interesting. Why does that end up happening? I would have guessed you want, as much as possible, the number of symbols that map to the same `foo` to be proportional to the number of `foo`s in the data set.
> I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite where the ratios are like 1% revoked vs 99% not, and a Huffman code would be wasteful because it spends at least one bit to encode each value.
If I understand Huffman encodings correctly, a symbol that represents > 50% of the data is guaranteed to get a length-one code. So that ratio should automatically give us the most flexibility: `1?????`. Even so, if that symbol completely dominates the distribution, then we end up in the skewed case where using an `ApproxSet` + `CompressedRandomMap` fits well.
After extensively reading the code, my best guess was:
and you responded with:
I would love to hear more about how you analyze the cost model, and how you took advantage of getting the correct answer by accident. But I don't want to sidetrack that issue any more than I already have. So I thought I would open a separate issue for you to expound on your insights, if you're willing to.