davidben / tls-trust-expressions

Other
5 stars 2 forks source link

How much could we compress the list of trust anchors? #64

Open bwesterb opened 3 months ago

bwesterb commented 3 months ago

I do not think clients should send a lot of trust anchors. Nor do I think we should complicate the standard with a compression mechanism like this. It's good to think about though.

Coding subsets of integers

Theoretical limit

For the moment, let's simplify and assume trust anchors are numbers instead of OIDs. Say there are k=200 anchors with maximum value N=62,203 (number of PENs today). There are N choose k possible subsets of anchors to communicate. Thus, we cannot expect to code every subset with less than lg N choose k bits. We can approximate this well using Stirling's approximation: 243 bytes.

Rice coding

We can get quite close to this limit using a very simple method: Rice coding of differences. For each delta we compute q = floor(d / 2^b) and r = d mod 2^b, where b = round(-lg(-lg(1-k/N))) = 8. q is encoded in unary and r in binary, so 600 would be encoded as 0b11001011000. The 110 encodes q=2; the remainder is the remainder r. The average size using Rice coding for random subsets is about 244.3 bytes with a 95th percentile of 244.8 bytes.

Rice coding doesn't take advantage of big gaps. If we want to code {N-200, N-199, ..., N-1} that takes 255.25 bytes.

Huffman bitlength

A different approach is to write each delta in two parts: first code its bitlength, and then to write the delta (without its most significant bit). For the bitlengths we can use a Huffman code, which can be compressed efficiently using canonical trees and delta coding of the code lengths as in bzip2. (Here is an implementation.)

For random subsets this performs worse than Rice coding: 253 bytes on average with a 95th percentile of 255 bytes. However, it encodes {N-200, N-199, ..., N-1} in only 50 bytes, and there is room for improvement.

davidben commented 2 months ago

Another goofy encoding idea I had was just ranges. The old MTC scheme could be done with aliases on the issuance side ("if the client says batch 27, assume they also know batches 10 through 26"), but it could also be done by letting you efficiently encode things like 32473.123.{10-27} (32473.123 is the MTC CA prefix).

I could also imagine taking some of the goofy exclusion labels that TE had and instead recasting them as some kind of sequence of ranges. If we assume we don't need to punch too many holes, splitting ranges in two isn't horrible. And you could periodically rebalance by numbers by allocating new aliases higher up the number space, trusting that newer certs are aware of it. Not sure.

The DNS mechanism does handily remove the need for all these encoding tricks, but depends on DNS, so... 🤷