davidben / merkle-tree-certs

MTC Fallback Estimates #89

Open meyira opened 3 months ago

meyira commented 3 months ago

Summary of MTC Fallback Estimates

We aim to estimate the probability of triggering the fallback to another PKI mechanism when Merkle Tree Certificates (MTCs) are widely deployed. We estimate the fallback probabilities from the server side by monitoring historical certificate data and checking for incorrect configurations and potential attacks that risk the client becoming out of sync.

Estimation Method

We queried all certificates issued since 2022-01-01 for 75 of the top 100 domains on Radar; we call these top domains. In addition, we queried certificates issued since 2022-01-01 for domains whose SipHash ends in 0000; we call these random domains. Note that the statistics for random domains may include a large share of false positives in the fallback estimates, because certificates for random-domain.tld are in the sample but certificates for www.random-domain.tld are not. Our estimates are therefore a rather loose upper bound, but we believe they are useful nonetheless.
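A minimal sketch of the sampling rule for random domains: keep a domain when the hex digest of a keyed hash ends in 0000. Python's standard library does not expose SipHash directly, so keyed BLAKE2b from `hashlib` stands in for it here. The key, the 8-byte digest size, and the reading of "0000" as four hex digits (roughly 1 in 65,536 domains) are assumptions, not details from the measurement.

```python
import hashlib

# Hypothetical key; the key used for the actual measurement is not stated.
SAMPLE_KEY = b"example-key"


def in_random_sample(domain: str, suffix: str = "0000") -> bool:
    """Return True if the domain's keyed hash (as hex) ends with `suffix`."""
    digest = hashlib.blake2b(
        domain.lower().encode("utf-8"), key=SAMPLE_KEY, digest_size=8
    ).hexdigest()
    return digest.endswith(suffix)


def sample_rate(suffix: str = "0000") -> float:
    """Expected fraction of domains selected, assuming a uniform hash."""
    return 16.0 ** -len(suffix)
```

Because each name is hashed independently, random-domain.tld and www.random-domain.tld almost never land in the sample together, which is the source of the false positives mentioned above.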

Fallback triggers

We have identified the following triggers for fallbacks so far:

  1. New domains: A new domain needs time to propagate its new tree heads. Visitors to the new domain immediately trigger the fallback mechanism until they have received the correct tree head. Since domains are not visited uniformly, a new domain may experience low traffic and therefore not trigger the fallback often.
  2. Expired certificates: An expired certificate immediately triggers a fallback. Based on historical data, this is the most likely fallback trigger. However, expired certificates are a configuration problem that the widespread use of MTCs would eliminate.
  3. Changes between certification authorities: A change of CA may indicate that the domain has moved. Looking at data where the CA did not change, we found that most renewals happen 14, 30 or 90 days before a certificate expires. Using this data, we distinguish between two cases by monitoring the overlap between the certificates:
    1. If the overlap is 14, 30 or 90 days (+-24 hours), we assume the change of CA is planned.
    2. Otherwise, we assume the change is not planned and needs to happen immediately, e.g., due to an unplanned migration of the domain. This would trigger a fallback. 34% of CA switches fall into this case.
    3. Before assuming a fallback in case 3.2, we check whether we have another certificate from the same CA that overlaps the current certificate by >= 1 s. If so, we again assume that the change was planned and that no fallback is needed.
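The CA-change rule in item 3 can be sketched roughly as follows. The `Cert` type, the function names, and the reading of "current certificate" as the newly issued one are assumptions made for illustration; the thresholds (14, 30 or 90 days, +-24 hours, >= 1 s) come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

PLANNED_OVERLAP_DAYS = (14, 30, 90)  # common renewal lead times
TOLERANCE = timedelta(hours=24)      # the "+-24 hours" window


@dataclass(frozen=True)
class Cert:  # hypothetical record type, not from the measurement code
    issuer: str
    not_before: datetime
    not_after: datetime


def overlap(a: Cert, b: Cert) -> timedelta:
    """Time during which both certificates are valid (negative means a gap)."""
    return min(a.not_after, b.not_after) - max(a.not_before, b.not_before)


def is_planned(delta: timedelta) -> bool:
    """Case 3.1: overlap within 24 hours of 14, 30 or 90 days."""
    return any(
        abs(delta - timedelta(days=d)) <= TOLERANCE for d in PLANNED_OVERLAP_DAYS
    )


def ca_change_triggers_fallback(
    old: Cert, new: Cert, others: Iterable[Cert] = ()
) -> bool:
    if old.issuer == new.issuer:
        return False  # not a CA change at all
    if is_planned(overlap(old, new)):
        return False  # case 3.1: planned switch
    # Case 3.3: another certificate from the new CA overlapping the new one.
    for cert in others:
        if (
            cert != new
            and cert.issuer == new.issuer
            and overlap(cert, new) >= timedelta(seconds=1)
        ):
            return False
    return True  # case 3.2: unplanned switch
```

For example, a switch where the new certificate starts 7 days before the old one expires counts as a fallback unless a second certificate from the new CA overlaps it.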

Probability of Fallbacks

| Fallback trigger | Top domains, relative | Top domains, over 2.5 years | Random domains, relative | Random domains, over 2.5 years |
| --- | --- | --- | --- | --- |
| New domain (daily) | 0 % | 0 % | 0.07 % | 0.07 % |
| Validity gaps * (over 2.5 years) | 0 % | 0.003 % | 0.01 % | 3.03 % |
| Irregular domain move (over 2.5 years) | 0.004 % | 1.1 % | 0.01 % | 3.3 % |
| Overall | 0.004 % | 1.1 % | 0.09 % | 3.4 % |

The table above provides the absolute probability of triggering a fallback over 2.5 years and the relative probability, assuming it takes three days for the relying party to synchronize with the correct tree heads.
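One reading that is consistent with the table (an assumption, since the formula is not stated here) is that the relative probability scales the 2.5-year probability by the fraction of that period during which a relying party is still out of sync. The sketch below applies this to the irregular-domain-move rows; the daily new-domain row is not covered by this scaling.

```python
OBSERVATION_DAYS = 2.5 * 365  # measurement window, about 912.5 days
SYNC_DAYS = 3                 # assumed time to receive the correct tree heads


def relative_probability(
    absolute_pct: float,
    sync_days: float = SYNC_DAYS,
    observation_days: float = OBSERVATION_DAYS,
) -> float:
    """Scale a probability over the whole window down to the sync window."""
    return absolute_pct * sync_days / observation_days


for label, absolute_pct in [
    ("top domains, irregular move", 1.1),
    ("random domains, irregular move", 3.3),
]:
    print(f"{label}: {relative_probability(absolute_pct):.4f} %")
# top domains, irregular move: 0.0036 %
# random domains, irregular move: 0.0108 %
```

These values round to 0.004 % and 0.01 %, matching the relative columns for that row.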

We warmly welcome further possible fallback triggers.

devonobrien commented 3 months ago

Hi meyira!

First off, thanks for doing this analysis! Building performance estimates from real-world data is super useful for understanding how MTCs can operate in practice. Please forgive the length of this response; it’s a bit rambly as I work out how to reason about these measurements.

So, the question I think we want to answer is something along the lines of: “In a world in which MTCs are deployed, how often, and in what circumstances do we expect the fallback to be needed for certificate validation to succeed?”

When an MTC is issued, there will be a period of time in which clients cannot validate the MTC because they have not yet received a tree head that corresponds to the batch containing this issuance (let’s call this the fallback period). For convenience, we can organize clients into rough buckets:

  1. Clients that support MTCs and reliably receive dynamic updates containing batch heads, where the fallback period is dominated by vendor update frequency (e.g. Chrome would be something like <= 6 hours).
  2. Clients that support MTCs but do not receive timely dynamic updates (e.g. some enterprises that disable automatic updates but roll them out on some cadence after vetting, possibly on the scale of O(days)).
  3. Clients that don’t support MTCs and will always need a fallback.
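As a rough sketch of how the three buckets differ, the snippet below treats a client as needing the fallback until its update lag has elapsed since issuance. The enum names and the 72-hour lag for bucket 2 are illustrative assumptions; only the 6-hour figure for bucket 1 appears above.

```python
from enum import Enum


class Bucket(Enum):
    TIMELY_UPDATES = 1   # supports MTCs, receives tree heads promptly
    DELAYED_UPDATES = 2  # supports MTCs, updates arrive on a slower cadence
    NO_MTC_SUPPORT = 3   # never validates MTCs


# Assumed worst-case update lags in hours (72 is a placeholder for "O(days)").
ASSUMED_UPDATE_LAG_HOURS = {
    Bucket.TIMELY_UPDATES: 6,
    Bucket.DELAYED_UPDATES: 72,
}


def needs_fallback(bucket: Bucket, hours_since_issuance: float) -> bool:
    """Worst case: the client is stale for its full update lag."""
    if bucket is Bucket.NO_MTC_SUPPORT:
        return True  # independent of issuance or update cadence
    return hours_since_issuance < ASSUMED_UPDATE_LAG_HOURS[bucket]
```

Under these assumptions, a certificate issued 12 hours ago validates for bucket 1 but still needs the fallback for buckets 2 and 3.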

Buckets 1 and 2 are related to fresh MTC issuance to some degree, while 3 represents an independent population whose behavior is unaffected by issuance or update cadence. Since fresh MTC issuance is difficult to measure without MTCs being widely deployed, we can attempt to infer the likelihood of fallback for bucket 1 (and bucket 2 to a lesser extent) by looking at publicly available certificate and domain data and identifying scenarios that would force an MTC fallback:

Before digging too far into the numbers, does this list look right? I think our ability to measure fallback today might be limited to inferring the behavior of predominantly bucket 1-type clients from a limited amount of observable data related to the above scenarios.

meyira commented 3 months ago

Hi, thanks for looking over the estimates and insisting on refining the criteria! I agree they are still very rough, but I wanted to get a feeling for the likelihood of triggering a fallback. The bucket division is interesting! I agree that bucket 1 is likely the only one I can estimate; however, bucket 3 is also unlikely to check SCTs right now. For bucket 2, the fallback probability likely depends on how much configuration MTC is going to offer to clients and on how long the cadence period is. I guess estimating bucket 2 would need an actual deployment. The list of criteria looks good to me, except for MTC incorrectness: is it expected behaviour to fall back when the MTC is malformed?

bwesterb commented 3 months ago

I don't quite follow the MTC Incorrectness scenario. Are you proposing that on receiving an invalid certificate, the client automatically retries, asking for a different one? I don't think that's very desirable, as you would indeed lose visibility into misconfigurations and other errors. This is separate from servers being able to negotiate and support multiple certificates.

On bucket 2: I'd say it makes sense to broaden this to all clients that are willing but potentially stale, including staleness caused by the network (e.g., airplane, strict network firewall). It'd be great to have better insight into such client staleness in some way.

devonobrien commented 3 months ago

Ah, yes. Incorrectness actually doesn’t matter for this analysis; you’re right. It’s a situation where having some other certificate would be helpful if it were able to be negotiated but not actually useful when measuring expected MTC fallback behavior.

Re: client staleness, we looked at whether this could be inferred from existing metrics, but I don’t think we have a great way to measure this directly today. We could likely add the relevant metrics in an A/B test, but gathering them would have to wait for prototyping and an experimental rollout.