meyira commented 3 months ago

Summary of MTC Fallback Estimates

We aim to estimate the probability of triggering the fallback to another PKI mechanism when Merkle Tree Certificates are deployed widely. We estimate the fallback probabilities from the server side by monitoring historic certificate data, checking for incorrect configurations and potential attacks that risk the client becoming out of sync.

Estimation Method

We queried all certificates issued since 2022-01-01 for 75 of the top 100 domains on Radar. We call these top domains. In addition, we queried certificates issued since 2022-01-01 for domains whose SipHash ends in 0000. We call these random domains. Note that the statistics for the random domains may have a high rate of false positives in the fallback estimates, because certificates for random-domain.tld are in the sample but certificates for www.random-domain.tld are not. Our estimates are therefore a rather loose upper bound, but we believe them to be useful nonetheless.
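As a rough illustration of the random-domain sampling, the sketch below keeps a domain when the hex form of a 64-bit hash of its name ends in 0000, which selects about 1 in 65,536 names. Python's standard library does not expose SipHash directly, so keyed BLAKE2b with an 8-byte digest stands in for it here. The key, the function names and the hex reading of "ending in 0000" are assumptions for illustration, not the setup used for the estimates.

```python
import hashlib

# Hypothetical sampling key; the key used for the real sample is not given in this thread.
SAMPLE_KEY = b"example-sampling-key"


def domain_digest(domain: str) -> str:
    """Hex form of a 64-bit keyed hash of the name (BLAKE2b as a stand-in for SipHash)."""
    normalized = domain.strip().lower().rstrip(".")
    digest = hashlib.blake2b(normalized.encode("utf-8"), digest_size=8, key=SAMPLE_KEY)
    return digest.hexdigest()


def in_random_sample(digest_hex: str) -> bool:
    """A name is sampled when its hex digest ends in four zeros."""
    return digest_hex.endswith("0000")


def sample(domains):
    """Filter an iterable of names down to the sampled subset."""
    return [d for d in domains if in_random_sample(domain_digest(d))]
```

Because the hash is taken over the full name, random-domain.tld and www.random-domain.tld are sampled independently, which is where the false positives mentioned above come from.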

Fallback triggers

We have identified the following triggers for fallbacks so far:

New domains: A new domain needs time to propagate its new tree heads. Visitors to the new domain immediately trigger the fallback mechanism until they have received the correct tree head. Since domains are not visited uniformly, a new domain may experience low traffic and therefore not trigger the fallback often.
Expired Certificates: An expired certificate immediately triggers a fallback. In the historic data, this is the most likely fallback trigger; however, expired certificates are a configuration problem that the widespread use of MTCs would eliminate.
Changes between Certification Authorities: A change of CA may indicate that the domain has moved. Looking at data where the CA did not change, we found that most renewals happen 14, 30 or 90 days before a certificate expires. Using this data, we distinguish between two cases by monitoring the overlap between the old and the new certificate:
1. If the overlap is 14, 30 or 90 days (±24 hours), we assume the change of CA is planned.
2. Otherwise, we assume the change is not planned and needs to happen immediately, e.g., due to an unplanned migration of the domain. This would trigger a fallback. 34% of CA switches fall into this case.
3. Before assuming a fallback in case 2, we check whether we have another certificate from the same CA that overlaps the current certificate for >= 1s. If so, we again assume that the change was planned and no fallback is needed.
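The three rules above can be sketched as a small classifier. This is a minimal sketch under stated assumptions: the `Cert` record and all function names are hypothetical, and rule 3 is read as "any other certificate from the old CA whose validity overlaps the new certificate", which is only one possible reading of the description.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Sequence


class Cert(NamedTuple):
    """Hypothetical minimal certificate record: issuer name and validity window."""
    issuer: str
    not_before: datetime
    not_after: datetime


PLANNED_OVERLAPS = (timedelta(days=14), timedelta(days=30), timedelta(days=90))
TOLERANCE = timedelta(hours=24)


def overlap(a: Cert, b: Cert) -> timedelta:
    """Length of the window in which both certificates are valid (negative if there is a gap)."""
    return min(a.not_after, b.not_after) - max(a.not_before, b.not_before)


def is_planned_overlap(delta: timedelta) -> bool:
    """Rule 1: the overlap matches a common renewal lead time, within the tolerance."""
    return any(abs(delta - planned) <= TOLERANCE for planned in PLANNED_OVERLAPS)


def ca_change_triggers_fallback(old: Cert, new: Cert, others: Sequence[Cert] = ()) -> bool:
    """Return True if switching from `old` to `new` is counted as a fallback trigger."""
    if old.issuer == new.issuer:
        return False  # not a CA change at all
    if is_planned_overlap(overlap(old, new)):
        return False  # rule 1: planned change
    # Rule 3 (assumed reading): another certificate from the old CA overlaps the new one.
    for cert in others:
        if cert != old and cert.issuer == old.issuer and overlap(cert, new) >= timedelta(seconds=1):
            return False
    return True  # rule 2: unplanned change
```

For example, an old certificate that expires 30 days after the new certificate becomes valid is classified as planned, while a 45-day overlap with no other overlapping certificate from the old CA is counted as a fallback.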

Probability of Fallbacks

Fallback Trigger	Top Domains, relative	Top Domains, over 2.5 years	Random Domains, relative	Random Domains, over 2.5 years
new domain (daily)	0 %	0 %	0.07 %	0.07 %
validity gaps * (over 2.5 years)	0 %	0.003 %	0.01 %	3.03 %
Irregular Domain Move (over 2.5 years)	0.004 %	1.1 %	0.01 %	3.3 %
Overall	0.004 %	1.1 %	0.09 %	3.4 %

The table above provides the absolute probability of triggering a fallback over 2.5 years and the relative probability, assuming it takes three days for the relying party to synchronize with the correct tree heads.
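One reading that is consistent with the figures in the table is that the relative probability is the 2.5-year probability scaled by the fraction of the period a relying party spends out of sync (three days out of roughly 912.5). The relationship and the names below are an inference from the numbers, not a formula stated in this thread.

```python
PERIOD_DAYS = 2.5 * 365  # observation window, assuming 365-day years
SYNC_DAYS = 3            # assumed time for a relying party to catch up on tree heads


def relative_probability(absolute_percent: float,
                         sync_days: float = SYNC_DAYS,
                         period_days: float = PERIOD_DAYS) -> float:
    """Scale a probability over the whole period (in percent) to the out-of-sync window."""
    return absolute_percent * sync_days / period_days


# Absolute values taken from the table (percent over 2.5 years).
top_irregular_move = relative_probability(1.1)
random_irregular_move = relative_probability(3.3)
random_validity_gaps = relative_probability(3.03)
```

Under this reading, 1.1 % over 2.5 years scales to about 0.0036 %, which rounds to the 0.004 % in the table, and both 3.3 % and 3.03 % scale to about 0.01 %.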

We warmly welcome suggestions for further possible fallback triggers.

devonobrien commented 3 months ago

Hi meyira!

First off, thanks for doing this analysis! Building performance estimates from real-world data is super useful for understanding how MTCs can operate in practice. Please forgive the length of this response; it’s a bit rambly as I work out how to reason about these measurements.

So, the question I think we want to answer is something along the lines of: “In a world in which MTCs are deployed, how often, and in what circumstances do we expect the fallback to be needed for certificate validation to succeed?”

When an MTC is issued, there will be a period of time in which clients cannot validate the MTC because they have not yet received a tree head that corresponds to the batch containing this issuance (let’s call this the fallback period). For convenience, we can organize clients into rough buckets:

1. Clients that support MTCs and reliably receive dynamic updates containing batch heads, where the fallback period is dominated by vendor update frequency (e.g. Chrome would be something like <= 6 hours).
2. Clients that support MTCs but do not receive timely dynamic updates (e.g. some enterprises that disable automatic updates but roll them out on some cadence after vetting, possibly on the scale of O(days)).
3. Clients that don’t support MTCs and will always need a fallback.
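To make the buckets concrete, here is a minimal sketch of when a client would need the fallback for a freshly issued MTC. It assumes the fallback period is simply the client's update lag after issuance and ignores any batching delay on the CA side. The function name is hypothetical, and the 3-day lag for bucket 2 is only a placeholder for "on the scale of days".

```python
from datetime import datetime, timedelta
from typing import Optional

# Illustrative update lags per bucket; None means the client does not support MTCs.
BUCKET_1_LAG: Optional[timedelta] = timedelta(hours=6)  # e.g. the Chrome figure above
BUCKET_2_LAG: Optional[timedelta] = timedelta(days=3)   # placeholder for O(days)
BUCKET_3_LAG: Optional[timedelta] = None


def needs_fallback(now: datetime, issued_at: datetime, update_lag: Optional[timedelta]) -> bool:
    """True if the client cannot yet validate an MTC issued at `issued_at`."""
    if update_lag is None:
        return True  # bucket 3: always needs a fallback
    return now < issued_at + update_lag  # still inside the fallback period
```

Twelve hours after issuance, for instance, a bucket 1 client would already validate the MTC while bucket 2 and bucket 3 clients would still rely on the fallback.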

Buckets 1 and 2 are related to fresh MTC issuance to some degree, while 3 represents an independent population whose behavior is unaffected by issuance or update cadence. Since fresh MTC issuance is difficult to measure without MTCs being widely deployed, we can attempt to infer the likelihood of fallback for bucket 1 (and bucket 2 to a lesser extent) by looking at publicly-available certificate and domain data and identifying scenarios that would force an MTC fallback:

Brand new domains – We can infer fresh MTC issuance for domains that didn’t previously exist, but each domain experiences this ~once in its entire existence, so it’s a very rough lower bound. Due to the time it generally takes for a domain to become a top domain, I expect this to be significantly more meaningful for the siphash random domains.
Urgent re-issuance – When certificates need to be relied upon immediately after issuance, it forces even bucket 1 clients into a fallback period. This can be caused by leaf certificate compromise, sudden or unexpected CA distrust by some client, demand-based server scaling for sites where hosts cannot share existing certificates / keys, etc.
MTC Staleness – If a domain no longer has a time-valid MTC, all traffic will need to use a fallback authentication mechanism. This is difficult to measure right now because we don’t have a good grasp on the frequency of operational issues that would lead to an inability to obtain and serve fresh MTCs.
MTC Incorrectness – If an MTC is mis-issued, or otherwise fails to validate, servers will be forced to use a fallback certificate. This is both a good and bad change over the status quo: today, servers rarely have the option of serving something other than their default certificate, but as a result, malformed or incorrect certificates are detectable immediately, whereas this might go unnoticed for some time in a highly automated MTC environment.

Before digging too far into the numbers, does this list look right? I think our ability to measure fallback today might be limited to inferring behavior about predominantly bucket 1-type clients from a limited amount of observable data related to the above scenarios.

meyira commented 3 months ago

Hi, thanks for looking over the estimates and insisting on refining the criteria! I agree they are still very rough, but I wanted to get a bit of a feeling for the likelihood of triggering a fallback. The bucket division is interesting! I agree that bucket 1 is likely the only one I can estimate; however, bucket 3 is also unlikely to check SCTs right now. For bucket 2, the fallback probability likely depends on how much configuration MTC is going to offer to clients, and how long the cadence period is. I guess estimating bucket 2 would need an actual deployment. The list of criteria looks good to me, except for the MTC incorrectness: is it expected behaviour to fall back when the MTC is malformed?

bwesterb commented 3 months ago

I don't quite follow the MTC Incorrectness scenario. Are you proposing that on receiving an invalid certificate, the client automatically retries asking for a different one? I don't think that's very desirable, as you indeed lose visibility into misconfigurations and other errors. This is separate from servers being able to negotiate and support multiple certificates.

On bucket 2: I'd say it makes sense to broaden this to all willing but potentially stale clients, including those that are stale because of the network (e.g., airplane, strict network firewall). It'd be great to have better insight into such client staleness in some way.

devonobrien commented 3 months ago

Ah, yes. Incorrectness actually doesn’t matter for this analysis; you’re right. It’s a situation where having some other certificate would be helpful if it were able to be negotiated but not actually useful when measuring expected MTC fallback behavior.

Re: client staleness, we looked at whether this could be inferred from existing metrics, but I don’t think we have a great way to measure this directly today. We could likely add the relevant metrics in an A/B test, but that would have to wait for prototyping and experimental rollout.

davidben / merkle-tree-certs

MTC Fallback Estimates #89
