davidben / merkle-tree-certs


How long must long-expired batches be served #2


davidben commented 1 year ago

As currently written, neither CAs nor the transparency service is allowed to stop serving long-outdated batches. Should we relax this? Some nuisances:

bwesterb commented 1 year ago

> Storage windows should be much larger than validity windows, to allow for RP clock skew

For the common case, where the roots are side-loaded to the RP, the transparency service can also side-load an approximate time. The latest batch you got is an estimate of the time!
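To make that concrete, here's a minimal sketch of the "latest batch is a clock" idea, assuming batches are issued at a fixed cadence from a known start time; the constants and names here are illustrative, not from the draft:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical parameters: a fixed issuance cadence from a known batch-0 time.
BATCH_DURATION = timedelta(hours=1)
EPOCH = datetime(2023, 1, 1, tzinfo=timezone.utc)

def estimated_time(latest_batch_number: int) -> datetime:
    """Lower bound on the current time: batch N cannot have been
    issued before EPOCH + N * BATCH_DURATION."""
    return EPOCH + latest_batch_number * BATCH_DURATION

def clock_is_behind(local_now: datetime, latest_batch_number: int) -> bool:
    # If the local clock reads earlier than the latest batch's issuance
    # time, the clock is behind; the RP could correct it (or use the
    # estimate directly) before evaluating validity windows.
    return local_now < estimated_time(latest_batch_number)
```

The estimate is only a lower bound, but that's the useful direction: a batch the RP has verifiably received cannot have been issued in the future.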

> if it's really, really behind

If a transparency service is that much behind, then it can't perform its main function: allowing clients to authenticate. (Transparency service is perhaps not the best name.)

If everything functions normally, I'd say the CAs would really only need to store batches for another day or so. On the other hand, when things go wrong, it's nice to have some time to figure things out. We shouldn't go overboard here: it's harder for the CAs to serve a year's worth of batches than three weeks' worth.

Indeed, the size of assertions is dominated by the size of the public keys. If they're all RSA-2048, we're looking at roughly 1.5GB per day. Serving 32GB for three weeks is much easier than 560GB for a year. If they're all Dilithium3, then we're looking at 250GB for three weeks and 4.2TB for a year. See also #6.
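For reference, a back-of-envelope sketch of where those figures come from; the issuance rate is an assumption picked to reproduce the ~1.5GB/day RSA-2048 number, so the totals land near (not exactly on) the ones quoted above:

```python
RSA_2048_PK = 270      # bytes; rough size of a DER-encoded RSA-2048 public key
DILITHIUM3_PK = 1952   # bytes; Dilithium3 public key size
CERTS_PER_DAY = 5.5e6  # assumed issuance rate, chosen to give ~1.5GB/day for RSA

def storage_gb(days: int, pk_size: int) -> float:
    """Approximate storage in GB, counting public keys only."""
    return days * CERTS_PER_DAY * pk_size / 1e9

for days, label in [(1, "one day"), (21, "three weeks"), (365, "a year")]:
    print(f"RSA-2048, {label}: {storage_gb(days, RSA_2048_PK):.1f} GB; "
          f"Dilithium3, {label}: {storage_gb(days, DILITHIUM3_PK):.1f} GB")
```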

I want to make running this as easy as possible, so at the moment I'd lean towards a storage window of 4 weeks.

davidben commented 1 year ago

> For the common case, where the roots are side-loaded to the RP, the transparency service can also side-load an approximate time. The latest batch you got is an estimate of the time!

Yup! Though I think we should still ponder clock skew, because the RP's clock may drift after this update and then fail to take new updates.

Also I think a CA's recent history, even if the cert has since expired, is still relevant for monitoring. After all, transparency is inherently a post-facto thing: it takes time for a logged certificate to be observed by someone. Though, yeah, maybe we don't need a whole year's worth? I dunno. It's also really the TS's history that matters, provided the TS doesn't fall too far behind.

> If a transparency service is that much behind, then it can't perform its main function: allowing clients to authenticate.

Well, it has a few functions:

But you're right that a TS that's so far behind isn't doing anything. It's still meeting the primary invariant, but the system's effectively down at that point. :-)

And perhaps a bigger point: suppose the TS is down for a bit and loses some entries because they were past the CA's storage window. If zero TS instances ever saw a certificate, then while it's unfortunate to lose a view of CA misbehavior, we know that no RP ever accepted it at any point in time, so it's not a huge deal.

> (Transparency service is perhaps not the best name.)

Originally called it "update service" to kinda reflect what I expect to be the most common deployment pattern, but "transparency service" seemed a better generalization, especially in some of the other deployment models where it's a bunch of services together. But I'm not attached to the name.

> I want to make running this as easy as possible, so at the moment I'd lean towards a storage window of 4 weeks.

Yeah, I think you've convinced me that, at least for the CA, it's both useful and not particularly harmful to have a short storage window. Though the language around the TS and the HTTP interface overall would need to account for it. And we probably formally need to allow the TS to have holes in its history, for the "TS went offline and got more than 4 weeks behind" state.
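As a rough illustration of what "holes" might mean operationally, here's a hypothetical sketch (none of these names are from the draft) of a TS-side batch store where a batch the TS never managed to fetch stays permanently absent, and lookups report that state explicitly:

```python
from enum import Enum, auto

class BatchStatus(Enum):
    PRESENT = auto()  # batch stored and servable
    MISSING = auto()  # inside the storage window, but never ingested (a hole)
    EXPIRED = auto()  # past the storage window; absence is expected

class BatchStore:
    """Hypothetical TS-side store in which holes are legal. A batch the TS
    missed while offline past the CA's storage window can never be fetched,
    so it remains absent; lookups distinguish that from normal expiry."""

    def __init__(self, storage_window_batches: int):
        self.batches: dict[int, bytes] = {}
        self.latest = -1
        self.window = storage_window_batches

    def ingest(self, number: int, data: bytes) -> None:
        self.batches[number] = data
        self.latest = max(self.latest, number)

    def lookup(self, number: int) -> BatchStatus:
        if number in self.batches:
            return BatchStatus.PRESENT
        if number <= self.latest - self.window:
            return BatchStatus.EXPIRED
        return BatchStatus.MISSING
```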