aarongable / draft-acme-ari

Internet Draft for the Automated Certificate Management Environment (ACME) Renewal Information (ARI) Extension

Consider recommending clients log and audit "immediate renewal" scenarios as a best-practice #46

Open jvanasco opened 1 year ago

jvanasco commented 1 year ago

Consider a subsection under the "Getting Renewal Information" titled "Immediate Renewal Scenarios".

It should explain some situations in which a "renew now" payload is sent, and a security audit or server configuration audit may be necessary.

Possible text:

## Immediate Renewal Scenarios

If the `end` of the `suggestedWindow` is in the past, clients **SHOULD** attempt renewal immediately.

Instances like this **SHOULD** be logged and investigated.  If the ARI payload contains an `explanationUrl`, it should be logged and reviewed.

There are three main scenarios that will cause a Certificate to require immediate renewal:

* The Certificate has been revoked through the ACME API.  This is often an intended action of the Certificate's owner, but it can also be the work of a malicious actor who has obtained the credentials necessary for revocation.
* The Certificate has been revoked by the Certificate Authority in a mass revocation event.  This could be due to a minor violation of the CA/B Baseline Requirements or a significant security concern.
* The client has failed to refresh ARI information on the recommended basis. This could be due to the client or a scheduler being incorrectly configured.

The above list is not exhaustive; other scenarios can require immediate renewal.  It is recommended that all immediate renewal situations be logged and audited.  In situations where there is a security concern, such as an unintended revocation, subscribers should confirm that the revocation was an unintended consequence of their own actions and not the result of a security compromise.  Subscribers may decide to inspect Certificate Transparency logs for the affected domains.  In situations where there may be a misconfiguration of the client or a scheduler, subscribers should audit their integrations to ensure future Certificates are renewed in a timely manner.

mholt commented 2 months ago

> If the `end` of the `suggestedWindow` is in the past, clients **SHOULD** attempt renewal immediately.

Isn't that already implied in the spec? (Maybe it wasn't a year ago when this issue was posted.) The recommended algorithm is to renew if the selected date within the window is in the past, and if end is in the past, then necessarily any date in the window is in the past too.
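
As a rough illustration of that reasoning, here is a minimal Python sketch, assuming the client picks a uniformly random instant inside the window and renews once that instant is no longer in the future. The function names are hypothetical and are not taken from the draft or from any client.

```python
import random
from datetime import datetime, timedelta, timezone


def select_renewal_time(start: datetime, end: datetime, rng: random.Random) -> datetime:
    """Pick a uniformly random instant between start and end."""
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span_seconds))


def should_renew_now(start: datetime, end: datetime, now: datetime, rng: random.Random) -> bool:
    """Renew when the randomly selected instant is not in the future."""
    return select_renewal_time(start, end, rng) <= now


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
past_start = datetime(2024, 5, 1, tzinfo=timezone.utc)
past_end = datetime(2024, 5, 3, tzinfo=timezone.utc)

# Whatever instant is drawn, it cannot be later than past_end, which is before now,
# so every seed leads to an immediate renewal.
assert all(
    should_renew_now(past_start, past_end, now, random.Random(seed))
    for seed in range(100)
)
```

Under these assumptions no separate "end is in the past" branch is needed; the selected instant always falls at or before `end`.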

My reading of the current draft suggests no additional text is needed.

jvanasco commented 2 months ago

No change to the algorithm is suggested by this. That line could be updated to whatever the current "Immediate Renewal" (IR) language is.

The purpose of this Issue is to recommend logging and investigating all detected IRs, and to explain that an IR (as indicated by a window end time in the payload that is already in the past) is almost always caused by a server misconfiguration or a CA revocation.
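
A minimal Python sketch of what such client-side detection and logging could look like. It assumes the renewal information has already been fetched and decoded into a dict with a `suggestedWindow` object and an optional `explanationURL` string; the logger name, function names, and sample URL are hypothetical.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ari_client")  # hypothetical logger name


def parse_rfc3339(value: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_immediate_renewal(renewal_info: dict, now: datetime) -> bool:
    """Return True, and log a warning, when the suggested window has already ended."""
    window_end = parse_rfc3339(renewal_info["suggestedWindow"]["end"])
    if window_end >= now:
        return False
    logger.warning(
        "ARI suggestedWindow ended %s ago; renewing immediately (explanation: %s)",
        now - window_end,
        renewal_info.get("explanationURL", "none provided"),
    )
    return True


# Sample payload for illustration only; the URL is a placeholder.
sample = {
    "suggestedWindow": {"start": "2024-05-01T00:00:00Z", "end": "2024-05-03T00:00:00Z"},
    "explanationURL": "https://example.com/docs/ari",
}
overdue = is_immediate_renewal(sample, datetime(2024, 6, 1, tzinfo=timezone.utc))
```

Whether the warning should also trigger an alert is a policy choice left to the client; the sketch only records the event.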

AFAIK, the only times a Subscriber should expect an IR are:

Unless you're expecting an IR payload for those reasons, detecting one generally means that something, somewhere, has broken – so Clients should log and alert Subscribers to investigate.

IMHO, aside from those two situations, the most likely causes of an IR are going to be (in descending order):

petercooperjr commented 1 month ago

I don't know if you might just call this "Server misconfiguration", but it might be good to consider cases where the server isn't up 24/7. For home appliance-type servers (like a network-attached storage device or whatnot), if I have it turned off for a week when I go on vacation, and then when I turn it back on it turns out that it missed its ARI window and needs to renew immediately, that really isn't a "critical failure that needs to be investigated now" scenario.

I think the real indicator of something wonky requiring investigation might just be whether that explanationUrl is present. But there isn't much guidance to CAs on when to populate it, so I could imagine one CA always populating it with a link to their regular documentation about preferred renewal timelines, and another CA only populating it when initiated by a compliance incident. It might be worth having it be a "SHOULD" or "RECOMMENDED" or the like for that explanationUrl field to be populated if the certificate was revoked (or is about to be revoked), and maybe guidance that it shouldn't be populated for "normal" time windows.

jvanasco commented 1 month ago

A server with periodic connectivity is a specific case with a minority of users, and obviously/inherently not a misconfiguration. We can easily generate a long list of other specific usages and edge cases that will affect 2%-20% of users - that should not prevent advising the majority of Subscribers that a missed ARI window is likely something that should be quickly looked at.

petercooperjr commented 1 month ago

Sure; not really objecting. I'm just not sure how much of this "implementation guidance" should be in the RFC, vs. some other place. (And I mean that "not sure" sincerely; this may be the best place for it.) All I was trying to do is ensure that the server with intermittent connectivity was thought of somewhere in the process. I know it's not the common use case.

aarongable commented 1 month ago

I think this is largely why I'm not in favor of adding language to this effect. There are many reasons for "immediate renewal" scenarios:

Also:

> I could imagine one CA always populating [the explanationURL] with a link to their regular documentation about preferred renewal timelines

Let's Encrypt plans to do exactly this, so I'm not a fan of saying CAs should avoid populating it.

petercooperjr commented 1 month ago

Yeah, I think the thing that needs to be logged and investigated is if the certificate wasn't renewed before some percentage of its lifetime. Even if it renews really early because of some CA incident, there really isn't anything for the server owner to do in that case; everything went exactly as it should.

With regard to explanationURL, I guess I'm just not sure exactly what the client or administrator should really do with that information. I guess it could be helpful to be logged just in case the administrator is curious about why a certificate was renewed early, but again if the renewal is successful then I don't think there's any action they should really be taking. The scenario that might be more meaningful is when the suggested window is in the past and renewal fails. In that case the administrator needs to be alerted, because it may indicate a future problem (CA planned downtime or incident requiring revoking or whatnot) that they need to figure out how to work around (by ensuring that their system answering challenges is up, switching CAs, or whatever), and the explanationURL might help them understand the impact.
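
A minimal Python sketch of that "overdue and failed" case, assuming the client hands in its own `renew` and `alert` callables. Every name here is hypothetical, and the sample failure and URL are placeholders.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def renew_and_report(
    window_end: datetime,
    now: datetime,
    explanation_url: Optional[str],
    renew: Callable[[], None],
    alert: Callable[[str], None],
) -> bool:
    """Attempt renewal; alert only when it fails after the window has closed."""
    try:
        renew()
    except Exception as exc:  # broad on purpose: any failure counts in this sketch
        if window_end < now:
            alert(
                f"renewal failed after the ARI window closed at "
                f"{window_end.isoformat()}: {exc} "
                f"(explanationURL: {explanation_url or 'none'})"
            )
        return False
    return True


alerts = []


def failing_renew() -> None:
    # Stand-in for a real ACME order that could not be completed.
    raise RuntimeError("CA unavailable")


ok = renew_and_report(
    window_end=datetime(2024, 5, 3, tzinfo=timezone.utc),
    now=datetime(2024, 6, 1, tzinfo=timezone.utc),
    explanation_url="https://example.com/docs/ari",
    renew=failing_renew,
    alert=alerts.append,
)
```

A successful renewal, or a failure while the window is still open, produces no alert in this sketch; only the combination described above does.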

mholt commented 1 month ago

> Yeah, I think the thing that needs to be logged and investigated is if the certificate wasn't renewed before some percentage of its lifetime.

We do this in Caddy/CertMagic: if the certificate is in the last 1/20th or 1/50th of its lifetime (there are two code paths, one for ARI and one without), we emit a slightly louder log saying that we're renewing now (in the ARI code path, we specifically mention that we're ignoring ARI at that point).
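
To make the idea concrete, here is a minimal Python sketch of a "final fraction of lifetime" check. It is not CertMagic's code (which is written in Go); the function name and the choice of 1/20 in the example are illustrative assumptions only.

```python
from datetime import datetime, timedelta, timezone


def in_final_fraction(
    not_before: datetime, not_after: datetime, now: datetime, fraction: float
) -> bool:
    """True when `now` falls within the last `fraction` of the certificate lifetime."""
    lifetime = not_after - not_before
    return now >= not_after - lifetime * fraction


# Example: a 90-day certificate and a 1/20 threshold, i.e. the final 4.5 days.
not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

four_days_left = in_final_fraction(
    not_before, not_after, not_after - timedelta(days=4), 1 / 20
)
five_days_left = in_final_fraction(
    not_before, not_after, not_after - timedelta(days=5), 1 / 20
)
```

A client could use a check like this to decide when to log more loudly and renew regardless of what ARI suggests.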

> With regard to explanationURL, I guess I'm just not sure exactly what the client or administrator should really do with that information. I guess it could be helpful to be logged just in case the administrator is curious about why a certificate was renewed early, but again if the renewal is successful then I don't think there's any action they should really be taking. The scenario that might be more meaningful is when the suggested window is in the past and renewal fails. In that case the administrator needs to be alerted, because it may indicate a future problem (CA planned downtime or incident requiring revoking or whatnot) that they need to figure out how to work around (by ensuring that their system answering challenges is up, switching CAs, or whatever), and the explanationURL might help them understand the impact.

Yeah, we are just logging the explanationURL, and if there's a failure, you can go into the same logs that report the error, and find the explanationURL. :man_shrugging: