certbot / certbot

Certbot is EFF's tool to obtain certs from Let's Encrypt and (optionally) auto-enable HTTPS on your server. It can also act as a client for any other CA that uses the ACME protocol.
Other
31.66k stars 3.41k forks source link

Change renewal interval to better support certificates valid for less than 90 days #9261

Open alexzorin opened 2 years ago

alexzorin commented 2 years ago

In #9253, we want to add support for --certificate-validity (name is not final), which allows the user to request a different certificate lifetime from the CA. Currently, Google's ACME CA supports this.

I also noticed that https://smallstep.com/blog/private-acme-server/ has some advice about modifying the renewal parameters directly in order to reduce renew_before_expiry to 8 hours.

For the UX of this feature to work, we would need Certbot to do the right thing when encountering certificates that have a lifetime other than 90 days.

Let's Encrypt's integration guide suggests that clients renew certificates at 1/3rd of their lifetime. That sounds to me like a good idea whether the lifetime is 90, 9 or 3 days, but let's hear what everyone thinks. We could potentially make Certbot pick a different renew_before_expiry value when it creates/updates the lineage, since the certificate lifetime should be known at that point in the process.

We could also add a --renew-before-expiry flag (stale PR here: https://github.com/certbot/certbot/pull/7995). Though we should give it a better name, maybe --renewal-cutoff.

It would be nice to get some perspectives about the design of this improvement. renew_before_expiry is this weird commented-out item in the renewal parameters and I'm not sure what else might be hidden there.

Leaving this unlabelled so it shows up in triage.

bmw commented 2 years ago

renew_before_expiry is this weird commented-out item in the renewal parameters and I'm not sure what else might be hidden there.

The existence of renew_before_expiry probably predates my involvement on the project, however, I'm luckily not aware of any weirdness here and nothing stood out to me when grepping for it in our code.

I think it's a good idea that we have decent default logic here and not just rely on the user also remembering to set a flag like --renewal-cutoff. For simplicity, I think I'd prefer to not add a flag like that for now and we can add one if significant demand arises.

As for what the default logic should be, 1/3rd of their lifetime is a pretty good idea. I noticed that this is what https://smallstep.com/blog/private-acme-server/ recommends as well for their 24 hour certificates. My only concern about this is very short lifetimes when Certbot is configured to run certbot renew approximately every 12 hours through many of our distribution channels. Maybe 1/3rd of their lifetime with the lowest value being 24 hours? We could potentially make the lowest interval even shorter, but I'm being a little conservative because I want to make sure Certbot has a chance to renew the certificate at least once before the certificate expires and I don't immediately know what that value should be to accomplish that due to the randomization added to the renewal timer in our snaps, Debian & EPEL packages, and on Windows.

What do you think?

alexzorin commented 2 years ago

Thanks!

Maybe 1/3rd of their lifetime with the lowest value being 24 hours?

Yes, this was what I was thinking. What do you think about setting a max value as well?

The most common certificate lifetime for public CAs, other than 90 days, is 365 days.

My first impression is that 30 days is still the most suitable renewal cutoff for 365 day certificates. Renewing at 120 days (or even longer for private CAs) seems .. wasteful? On the other hand, min(30, max(24, lifetime / 3)) is a bit of a mouthful and I'm a little worried about confusing users.

For simplicity, I think I'd prefer to not add a flag like that for now and we can add one if significant demand arises.

This sounds good as well. Probably the only demand for this is going to be for private ACME CAs.

nothing stood out to me when grepping for it in our code.

A couple more questions come up for me now:

  1. Is this a breaking change, does it need to be deferred to 2.0?
  2. How are we going to handle the transition of existing lineages? i.e. for lineages that have the commented # renew_before_expiry = 30 days entry, what do we do?
    • We could adopt the new formula for these lineages, keep the comment around and overwrite its value with the new value?
bmw commented 2 years ago

On the other hand, min(30, max(24, lifetime / 3)) is a bit of a mouthful and I'm a little worried about confusing users.

It is a bit of a mouthful, but I don't have another idea I like better. How about you? I think we can come up with a half way decent sentence or two explaining it and further flesh out what we mean in our docs if we want.

  1. Is this a breaking change, does it need to be deferred to 2.0?

Technically I suppose it is a breaking change since it's a behavior change, but as long as we continue respecting uncommented renew_before_expiry values, it's hard for me to imagine this realistically causing (m)any problems so I personally think we can just go for it. Do you agree?

  1. How are we going to handle the transition of existing lineages? i.e. for lineages that have the commented # renew_before_expiry = 30 days entry, what do we do?
    • We could adopt the new formula for these lineages, keep the comment around and overwrite its value with the new value?

For me, I think that depends on a couple things:

  1. Are we going to use a different variable name for this feature or stick with renew_before_expiry? Are we able to do math on the lifetime of the certificate and write renew_before_expiry in a somewhat sane format with the way it currently works?
  2. Do we want to advertise that people can tinker with this setting in the renewal config file?

If the answer to all of that is yes, I think we should do what you said and keep the comment around but update its value. What are your thoughts?

alexzorin commented 2 years ago

I think we can come up with a half way decent sentence or two explaining

Maybe we can go with a slightly longer changelog entry:

it's hard for me to imagine this realistically causing (m)any problems so I personally think we can just go for it. Do you agree?

I am scared of some DANE folks potentially being affected, or somebody who was sed-ing the comment we have now, but nothing really common stands out to me. If you feel OK, I feel OK too.

Are we able to do math on the lifetime of the certificate and write renew_before_expiry in a somewhat sane format with the way it currently works?

There is something slightly yucky here when we combine this with #9253.

Basically, Google CA accepts a notAfter value to newOrder, but not a notBefore value. As such, the validity length of the certificate you receive is going to be smudged by however long it took to submit and complete the order.

So if you use this proposed feature to request a certificate for 60 days, the actual certificate validity length will be something like 60 days±30 seconds. A third of that is going to be some non-round (relative to calendar units) value, so we're going to end up with something gross like:

# renew_before_expiry = 1728002 seconds

🤮 . This may fundamentally be more of an issue with the other PR, rather than what we're talking about here, but I'd like to maybe get ahead of it and see whether we can design around this more elegantly.

osirisinferi commented 2 years ago

If you don't like that kind of numbers as values, you shouldn't be using that detailled granularity. By having the granularity in seconds, you're also accepting these kinds of values IMO :slightly_smiling_face: And with that, nothing yucky about it. :+1:

bmw commented 2 years ago

The longer changelog entry sounds good to me.

I am scared of some DANE folks potentially being affected, or somebody who was sed-ing the comment we have now, but nothing really common stands out to me. If you feel OK, I feel OK too.

For the sed-ing scenario, I personally don't feel too bad since https://eff-certbot.readthedocs.io/en/stable/compatibility.html doesn't list that file as having stable behavior. Is there a DANE scenario that you think will cause enough frustration/backlash to give you pause here? Mainly because our behavior with LE certs should be unaffected, I'm not too worried here but I certainly could be overlooking something.

A third of that is going to be some non-round (relative to calendar units) value, so we're going to end up with something gross like:

# renew_before_expiry = 1728002 seconds

Building off of Osiris' comment, what if we rounded the value to days?

Another idea would be to deprecate this key-value pair and add a new key-value pair with seconds (or some other unit) in the name of the parameter and store the value as an integer.

alexzorin commented 2 years ago

I quite like the idea of rounding to days. And making the flag's unit based on days too.

I'm not super keen about deprecating renew_before_expiry today, just based on workload. Unless people think there is a good reason in favor of doing it sooner rather than later.

DANE only concerned me about certificates obtained via ACME CAs which are not publicly trusted, since DANE enables that approach to trust. So they may have some non-90 day lifetime, and changing from fixed 30 days to 1/3rd lifetime, may have some consequences.

bmw commented 2 years ago

Ah I see. To be honest, that DANE case makes me a little nervous. Please let me know if you think this is overkill, but what if we essentially only changed the default lifetime if --certificate-validity is set? Since it's a new flag, that's not a breaking change.

If we wanted to apply the change more widely in the future, we could do something like calculate 1/3rd lifetime for all certificates and if the result is not 30 days, we print a warning saying our behavior will change in our next major release. We'd probably also then need to provide something like a --renew-cutoff(-days|-hours) flag to allow people to silence this warning before the release comes out.

alexzorin commented 2 years ago

I'm pretty happy with the plan here:

If we see an ACME CA come out with a different default lifetime, or if some other need arises, the second half of your comment sounds good too.

Going to take this out of triage.

bmw commented 2 years ago

Sounds great. Thanks Alex.

Just to avoid a potential bug, I wanted to flag that I think the math should be min(30, max(1, lifetime / 3)). I think I got us started down that path when I started talking about 24 hours instead of days.

github-actions[bot] commented 1 year ago

We've made a lot of changes to Certbot since this issue was opened. If you still have this issue with an up-to-date version of Certbot, can you please add a comment letting us know? This helps us to better see what issues are still affecting our users. If there is no activity in the next 30 days, this issue will be automatically closed.

github-actions[bot] commented 1 year ago

This issue has been closed due to lack of activity, but if you think it should be reopened, please open a new issue with a link to this one and we'll take a look.

bmw commented 11 months ago

For knowledge sharing purposes, the reason this is valuable is to help CAs offer/default to shorter lived certificates which they have interest in. Shorter lived certs are also better for security purposes because certificate revocation on the web is essentially broken.

If we do nothing here, we'll try to renew when the certificate has less than 30 valid days remaining which is very awkward for certificates with a less than 30 day lifetime.

ohemorange commented 9 months ago

Hello to this issue from 2024. Looking over the discussion here, while I agree with the behavior when a flag is set, I think the important thing to actually address here is what to do when a CA changes its default cert lifespan.

To that end, while I understand that we're worried about DANE and private CAs, I don't actually understand what the issue would be. Is it something like, someone is using certbot for automatic renewals but has a completely separate script to change their DANE settings somewhere? What is the workflow we're worried about here?

If we wanted to apply the change more widely in the future, we could do something like calculate 1/3rd lifetime for all certificates and if the result is not 30 days, we print a warning saying our behavior will change in our next major release.

I agree with this. While I think you could probably read our compatibility statement as using "behavior" to refer to file types and locations, I don't think waiting for a major release and warning before is a problem so no reason to not play it safe.

We'd probably also then need to provide something like a --renew-cutoff(-days|-hours) flag to allow people to silence this warning before the release comes out.

I don't love a flag that explicitly sets the cutoff, because things might change again. I'd prefer something like, --accept-default-renewal or --I-don't-care-just-renew-it-when-you-think-is-best (flag name needs workshopping but you see what I'm getting at here).

bmw commented 9 months ago

Is it something like, someone is using certbot for automatic renewals but has a completely separate script to change their DANE settings somewhere?

right. to be honest, my brain seems to have rightly or wrongly largely purged itself of the details of DANE, but i think the concern is if the renewal frequency changes, DNS records might not get updated in time so that clients will trust the new certificate. since certbot has no support for DANE, they'd need to be at least largely doing this themselves and if they were relying on certbot's current renewal interval, us changing it may break things.

this is an extreme edge case though that i think alex and i were mainly discussing because we were considering making the change outside of a major release. if we do it in a new major version and do our best to warn people ahead of time, i personally don't think we need to worry about this.

I don't love a flag that explicitly sets the cutoff, because things might change again. I'd prefer something like, --accept-default-renewal or --I-don't-care-just-renew-it-when-you-think-is-best (flag name needs workshopping but you see what I'm getting at here).

i agree. good idea.

ohemorange commented 9 months ago

since certbot has no support for DANE, they'd need to be at least largely doing this themselves and if they were relying on certbot's current renewal interval, us changing it may break things.

I would hope that by now people are using hooks for this sort of thing. Especially because our renewal interval will retry if it fails and so it won't necessarily be on the exact cyclic schedule. So I think that even out of the edge cases where someone might be doing those sort of updates, it will be an even narrower subset that might be affected. But I agree that either way if we do it in a major release it's definitely for sure not going to be a problem.

this is an extreme edge case though that i think alex and i were mainly discussing because we were considering making the change outside of a major release. if we do it in a new major version and do our best to warn people ahead of time, i personally don't think we need to worry about this.

I had missed that context about this being done outside of a major release, so I'm particularly glad you pointed that out.

bmw commented 9 months ago

ah yes, that's my read of the above at least. alex brought up DANE in response to my response to the question "Is this a breaking change, does it need to be deferred to 2.0?"

osirisinferi commented 4 weeks ago

Please see #10037: this issue has becomre quite relevant nowadays apparently.

zoracon commented 1 week ago

As an update. There is a renewed timeline for this and I will share the email we got from Let's Encrypt with the team so we can discuss this work.