Conveying renewal urgency on a per-account basis

djc commented 6 months ago

I am the CTO of Instant Labs, where we host sites on roughly 1500 domains for our customers. I wrote the instant-acme client for our use cases, so I have followed the developments around ARI with some interest. Personally, I feel that making one request per certificate is quite wasteful, and like @mholt in their recent email thread (https://mailarchive.ietf.org/arch/msg/acme/AeJ3zJKcBF-ZUhQXJajC0bb7orI/), I wonder whether there is a better way to do this.

I briefly discussed with Jacob HA whether it would be feasible to have a per-account endpoint that would just yield all the certificates that need updating in the next time window, but he mentioned that this would lead to highly variable response times which would be inconvenient for servers.

Which leads me to this proposal: would it be feasible to have an endpoint that just identifies the first certificate which needs renewal along with the date/time on which it should renew? This would allow clients to make a request to this endpoint and immediately go back to sleep until the given date/time.

(Our current approach is to store the certificate expiry explicitly in our database alongside the certificate. We wake up a worker process every few hours to query all certificates that expire in the next 30 days and start the renewal process. Under my proposal, our worker process would check the first-renewal endpoint when it is first started, then sleep until the window arrives. Once the window has arrived, it would do the renewal and check the first-renewal endpoint again, and so on.)
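For illustration, the two approaches could be sketched roughly as below. This is a minimal sketch, not instant-acme code: the `certs` mapping, `due_for_renewal`, and `next_wakeup` are hypothetical names, and the "first-renewal" timestamp stands in for whatever the proposed endpoint would return.

```python
from datetime import datetime, timedelta, timezone

RENEW_BEFORE = timedelta(days=30)  # renewal horizon described above


def due_for_renewal(certs, now):
    """Current approach: IDs of certificates expiring within 30 days.

    `certs` maps a certificate ID to its stored expiry (hypothetical schema).
    """
    cutoff = now + RENEW_BEFORE
    return sorted(cert_id for cert_id, expiry in certs.items() if expiry <= cutoff)


def next_wakeup(first_renewal_at, now):
    """Proposed approach: seconds to sleep until the server-provided time.

    `first_renewal_at` is the (assumed) timestamp from the first-renewal endpoint.
    """
    return max(0.0, (first_renewal_at - now).total_seconds())


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = {
    "a.example": now + timedelta(days=10),
    "b.example": now + timedelta(days=45),
}
print(due_for_renewal(certs, now))
print(next_wakeup(now + timedelta(hours=6), now))
```

The first function needs the full expiry table on every wake-up; the second needs only a single timestamp per account.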

mholt commented 6 months ago

I appreciate making things easier for clients.

If I understand correctly, you're asking whether the endpoint can simply tell the account owner when their next certificate renewal is due. Then, after that cert is renewed, when is the next one? And so on.

One question I have is, What if the renewal date gets bumped earlier than the date from the first check?

djc commented 6 months ago

> One question I have is, What if the renewal date gets bumped earlier than the date from the first check?

Fair point. I guess maybe the server would want to provide some Retry-After-like value that is independent of the next certificate? This still seems like an improvement because it gives the server more control over how often clients check in. Either way, it reduces the overhead from per-outstanding-certificate to per-account.

mholt commented 6 months ago

There is something nice about the idea of just polling an endpoint per-account, though.

Client: Poll. ("Anything to renew?")
Server: Nope.
(repeat N times...)
Client: Poll.
Server: Renew these certs!
Client: does so now
Client: Poll.
Server: And now renew this one!
Client: does it
Client: Poll.
Server: Nothing to renew right now.
...
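The exchange above could be sketched as a small loop. This is only an illustration: no per-account endpoint exists in the draft, so the scripted `responses` list stands in for the server, and `run_poll_loop` is a hypothetical name.

```python
from collections import deque


def run_poll_loop(responses, renew):
    """Poll a per-account endpoint until the scripted responses run out.

    `responses` stands in for the server: each item is the list of
    certificate IDs to renew now (an empty list means "nothing to renew").
    """
    polls = 0
    queue = deque(responses)
    while queue:
        to_renew = queue.popleft()  # Client: Poll.
        polls += 1
        for cert_id in to_renew:    # Server: Renew these certs!
            renew(cert_id)
    return polls


renewed = []
polls = run_poll_loop([[], [], ["a", "b"], ["c"], []], renewed.append)
print(polls, renewed)
```

A real client would sleep between polls; the sketch leaves that out to keep the request count easy to see.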

It does feel awkward to split scheduling between client and server, it feels like the server should just take charge of scheduling renewals instead of clients having to mix their own scheduling with a server's scheduling.

One problem with this could be that a properly positioned attacker (or a network outage) could simply drop packets or connections to the ARI server. Then again, that attacker could drop packets for the actual renewal transaction as well, so maybe it is not a huge concern.

Either way, ARI is going to involve lots of polling.

djc commented 6 months ago

> It does feel awkward to split scheduling between client and server, it feels like the server should just take charge of scheduling renewals instead of clients having to mix their own scheduling with a server's scheduling.

Sure, but in that case we don't solve the issue of the midnight thundering herd, I presume? (Potentially it alleviates the issue if the first-renewal endpoint is cheaper to execute?)

mholt commented 6 months ago

Well and part of me has wondered -- and I think I mentioned this in that email thread earlier -- if there'd be some benefit to making ARI part of the normal ACME order workflow. Moving "replaces" into the Order object in draft-03 has already started that. I could imagine doing without a separate ARI endpoint, and instead the client just starts the renewal and sets the "replaces" field on the Order. If the server sees the "replaces" field populated but the ARI logic internally says it's not time yet, the server can simply respond with a graceful Retry-After (or something) and reject the new order. In other words, setting "replaces" tells the server that the client is honoring ARI and is just seeing if it's time to renew; if so, please renew, if not, we'll try again later.

I don't remember if I was convinced that a separate endpoint is really needed. I can't recall a reason why it can't just be wrapped up in the existing ACME flow.
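To make the "replaces" idea above concrete, here is a minimal server-side sketch. It is an assumption-laden illustration, not draft text: `handle_new_order` is a hypothetical name, and the choice of a 503 status with a `Retry-After` header is one possible way to say "not yet" (the comment above only says "Retry-After (or something)").

```python
from datetime import datetime, timedelta, timezone


def handle_new_order(order, window_start, now):
    """Decide how to answer a new-order request that may carry "replaces".

    Returns a (status, headers) pair. The rejection status and the use of
    Retry-After are illustrative assumptions.
    """
    if "replaces" not in order or now >= window_start:
        return 201, {}  # accept: not an ARI check, or it is time to renew
    wait = int((window_start - now).total_seconds())
    return 503, {"Retry-After": str(wait)}  # too early: try again later


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
start = now + timedelta(hours=1)
print(handle_new_order({"replaces": "cert-1"}, start, now))
print(handle_new_order({"replaces": "cert-1"}, start, start))
```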

aarongable commented 6 months ago

(This is a fascinating discussion, just noting that I'm on vacation and will reply mid-next-week.)

aarongable commented 6 months ago

> Which leads me to this proposal: would it be feasible to have an endpoint that just identifies the first certificate which needs renewal along with the date/time on which it should renew? This would allow clients to make a request to this endpoint and immediately go back to sleep until the given date/time.

In my opinion, this defeats one of the purposes of ARI: that the window may shift at any time, in response to load spikes, predicted future load, revocation events, or otherwise. Simply sleeping until the indicated ARI time, without continuing to poll in the meantime, may cause clients to miss updates to the ARI suggested window.

As Matt suggested, the next obvious development from here is "what if we polled once per account instead of once per certificate?". I see two possible return values for this per-account polled endpoint:

1. It returns the single next certificate to renew. During a mass revocation event, a client which only polls the per-account endpoint every X minutes could easily fall behind, and a client which re-polls the per-account endpoint after each replacement would end up making just as many queries as one which is polling on a per-certificate basis.
2. It returns the whole collection of certificates which should be renewed right now. This results in all the same highly-variable response times, highly-variable response sizes, and paging issues as we've discussed before.

As may be obvious, I don't love either of these options.

> In other words, setting "replaces" tells the server that the client is honoring ARI and is just seeing if it's time to renew; if so, please renew, if not, we'll try again later.

This brings no benefit. It only reduces the number of requests across the lifetime of a certificate by 1: the final "should I renew? yes. okay here's the new order" request pair becomes a single "here's a new order" which gets accepted instead of rejected. But all of the preceding requests become much heavier-weight requests containing significantly more data (and JWSes!), and all of the preceding replies become much more confusing error responses with limited data encoded in headers instead of meaningful data encoded in a dedicated JSON object.

Also, it prevents third-party monitors from ever being able to make an ARI request, which was one of the earliest design goals advocated for in early feedback.

Right now, I still don't see a clean and reasonable way to do bulk ARI. It's just not in line with how the rest of the ACME protocol thinks about orders and certificates.

djc commented 6 months ago

> As Matt suggested, the next obvious development from here is "what if we polled once per account instead of once per certificate?". I see two possible return values for this per-account polled endpoint:
>
> It returns the single next certificate to renew. During a mass revocation event, a client which only polls the per-account endpoint every X minutes could easily fall behind, and a client which re-polls the per-account endpoint after each replacement would end up making just as many queries as one which is polling on a per-certificate basis.
>
> It returns the whole collection of certificates which should be renewed right now. This results in all the same highly-variable response times, highly-variable response sizes, and paging issues as we've discussed before.

I think I was suggesting an enumerated response, which can either have

checkBackIn: next window when the client should check in, or
renewCertificate: a certificate ID that requires renewal, and the client should check in again after renewal

This would mean that "idle" clients (for which there are no upcoming renewals) can be given a longer window and only have to send one request per account (instead of one request per certificate), while "busy" clients (for which there are upcoming renewals) will hit the endpoint once after each renewal to find the next renewal.
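As a rough sketch of that enumerated response, assuming the two variants are named after the fields above (the class names, `next_action`, and the `renew` callback are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CheckBackIn:
    at: datetime  # next time the client should check in


@dataclass
class RenewCertificate:
    cert_id: str  # certificate that requires renewal now


def next_action(response, renew):
    """Handle one response; return when to poll next (None = immediately)."""
    if isinstance(response, RenewCertificate):
        renew(response.cert_id)
        return None  # "busy" client: check in again right after renewal
    return response.at  # "idle" client: sleep until the given time


renewed = []
later = datetime(2024, 1, 2, tzinfo=timezone.utc)
print(next_action(RenewCertificate("cert-1"), renewed.append), renewed)
print(next_action(CheckBackIn(later), renewed.append))
```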

I guess a downside might be that very busy accounts (for which there is an upcoming renewal a majority of the time) are sort of "rate-limited"? Not sure if that would be perceived as a good thing or a bad thing?

aarongable commented 2 months ago

Apologies for not replying to this last comment earlier.

> I think I was suggesting an enumerated response, which can either have
>
> checkBackIn: next window when the client should check in, or
>
> renewCertificate: a certificate ID that requires renewal, and the client should check in again after renewal

I think this defeats too much of the purpose of ARI, and places too much onus on the ACME server. The point is to give clients actionable information ahead of time, so that the clients can make intelligent decisions. If the only response a client gets is "oops, now it's time to renew this one!" it will be stuck in a constant tight loop, never making any headway. A client which queries ARI on a per-certificate basis and discovers that 100 certs all should be renewed in the same window can distribute those tasks to 100 different servers.
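The last point can be illustrated with a small sketch: once a client knows from per-certificate ARI queries which certificates fall in the same window, it can plan the work up front. The round-robin policy and the `assign_to_workers` name are assumptions chosen for illustration, not something the draft prescribes.

```python
def assign_to_workers(cert_ids, workers):
    """Spread renewals round-robin over the available workers."""
    plan = {worker: [] for worker in workers}
    for i, cert_id in enumerate(cert_ids):
        plan[workers[i % len(workers)]].append(cert_id)
    return plan


certs = [f"cert-{n}" for n in range(100)]
workers = [f"worker-{n}" for n in range(100)]
plan = assign_to_workers(certs, workers)
print(max(len(batch) for batch in plan.values()))
```

With a one-at-a-time endpoint, the client never sees the full list and so cannot build such a plan in advance.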

While this proposal does resolve some of the variable-response-size concerns of other per-account polling proposals, I don't think it actually solves any real problems and certainly creates other problems.

aarongable / draft-acme-ari

Conveying renewal urgency on a per-account basis #65