Closed rhurling closed 1 year ago
Thanks for bringing this up. In 2.7 we slightly adjusted the on-demand restrictions ('ask' and the rate limit) so that they are applied before checking storage for an existing cert. This helps prevent stampedes on storage and limits the load (and costs!) on storage backends.
However, now that 'ask' is required, I think we should deprecate the rate limit because the 'ask' endpoint can implement its own throttling if desired.
(Is there any good reason to keep the rate limit?)
I just updated the release notes and docs (they will go out soon); I overlooked this before the release. (Sorry.)
Now that 'ask' (and the rate limit check) is performed before accessing storage, there is less storage I/O, but rate limits may appear over-used: previously they only governed how many certs would be obtained from the CA, whereas now they protect storage as well.
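The reordering being described can be sketched in toy form. This is illustrative Go only, not Caddy's actual code; the function and variable names are invented. With a deny-everything 'ask' and a cert already in storage, the 2.6-style order serves the cert while the 2.7-style order denies the handshake, which matches the behavior reported in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

var errDenied = errors.New("decision func: not allowed")

// inStorage models a cert that already exists in the storage backend.
var inStorage = map[string]bool{"www.example.com": true}

// ask models a permission endpoint that denies everything.
func ask(name string) error { return errDenied }

// v26: storage first; a stored cert is served even if 'ask' would deny it.
func v26(name string) error {
	if inStorage[name] {
		return nil // cache/storage hit, 'ask' never consulted
	}
	if err := ask(name); err != nil {
		return err
	}
	return nil // would obtain from CA
}

// v27: 'ask' (and the rate limit) run before storage is ever read.
func v27(name string) error {
	if err := ask(name); err != nil {
		return err
	}
	if inStorage[name] {
		return nil
	}
	return nil // would obtain from CA
}

func main() {
	fmt.Println("2.6:", v26("www.example.com"))
	fmt.Println("2.7:", v27("www.example.com"))
}
```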
I'd recommend removing the throttle, since it will likely be dropped in a future version, or at least raising the limits.
Ah, okay... that basically covers my problem and I'll remove the rate limit. The added release notes seem to cover this issue so I'll close this.
🤔 I can see the benefits and the downsides of this approach; it could cause a stampede on the application's 'ask' endpoint IMO.
For example, we run inside Kubernetes, so when a deployment is restarted it loses its in-memory cache. The benefit here isn't felt for us at all, and this could actually put more load on our backend than a quick Redis storage lookup would.
Is there a way we could make this configurable?
```
AskForStorage: false
```
@robgordon89 Interesting, that's unfortunate that Kubernetes does that instead of using established graceful reload mechanisms to keep things lightweight.
Making it configurable, i.e. putting the 'ask' endpoint between checking storage and querying the CA may be a little involved.
Let's get your company on a Business+ sponsorship and I'll be happy to work on a solution for you right away. Reach out to me if you have any questions: matt -at- dyanim dot com.
Hi, I'm not 100% sure if this was intended, but I think something in the change from 2.6.2 to 2.7.3 caused issues while I was upgrading our production instance to 2.7.3.
On upgrading to 2.7.3, certs for many customer sites failed to load with an ERR_SSL_PROTOCOL_ERROR and the following debug error in the Caddy logs:
```
http: TLS handshake error from X.X.X.X:X: certificate is not allowed for server name www.example.com: decision func: on-demand rate limit exceeded
```
When I downgraded from 2.7.3 to 2.6.2, all domains could be served again, which also didn't seem to make sense. I managed to fix it for our production instance by removing the interval and burst parameters from the global on_demand_tls option.
That seemed weird to me, because according to the docs this rate limiting applies to certificate operations directly, not to the callback itself (even if only to check whether the domain is allowed)?
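For reference, the relevant global options look roughly like this (the `ask` URL is a placeholder, and the exact values are illustrative); the fix described above was deleting the `interval` and `burst` lines:

```Caddyfile
{
	on_demand_tls {
		# permission endpoint; Caddy issues a cert only if this returns 200
		ask https://auth.internal.example/check

		# removing these two lines avoided the handshake errors on 2.7.3
		interval 2m
		burst    5
	}
}
```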
I think in 2.6.2 Caddy first checked storage and loaded the cert into the internal cache if it was found, regardless of whether the callback allowed it, whereas in 2.7.3 Caddy seems to first check whether the domain is allowed and only then look whether the cert is in storage?
Weirdly, this issue stopped occurring after I re-added interval and burst a day after we upgraded to 2.7.3 and reloaded the config, presumably because the cache had been filled by then and Caddy now reuses the cached certs.
I'm unsure whether this issue will now crop up on caddy/server restarts, since I don't have a test setup for this.
Our Caddyfile:
We build caddy via xcaddy:
```
xcaddy build --with github.com/gamalan/caddy-tlsredis --with github.com/caddy-dns/hetzner --with github.com/caddyserver/replace-response
```
I'm unsure whether I should have first reported my issue on the Caddy community forum (caddy.community), so I'll repost it there if that works better.