caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Upgrading from 2.6.2 to 2.7.3 seems to require removal of on_demand rate limiting until certs are cached #5724

Closed rhurling closed 1 year ago

rhurling commented 1 year ago

Hi, I'm not 100% sure if this was intended, but I think something changed between 2.6.2 and 2.7.3 that caused issues while I was upgrading our production instance to 2.7.3.

On upgrading to 2.7.3, many customer sites failed to load with an ERR_SSL_PROTOCOL_ERROR in the browser and the following debug error in the Caddy logs:

http: TLS handshake error from X.X.X.X:X: certificate is not allowed for server name www.example.com: decision func: on-demand rate limit exceeded

When I downgraded from 2.7.3 back to 2.6.2, all domains could be served again, which also didn't seem to make sense.

I managed to fix it for our production instance by removing the interval and burst parameters from the global on_demand_tls option.
That seemed weird to me, because according to the docs this rate limiting applies to certificate actions directly, not to the callback itself (even if the callback is only used to check whether the domain is allowed).

I think in 2.6.2 Caddy first checked storage and loaded the cert into the internal cache if it was found, regardless of whether the callback allowed it or not, but in 2.7.3 Caddy seems to first check whether the domain is allowed and only then look at whether the cert is in storage.
Weirdly, this issue stopped mattering after I re-added interval and burst a day after we upgraded to 2.7.3 and reloaded the config, presumably because the internal cache had been filled by then and Caddy now reuses the cached certs.

I'm unsure whether this issue will now crop up on caddy/server restarts, since I don't have a test setup for this.

Our Caddyfile:

{
        email example@example.com

        grace_period 15s

        storage redis {
                host "127.0.0.1"
                port 6379
                db 1
                key_prefix "caddytls"
                value_prefix "caddy-storage-redis"
                timeout 5
                aes_key "XXXXXXXXX"
        }

        on_demand_tls {
                ask http://localhost:8001/allow-on-demand-tls.php
                interval 3m
                burst 5
        }

        log {
                format filter {
                        wrap console
                        fields {
                                request>remote_addr ip_mask {
                                        ipv4 24
                                        ipv6 58
                                }
                        }
                }
        }
}

https:// {
        tls {
                on_demand
        }

        header -Server

        reverse_proxy http://localhost:8000 {
                header_down -Server
                header_up X-Forwarded-Port "443"
                header_up X-Forwarded-Path {http.request.orig_uri.path}

                trusted_proxies X.X.X.X/32
        }
}

We build Caddy via xcaddy:

xcaddy build --with github.com/gamalan/caddy-tlsredis --with github.com/caddy-dns/hetzner --with github.com/caddyserver/replace-response

I'm unsure whether I should have first reported this on caddy.community, so I'll repost my issue there if that works better.

mholt commented 1 year ago

Thanks for bringing this up. In 2.7 we did slightly adjust the on-demand restrictions (ask and rate limit) to be applied before hitting storage to see if a cert is already there. This helps prevent stampedes to storage and limits the load (and costs!) to storage backends.
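The reordering described above can be sketched as follows. This is an illustrative stub program, not Caddy's real internals: getCertificate, askAllowed, rateLimitOK, and loadFromStorage are all made-up names that only exist to make the 2.7 check order visible.

```go
package main

import (
	"errors"
	"fmt"
)

// Each stub records which check ran, so the order is visible.
var steps []string

func askAllowed(domain string) bool {
	steps = append(steps, "ask")
	return true // pretend the ask endpoint approves the domain
}

func rateLimitOK() bool {
	steps = append(steps, "ratelimit")
	return false // pretend the interval/burst budget is exhausted
}

func loadFromStorage(domain string) (string, bool) {
	steps = append(steps, "storage")
	return "cached-cert", true
}

// getCertificate mimics the 2.7 order: ask, then rate limit, then storage.
func getCertificate(domain string) (string, error) {
	if !askAllowed(domain) {
		return "", errors.New("domain not allowed")
	}
	if !rateLimitOK() {
		// This is the failure the reporter saw: the handshake is
		// rejected even though the cert already sits in storage,
		// because the storage lookup below is never reached.
		return "", errors.New("on-demand rate limit exceeded")
	}
	if cert, ok := loadFromStorage(domain); ok {
		return cert, nil
	}
	return "", errors.New("would fall through to obtaining from the CA")
}

func main() {
	_, err := getCertificate("www.example.com")
	fmt.Println(err)   // on-demand rate limit exceeded
	fmt.Println(steps) // [ask ratelimit] -- storage was never consulted
}
```

In the 2.6.2 ordering the storage lookup would have run first, so a cert already in storage was served regardless of the rate limit, which matches the downgrade behavior reported above.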

However, now that 'ask' is required, I think we should deprecate the rate limit because the 'ask' endpoint can implement its own throttling if desired.

(Is there any good reason to keep the rate limit?)

I just updated the release notes and docs (they will go out soon); I overlooked this before the release. (Sorry.)

Now that 'ask' (and the rate limit check) is performed before accessing storage, there is less storage I/O, but rate limits may appear over-used: previously they only governed how many certs were obtained from a CA, whereas now they protect storage as well.

I'd recommend removing the throttle since it'll likely be removed in a future version, or at least raise the limits.
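Concretely, the recommended change is just dropping the two rate-limit lines from the global on_demand_tls block of the Caddyfile above:

```Caddyfile
on_demand_tls {
        ask http://localhost:8001/allow-on-demand-tls.php
}
```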

rhurling commented 1 year ago

Ah, okay, that basically covers my problem, and I'll remove the rate limit. The added release notes seem to cover this issue, so I'll close it.

robgordon89 commented 1 year ago

🤔 I can see the benefits and the downsides of this approach; IMO it could cause a stampede on the application's ask endpoint.

For example, we run inside Kubernetes, so when a deployment is restarted it loses its in-memory cache. For us, the benefits here aren't even felt, and this could actually cause more load on our backend than a quick Redis storage lookup.

Is there a way we could make this configurable?

AskForStorage: false

mholt commented 1 year ago

@robgordon89 Interesting; it's unfortunate that Kubernetes does that instead of using established graceful-reload mechanisms to keep things lightweight.

Making it configurable, i.e. putting the 'ask' endpoint between checking storage and querying the CA, may be a little involved.

Let's get your company on a Business+ sponsorship and I'll be happy to work on a solution for you right away. Reach out to me if you have any questions: matt -at- dyanim dot com.