Open justSem opened 2 years ago
Hmm, how exactly do you issue TLS certificates? Do you add a TXT record for the ACME challenge? If so, cert-manager should handle this problem by itself and I would argue that this issue should rather be redirected to cert-manager than Switchboard.
In my experience, cert-manager copes quite well with DNS propagation taking a few minutes, but I have only used it on Google Cloud.
We handle ACME certs with standard TLS challenges because we're used to them being speedier than waiting for a TXT record (since the default certbot scripts from the monolithic days used a 100s wait time).
I could try a DNS-based issuer to see if that helps. I'll get back to you on that - but of course the problem with the TLS challenges still exists in that case.
To follow up: using the DNS-01 solver indeed solves our issues. However, as stated before, the behavior still persists in any situation where the HTTP-01 solver has to be used.
Thanks for the follow-up @justSem! I get the issue now, but I'm unsure whether Switchboard is the right place to solve it.
One thing that would be possible for cert-manager is to bypass DNS caches by querying your authoritative server directly. In fact, I found an open issue that attempts to tackle the problem you describe if I'm not mistaken (see https://github.com/cert-manager/cert-manager/issues/4246).
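To illustrate what "querying your authoritative server directly" could mean in practice, here is a minimal Go sketch using only the standard library. It is not how cert-manager or Switchboard does it; the function names (`authoritativeResolver`, `nameserverAddr`) and the `example.com` names are made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// authoritativeResolver returns a resolver that sends every DNS query to
// nsAddr (host:port) instead of the servers from the system configuration.
func authoritativeResolver(nsAddr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // Dial is only used by Go's built-in resolver
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, nsAddr)
		},
	}
}

// nameserverAddr turns an NS host such as "ns1.example.com." into
// "ns1.example.com:53".
func nameserverAddr(nsHost string) string {
	return net.JoinHostPort(strings.TrimSuffix(nsHost, "."), "53")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Hypothetical zone and hostname. The trailing dot on the hostname
	// keeps resolv.conf search domains from being appended.
	zone, host := "example.com", "app.example.com."

	// Finding the nameservers still goes through the normal resolver.
	nss, err := net.DefaultResolver.LookupNS(ctx, zone)
	if err != nil || len(nss) == 0 {
		fmt.Println("could not find nameservers:", err)
		return
	}

	// The actual record lookup bypasses any caching resolver.
	r := authoritativeResolver(nameserverAddr(nss[0].Host))
	addrs, err := r.LookupHost(ctx, host)
	fmt.Println(addrs, err)
}
```

Only the final lookup skips the cache here, which is the part that matters for a freshly created record.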
I'm a bit reluctant to put this into Switchboard as delaying interactions with enabled integrations adds quite some complexity.
While testing this in one of our environments we've run into an issue regarding DNS propagation.
In this case we're running a K8S cluster on DigitalOcean, which also manages the domain. We haven't changed any DNS-related configuration from DO-Defaults.
The behavior we're observing is that cert-manager tries to request TLS certificates before DigitalOcean has processed the DNS changes - resulting in cert-manager receiving NXDomain responses until resolver caches have been cleared.
This increases the "wait period" for the entire thing to go through from <60s to 3+ hours.
Unfortunately none of our developers have done anything with Go, so manually implementing our changes would be time-consuming if we want to do it properly (so it's actually production-worthy).
We feel like it'd help to either:
Obviously the first one is the easiest to implement, and would be more than sufficient for most use cases.