Open xmassx opened 5 years ago
This has been discussed before and we've avoided allowing it as we need some way to ensure that the challenge has propagated.
For DNS01, options like --dns01-recursive-nameservers and --dns01-recursive-nameservers-only help users with DNS-restricted environments that use DNS01.
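For reference, those two flags are real cert-manager controller flags; below is a hypothetical excerpt of the cert-manager Deployment showing where they would go (the resolver address 10.0.0.53 is made up):

```yaml
# Hypothetical Deployment excerpt; only the container args are shown.
# 10.0.0.53:53 stands in for whatever internal resolver your network allows.
spec:
  containers:
    - name: cert-manager
      args:
        - --dns01-recursive-nameservers=10.0.0.53:53
        - --dns01-recursive-nameservers-only
```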
I wonder if we can provide some other means to let you complete the self-check without disabling it altogether, i.e. by overriding the server that we query for challenges?
/priority awaiting-more-evidence /help
@munnerz: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
Yeah, a custom server for queries definitely makes sense, and you did a great job with the dns01 flags, but I think this gets more confusing for http01. For me it would be fine to add a flag that looks like --http01-external-address=10.0.0.10:80 or something like that, where the user can set an alternate service that proxies requests to the cluster's public IP.
But in that scenario the user could also point it at a locally created Service with Endpoints configured to reach the challenges directly, which is de facto the same as disabling the check entirely.
That behaviour is perfectly fine by me, though.
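To illustrate the "locally created service with configured endpoints" idea from the comment above, here is a hypothetical sketch (every name and address is invented, and as the comment notes, steering the self-check this way effectively disables it):

```yaml
# Hypothetical: a Service with manually managed Endpoints pointing at the
# address the self-check should hit, instead of the unreachable public IP.
apiVersion: v1
kind: Service
metadata:
  name: challenge-proxy   # made-up name
spec:
  ports:
    - port: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: challenge-proxy   # must match the Service name
subsets:
  - addresses:
      - ip: 10.0.0.10     # example address from the comment above
    ports:
      - port: 80
```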
I'd like to be able to disable the self-check too: we have a k8s cluster with different inbound gateways and NAT, and we can't hairpin the external DNS name to the correct internal IP in every scenario for every domain name, so internal checks against the external IP time out while requests from certbot's servers can read the challenge without problems.
We are also having this issue. While the http01 self-check can be bypassed with hairpin NAT or an external split-horizon DNS, in some cases this is a real pain (such as bootstrapping a system; see the Rancher 2.0 HA install, where you need an external LB).
I have the same issue: I cannot use cert-manager because of the self-check tests. My router does not support hairpinning.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
This would be a great feature. We are experiencing self-check failures due to our DNS policy of only allowing internal DNS servers for internal lookups; the self-check is the only thing preventing the challenge from completing.
/remove-lifecycle rotten
+1
+1
+1
Given we now better handle backing off when an Order fails, I think we could consider adding this as an option on the ACME solver.
Logically, it seems it'd make sense to make this an option that applies to both DNS01 and HTTP01 solvers.
If someone wants to give implementing this a go, please drop a comment here first so we can firm up the design details 😄
/cc @JoshVanL
/area acme /area api
In our case, the node running the Nginx Ingress controller is somehow able to reach the HTTP01 endpoint.
So we used podAffinity to schedule cert-manager on the same node, which solved the issue.
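A minimal sketch of that podAffinity workaround, assuming the ingress controller pods run in an `ingress-nginx` namespace with the label `app.kubernetes.io/name: ingress-nginx` (adjust both to your setup):

```yaml
# Hypothetical excerpt of the cert-manager Deployment pod template:
# co-schedule cert-manager onto whatever node runs the ingress controller.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
          # podAffinity matches the pod's own namespace unless listed here;
          # cert-manager and the ingress usually live in different namespaces.
          namespaces:
            - ingress-nginx
          topologyKey: kubernetes.io/hostname
```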
Any progress on this? We are running Kubernetes on a cloud provider that does not support hairpinning. Without this feature we can't deploy cert-manager successfully.
The root problem is in Kubernetes networking when you use a LoadBalancer provided by the hosting provider. I use DigitalOcean. Kubernetes does not route traffic through the LB's public interface, so the PROXY protocol header (or SSL, if you terminate it outside Kubernetes) is never added. I use PROXY protocol, and the moment I enable it and update Nginx to handle it, everything works except cert-manager, which fails because it tries to connect to the public domain name from inside the cluster. It works from my computer, since I am outside and the LB adds the needed headers, but not from within the cluster.
cert-manager is not to blame for this, but if we could add a switch to instruct the validator to add the PROXY protocol header, instead of disabling validation for that domain, it would help some of us a lot.
For curl, if I do (from inside the cluster):

```shell
curl -I https://myhost.domain.com
```

it fails. If I do (from inside the cluster):

```shell
curl -I https://myhost.domain.com --haproxy-protocol
```

it works.
Check this: https://github.com/jetstack/cert-manager/issues/863
I was informed by the DigitalOcean team that there is a fix for this behavior. They added an annotation to the nginx-ingress controller Service that makes Kubernetes use the load balancer's hostname instead of its public IP, which tricks Kubernetes into treating it as external and routing traffic out through the LB.
This is it (I just added this one annotation): https://github.com/digitalocean/digitalocean-cloud-controller-manager/blob/master/docs/controllers/services/examples/README.md#accessing-pods-over-a-managed-load-balancer-from-inside-the-cluster
```yaml
kind: Service
apiVersion: v1
metadata:
  name: nginx-ingress-controller
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-hostname: "hello.example.com"
```
Hello, I want to bump this issue: my home cluster is behind NAT and hairpinning is not possible with my current router.
From outside, the ingress ports are fully available and working, but from inside they are not.
I get this error:

```
Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://domain/.well-known/acme-challenge/ACME': Get http://domain/.well-known/acme-challenge/ACME: dial tcp 109.173.40.107:80: connect: connection timed out
```

The link is available via the internal address (for example when I test from my PC).
Is there any way to specify the address used for the self-check, or to just disable it?
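As a side note, the self-check can be reproduced by hand from inside the cluster, pinned to a chosen address, with curl's --resolve option; the domain, token, and internal IP below are placeholders:

```shell
# Force curl to resolve the domain to the internal ingress address
# (10.0.0.20, a placeholder) instead of the public IP that NAT blocks.
curl -v --resolve domain.example:80:10.0.0.20 \
  http://domain.example/.well-known/acme-challenge/TOKEN
```

If this request returns the token body, the solver and ingress route are in place and only the hairpin path is broken.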
It would also be nice if that could be disabled on a per-Certificate(Request) basis.
I have an issue with MetalLB + externalTrafficPolicy: Local, where the cert-manager validator cannot reach the solver because it runs on a different node than the "proxy" forwarding the requests to the solver.
Any thoughts on this?
I have the same issue as @MatthiasLohr. I recently introduced MetalLB to our cluster and I wasn't expecting certificate requests to stop working.
Does anyone know any workarounds for this?
Note: I'd prefer to keep the self-check; it feels like a good thing to have. Maybe allow specifying a specific IP address or Kubernetes Service that should be used instead? This would work for me, for example:

```shell
curl -H "Host: master.my-site.com.stage.example.com" nginx-external.ingress.svc.cluster.local/.well-known/acme-challenge/UQEly9jJVXURz9ggFx_6Ckrc4OKT0uBBMUr-3oDsvDA
```

But that assumes that all my certificate requests for those names go through the same Ingress controller, of course.
EDIT: I assume it's overly complex (and something we don't want to do here) to look at the IP address, see if it matches a loadBalancerIP in the cluster, and if so use the clusterIP instead?
Anton has volunteered to put a design document together for this feature! A big thank you - it'll be great to get input on this document once it's ready from those that require this feature! 😄
/assign @anton-johansson
What does that mean? Is there any ETA for when this feature will be available?
@MatthiasLohr I'm currently working on a design document where we can decide the best solution. It'll be up shortly.
Thank you! It would be really nice to have this feature as soon as possible; it's currently the last thing required for a production setup. I've been trying a lot of workarounds, but nothing is really reliable.
I'll do my best to get this included in the v0.15.0 release.
Awesome, thanks! If I can help somehow, please let me know.
@MatthiasLohr @WhitePhoera I am in your exact same situation. In my case, to work around this, I created an internal DNS zone with an entry matching the cert and pointed it at the IP address managed by MetalLB. It's by no means a long-term solution, but at least the certificate validated.
I cannot find this feature in the Helm chart options of v0.15.0. @anton-johansson, did you implement the feature in the latest release?
Unfortunately, no, @Knowledge91. We decided to tackle the issue from another angle, and since then I've had my hands full with work.
Ok. Thanks for the update! Keep up the good work :)
P.S. I solved my problem by using my consumer router in bridge mode and installing OpenWrt on a Raspberry Pi, which has NAT loopback built in.
Hey folks, any progress on that / solutions in sight?
+1 for me
Hey peeps! I apologize for this not being worked on. While doing some tests, it became clear that the solution wasn't going to work as expected. We discussed another solution, but it would require some rework, and the time that would take, combined with a lot of office hours, meant I just did not have time for it.
If anyone wants to resume the work, I'd be very happy!
PS: The reason the solution isn't working as expected is that without performing this self-check (or when ignoring its result), the challenge is reported to Let's Encrypt before the cluster has had time to create the challenge Pod and set up the Ingress route.
If this can help anybody: what I did was create iptables entries as follows, after which I am able to create the certificates as needed.

```shell
iptables -t nat -A PREROUTING -d 93.109.209.169 -p tcp --dport 80 -j DNAT --to 172.20.46.62:80
iptables -t nat -A PREROUTING -d 93.109.209.169 -p tcp --dport 443 -j DNAT --to 172.20.46.62:443
iptables -t nat -A POSTROUTING -s 172.20.46.0/24 -d 172.20.46.62 -p tcp --dport 80 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 172.20.46.0/24 -d 172.20.46.62 -p tcp --dport 443 -j MASQUERADE
```

93.109.209.169 is the public IP address my domain name points to, 172.20.46.62 is the internal (private) IP address of my server, and 172.20.46.0/24 is the address range of my network in CIDR notation.
In addition, I also edited my /etc/hosts file and added `172.20.46.62 mydomain.com`.
After this, I was able to generate the certificates as needed.
I'm sorry but maybe we're overthinking/overengineering this... In some cases (NAT!), it is simply needed to skip the self-check. We could jump through hoops and come up with all kinds of convoluted solutions, or we could just give some control to the user of the software and add a flag that allows the self-check to be skipped.
In this case cert-manager is actively working against the user by enforcing checks that are already done by the ACME provider anyway. I don't see the harm in allowing these checks to be bypassed on a per-solver basis. Having the self-checks is a good, sane default, but forcing them makes for a bad user experience.
I completely agree!
I can see the point of "we don't know when we are up and ready for LE to check us". Ok, then provide a second annotation for "wait X seconds before doing the LE request". And even if we run into LE limits, that's our problem, right?
The problem here is not rate limits; it's that if a challenge fails too often it gets marked as invalid and a new one has to be requested, so cert-manager has to fetch a new token, set up the DNS/Ingress again, and can end up in a weird race condition, which is what @anton-johansson ran into. Adding a wait time in seconds is a hack, and it feels a bit easy to get into trouble with. Just thinking out loud here:
Any suggestions?
I think DNS is a different problem. If you have to use HTTP01 together with the special setting mentioned above, the self-check gives false negative feedback.
Btw: why does cert-manager check for external access at all? Why not just rely on the Pod readiness state (which then has to be correctly implemented, of course)? If the Pod is ready (the Kubernetes readiness check returns success), then any reason the Pod might still be inaccessible from outside (e.g. the ingress controller screwed up) is outside the responsibility of cert-manager, and so checking for it should be too.
Am I missing something, or would it be enough to change the general behavior to monitor the readiness state? IMHO that would be a cleaner solution anyway, regardless of this problem. That's what readiness checks are made for.
Why does cert-manager check for external access at all?
It's in the RFC: "Clients SHOULD NOT respond to challenges until they believe that the server's queries will succeed."
https://tools.ietf.org/html/rfc8555#section-8.2
Some ingresses take a while to propagate routes to the solver pod; we have seen this happen and cause issues. The solver pod in 99% of cases starts far faster than the ingress becomes ready. cert-manager should have the responsibility to follow the spec and check that everything is working before replying to the ACME server. I've been talking to some co-workers, and we have apparently seen that checking the pod probes still leaves a race condition with the ingress (see the GCE ingress as a good example: it takes several minutes to propagate a route).
Personally I think cert-manager should obviously stick to the RFC by default, but is allowing an option to be added (on a per-solver basis) to bypass a self-check that simply doesn't work in some cases (again, NAT) really that bad? If I install Certbot on a server and run it, but my DNS is misconfigured, it will still perform the ACME request as all it sees is that the required vHost files are present on the server. It, too, will cause a user's rate limit to be exceeded if attempted too often. Is that such a problem, though? A user explicitly chose to disable a check because the aforementioned check cannot work on their infrastructure. I know in practice this obviously won't always be the case, but we should be able to trust that someone who uses an option like that has made sure everything else is in order, and even if they haven't or they have missed something, will realise that it is their own fault that they exceeded the rate limit because of this.
If I click Get Started on the Let's Encrypt website's home page, the first thing I see is Certbot. Certbot doesn't check for DNS propagation; it only checks the web server configuration. And even that is only a default that can be circumvented, because while sane, newbie-friendly defaults are good, there is no one-size-fits-all for everyone's workflow or setup.
If we really do insist on keeping the self-check mandatory (which I personally don't think is the right solution here), a relatively simple fix could be to add an annotation that allows the IP address to be overridden. This would mean:
Any thoughts on this?
I am not insisting on keeping a mandatory self-check, but we should take into account what @anton-johansson discovered while implementing this, and we should discourage use of the bypass. It would be good to build upon https://github.com/jetstack/cert-manager/pull/2783 to have a well-tested solution that lets us know when we're ready to serve without a self-check.
It's in the RFC: Clients SHOULD NOT respond to challenges until they believe that the server's queries will succeed. https://tools.ietf.org/html/rfc8555#section-8.2
In my view, the paragraph above says the client should implement a sanity check, not necessarily endpoint validation; i.e., the Pod's readiness state should be enough to meet this RFC requirement. If that doesn't succeed, then there is another bug that needs to be fixed, and if the ingress is the issue, it should be handled outside of cert-manager.
Just my view.
Basically my words, I agree!
If you're self-hosting a k8s cluster and can't use hairpin NAT, my workaround was adding a rewrite to the CoreDNS config to send requests for your domain to your load balancer Service.
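A sketch of that CoreDNS workaround, assuming the ingress Service is named `ingress-nginx-controller` in the `ingress-nginx` namespace and the public name is `example.org` (all three are placeholders; the `rewrite` plugin ships with CoreDNS):

```yaml
# Hypothetical coredns ConfigMap in kube-system: answer in-cluster lookups
# of the public name with the ingress Service's cluster-internal name,
# so the self-check never tries the unreachable external IP.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        rewrite name example.org ingress-nginx-controller.ingress-nginx.svc.cluster.local
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf
    }
```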
I also got stuck on this issue when setting up egress with a firewall for AKS. I hope the solution will be out soon.
I'm also in this situation, where my k8s cluster is behind a NAT gateway. I chased this issue for about 8 hours before I understood what was going on, mostly because my service's application, my ingress, and my router all run Nginx with the default 404 page.
As I see it, there are currently four options to make cert-manager work, although adding a fifth, a switch to disable the self-check in cert-manager, would be really appreciated.
In order of hard to easy, which is also best fix to worst fix ;-)
1. implement hairpin routing in the NAT gateway
2. implement split DNS between the external Internet and the internal LAN
3. implement split DNS internally within k8s by adding a zone into CoreDNS
4. add hostAliases to cert-manager
I started by implementing (4) because it was the simplest.
I know the IP address on my LAN where my ingress-controller Service is, since it's given a loadBalancerIP from a MetalLB pool. In my NAT gateway I DNAT ports 80 and 443 from my site's Internet IP address to this LAN address. So I just use that IP address in a hostAliases node in cert-manager's Deployment.
I added a section similar to this to the cert-manager Deployment, and restarted the deployment.
```yaml
hostAliases:
  - ip: 10.11.12.13
    hostnames:
      - example.org
      - www.example.org
```
Hopefully this will be useful for future solution searchers.
In our case, the node running Nginx Ingress controller somehow is able to visit the HTTP01 endpoint.
So we used podAffinity to schedule cert-manager on the same node and it solves the issue.
Although this is a dirty/temporary solution, it worked for me. Also, hairpinning is not the best solution at all.
Is your feature request related to a problem? Please describe.
Kinda intersects with #863: on NATed networks we can't successfully self-validate ACME challenges, because local k8s providers refuse to create hairpin NAT.
Describe the solution you'd like
A nice env variable or command-line flag to skip the local self-check and leave it up to the user.
Describe alternatives you've considered
None.
Environment details (if applicable):
/kind feature