Closed: thurianknight closed this issue 1 month ago
Can you see logs in the Azure Application Gateway (your ingress controller) when you access the URL via curl?
Also, you mentioned 2 of 3 worked. Are the other 2 on the same or a similar top-level domain, or on different subdomains?
Can you share the Ingress resources that do work?
Hey Hawksight, thanks for responding.
The 3 resources are all different hosts in the same domain. Here's an example of an ingress that is working:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/health-probe-status-codes: 200-399, 401-404
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/acme-challenge-type: http01
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: azure/application-gateway
  labels:
    app: test-apiissuer
    role: test-apiissuer-ingress
  name: test-apiissuer-ingress
  namespace: yellowstone-databus
spec:
  rules:
  - host: py-yellowstone-apiissuer.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: test-apiissuer-front-service
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - py-yellowstone-apiissuer.mydomain.com
    secretName: ys-apiissuer-cert
```
And here's the one that is failing:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/health-probe-status-codes: 200-399, 401-404
    appgw.ingress.kubernetes.io/rewrite-rule-set: SecurityHeadersRuleSet
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/acme-challenge-type: http01
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: azure/application-gateway
  labels:
    app: test-api
    role: test-api-ingress
  name: test-api-ingress
  namespace: yellowstone-databus
spec:
  rules:
  - host: py-yellowstone-api.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: test-api-front-service
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - py-yellowstone-api.mydomain.com
    secretName: ys-api-cert
```
They are so similar as to be laughable, since one works and the other fails. I have deleted and redeployed (with various changes) multiple times, but nothing has helped me to fix the issue or even figure out why the solver pod is returning a 404 error on the challenge request.
And I can completely bypass the ingress by using curl directly against the solver pod's IP -- it gives the same 404 error from the pod itself as when I go through the ingress.
The DNS records resolve to Akamai, but that tends to obfuscate errors coming from the origin, so I often use hosts file entries to go directly to the public IP of the AKS app gateway. But no matter what I do -- including as I mentioned, going directly to the pod in the cluster -- I get that blasted 404 error.
I should add, I upgraded cert-manager to 1.14.5 today but that also did not help.
So... I appear to have it fixed. Going to wait until tomorrow to do some more testing, then will post my findings here for others to maybe benefit from.
OK, my situation was probably somewhat unique. But for posterity's sake, I'll document it here.
Before deploying cert-manager, we had previously been testing a simple nginx web server that would serve up ACME challenge pages for Let's Encrypt. Basically the same thing that cert-manager does, but manually managed. We had an ingress configured for /.well-known/acme-challenge/, so that any challenge token could be served from that path.
All we had to do was modify the ingress for whatever host/domain names we needed... and that happened to include the one that has been causing me problems for the last couple of days.
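For context, the stale ingress looked roughly like the sketch below. This is a reconstruction from memory rather than a copy from the cluster, and the resource, namespace, and service names here are hypothetical:

```yaml
# Illustrative sketch of the old manually managed challenge ingress.
# Names are hypothetical; the key detail is the path rule, which
# claimed /.well-known/acme-challenge/ for hosts that cert-manager's
# own solver ingress later needed.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: manual-acme-ingress
  namespace: manual-acme          # the namespace we later deleted
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
spec:
  rules:
  - host: py-yellowstone-api.mydomain.com
    http:
      paths:
      - path: /.well-known/acme-challenge/
        pathType: Prefix
        backend:
          service:
            name: manual-acme-nginx   # hypothetical old nginx service
            port:
              number: 80
```

With a rule like this still in place, a challenge request for that host could be routed to the old nginx service instead of cert-manager's solver, which is exactly the behaviour described below.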
So this preexisting ingress configuration conflicted with the challenge ingress that cert-manager wanted to create. The end result was that the challenge pod was deployed and running, but it did not have the challenge token to serve up, which meant we were getting that 404 error from it when I did a GET request directly to the pod.
Simultaneously, any GET requests to the FQDN of the site were being directed to that old nginx web server (based on the ingress rule pattern match), and since the request was for a path/token that did not exist, the nginx server was also returning 404 to the requester.
The solution was to delete the namespace for that old nginx setup, which removed all the resources including the old ingress, and cert-manager immediately started working correctly, completed the challenge, and refreshed my ssl cert.
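In hindsight, the collision is easy to spot mechanically: two ingresses claiming the same host with overlapping path prefixes. Here is a minimal sketch of that kind of check; the rule tuples are illustrative sample data (as you might extract from `kubectl get ingress -A -o json`), not taken from the real cluster:

```python
# Detect ingress rules whose host + path prefix overlap -- the situation
# where a stale manual ACME ingress shadows cert-manager's solver ingress.

def find_overlaps(rules):
    """rules: list of (ingress_name, host, path) tuples.
    Returns (name_a, name_b, host) for pairs on the same host where one
    path is a prefix of the other."""
    overlaps = []
    for i, (name_a, host_a, path_a) in enumerate(rules):
        for name_b, host_b, path_b in rules[i + 1:]:
            if host_a == host_b and (
                path_a.startswith(path_b) or path_b.startswith(path_a)
            ):
                overlaps.append((name_a, name_b, host_a))
    return overlaps

# Sample data: the old manual ingress and a cert-manager solver ingress
# both match the challenge path on the same host.
rules = [
    ("manual-acme-ingress", "py-yellowstone-api.mydomain.com",
     "/.well-known/acme-challenge/"),
    ("cm-acme-http-solver-x", "py-yellowstone-api.mydomain.com",
     "/.well-known/acme-challenge/TOKEN"),
    ("test-apiissuer-ingress", "py-yellowstone-apiissuer.mydomain.com", "/"),
]

for a, b, host in find_overlaps(rules):
    print(f"{a} and {b} both match challenge paths on {host}")
```

A one-off script like this (fed from real `kubectl` output) would have surfaced the conflicting ingress days earlier.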
Re: Slack: I tried to setup a new Slack account so that I could post there, but after four or five failed attempts with weird errors from the Slack website, I gave up -- not incompetent, just impatient. Posting here instead.
Describe the bug: I deployed 3 new sets of resources into my AKS cluster, with their own ingresses and certificates. 2 of them worked perfectly, but 1 of them is hung up on that 404 error from the solver.
The solver pod is running of course. I can access it via curl and reproduce the 404 error like this:
```
curl -v http://172.20.0.xxx:8089/.well-known/acme-challenge/THIS_IS_MY_TOKEN
```
And the results:
Going through the public ingress gives the same 404 result.
Describing the challenge gives this status and events:
As mentioned, I deployed 3 sites within moments of each other, and 2 of them worked perfectly. And this system overall has been working perfectly for over 6 months. This is the first time I have encountered this issue.
I have tried deleting the order, challenge, solver pod and ingress and recreating the ingress. This recreated all the other resources in turn, as expected... with the same 404 error coming from the solver pod.
The solver pod has this in the log:
I realize that the solver pods run a very stripped-down "distroless" Linux image, so I'm not able to run any commands against it like `env` or `cat /var/log`, or anything else that I have tried. Anyone have advice for how to troubleshoot/resolve this?
Expected behaviour: The solver pod should return HTTP 200 and the challenge token.
Steps to reproduce the bug: Deploy a new ingress with a known-good domain name
Ingress yaml:
Anything else we need to know?:
Environment details:
/kind bug