cert-manager / cert-manager

Automatically provision and manage TLS certificates in Kubernetes
https://cert-manager.io
Apache License 2.0

Solver pod returns 404 error during http01 challenge #6990

Closed by thurianknight 1 month ago

thurianknight commented 1 month ago

Re: Slack: I tried to set up a new Slack account so that I could post there, but after four or five failed attempts with weird errors from the Slack website, I gave up -- not incompetent, just impatient. Posting here instead.

Describe the bug: I deployed 3 new sets of resources into my AKS cluster, each with its own ingress and certificate. Two of them worked perfectly, but one is hung up on a 404 error from the solver.

The solver pod is running, of course. I can reach it via curl and reproduce the 404 error like this: curl -v http://172.20.0.xxx:8089/.well-known/acme-challenge/THIS_IS_MY_TOKEN

And the results:

*   Trying 172.20.0.xxx:8089...
* TCP_NODELAY set
* Connected to 172.20.0.xxx (172.20.0.xxx) port 8089 (#0)
> GET /.well-known/acme-challenge/THIS_IS_MY_TOKEN HTTP/1.1
> Host: 172.20.0.xxx:8089
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 07 May 2024 15:38:27 GMT
< Content-Length: 19
< 
404 page not found

Going through the public ingress gives the same 404 result.
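One caveat with the direct-to-pod test (an assumption on my part, not something I've seen documented): curl sends the pod IP as the Host header, and if the solver matches requests on Host as well as path, that alone could produce a 404. Passing the real hostname explicitly rules that out:

```shell
# Direct-to-pod check with an explicit Host header; the IP and token are
# placeholders, same as above.
curl -v -H "Host: my.domain.com" \
  http://172.20.0.xxx:8089/.well-known/acme-challenge/THIS_IS_MY_TOKEN
```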

Describing the challenge gives this status and events:

Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for HTTP-01 challenge propagation: wrong status code '404', expected '200'
  State:       pending
Events:
  Type    Reason     Age   From                     Message
  ----    ------     ----  ----                     -------
  Normal  Started    50m   cert-manager-challenges  Challenge scheduled for processing
  Normal  Presented  50m   cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism
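For anyone triaging something similar, these are the generic places I'd look (namespace and names are placeholders; the solver pod label below is my understanding of what recent cert-manager releases apply, so verify it on your cluster):

```shell
# Inspect the challenge and its status/events.
kubectl get challenges -n my-namespace
kubectl describe challenge <challenge-name> -n my-namespace

# Check the solver pod's logs via its cert-manager label.
kubectl logs -n my-namespace -l acme.cert-manager.io/http01-solver=true

# Look for any other ingress in the cluster claiming the
# /.well-known/acme-challenge/ path.
kubectl get ingress -A
```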

As mentioned, I deployed 3 sites within moments of each other, and 2 of them worked perfectly. And this system overall has been working perfectly for over 6 months. This is the first time I have encountered this issue.

I have tried deleting the order, challenge, solver pod and ingress and recreating the ingress. This recreated all the other resources in turn, as expected... with the same 404 error coming from the solver pod.

The solver pod has this in the log:

I0507 15:47:42.695262 1 solver.go:87] "cert-manager/acmesolver: got successful challenge request, writing key" host="my.domain.com" path="/.well-known/acme-challenge/THIS_IS_MY_TOKEN" base_path="/.well-known/acme-challenge" token="THIS_IS_MY_TOKEN"

I realize that the solver pods run a very stripped-down "distroless" Linux image, so I'm not able to run commands against it like env or cat /var/log or anything else that I have tried.
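For what it's worth, the solver's decision logic is easier to reason about with a toy model. This is a loose Python sketch, not cert-manager's actual Go code; it assumes the solver knows exactly one (host, token, key authorization) triple and serves it only when both the Host header and the well-known path match. All names here are invented:

```python
# Loose sketch (NOT cert-manager's actual code) of how an HTTP-01 solver
# pod might decide what to serve: anything that doesn't match both the
# expected Host header and the expected token path gets a 404.

CHALLENGE_PREFIX = "/.well-known/acme-challenge/"

def solve(expected_host, token, key_auth, request_host, request_path):
    """Return (status_code, body) for a simulated challenge request."""
    # Strip an optional :port suffix from the Host header.
    host = request_host.rsplit(":", 1)[0]
    if host != expected_host:
        return 404, "404 page not found"
    if request_path != CHALLENGE_PREFIX + token:
        return 404, "404 page not found"
    return 200, key_auth

# Host and token match -> 200 with the key authorization:
print(solve("my.domain.com", "TOKEN", "TOKEN.THUMBPRINT",
            "my.domain.com", "/.well-known/acme-challenge/TOKEN"))
# (200, 'TOKEN.THUMBPRINT')

# Curling the pod IP directly sends "Host: 172.20.0.10:8089", so under
# this model even a correct token comes back 404:
print(solve("my.domain.com", "TOKEN", "TOKEN.THUMBPRINT",
            "172.20.0.10:8089", "/.well-known/acme-challenge/TOKEN"))
# (404, '404 page not found')
```

Under that model, a 404 from the pod means either the Host header, the token, or both don't match what the solver was started with.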

Anyone have advice for how to troubleshoot/resolve this?

Expected behaviour: The solver pod should return HTTP 200 and the challenge token.

Steps to reproduce the bug: Deploy a new ingress with a known-good domain name.

Ingress yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/health-probe-status-codes: 200-399, 401-404
    appgw.ingress.kubernetes.io/rewrite-rule-set: SecurityHeadersRuleSet
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/acme-challenge-type: http01
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: azure/application-gateway
  labels:
    app: test-api
    role: test-api-ingress
  name: test-api-ingress
  namespace: my-namespace
spec:
  rules:
  - host: my.domain.name
    http:
      paths:
      - backend:
          service:
            name: test-api-front-service
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - my.domain.name
    secretName: ys-api

Anything else we need to know?:

Environment details:

/kind bug

hawksight commented 1 month ago

Can you see logs in the Azure Application Gateway (your ingress controller) when you access the URL via curl?

Also, you mentioned 2 of 3 worked. Are the other 2 on the same or a similar top-level domain, or different subdomains? Can you share the Ingress resources that do work?

thurianknight commented 1 month ago

Hey Hawksight, thanks for responding.

The 3 resources are all different hosts in the same domain. Here's an example of an ingress that is working:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/health-probe-status-codes: 200-399, 401-404
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/acme-challenge-type: http01
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: azure/application-gateway
  labels:
    app: test-apiissuer
    role: test-apiissuer-ingress
  name: test-apiissuer-ingress
  namespace: yellowstone-databus
spec:
  rules:
  - host: py-yellowstone-apiissuer.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: test-apiissuer-front-service
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - py-yellowstone-apiissuer.mydomain.com
    secretName: ys-apiissuer-cert

And here's the one that is failing:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/health-probe-status-codes: 200-399, 401-404
    appgw.ingress.kubernetes.io/rewrite-rule-set: SecurityHeadersRuleSet
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/acme-challenge-type: http01
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: azure/application-gateway
  labels:
    app: test-api
    role: test-api-ingress
  name: test-api-ingress
  namespace: yellowstone-databus
spec:
  rules:
  - host: py-yellowstone-api.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: test-api-front-service
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - py-yellowstone-api.mydomain.com
    secretName: ys-api-cert

They are so similar as to be laughable, since one works and the other fails. I have deleted and redeployed (with various changes) multiple times, but nothing has helped me to fix the issue or even figure out why the solver pod is returning a 404 error on the challenge request.

And I can completely bypass the ingress by using curl directly against the solver pod's IP -- it gives a 404 error from the pod itself, same as when I go through the ingress.

The DNS records resolve to Akamai, but that tends to obfuscate errors coming from the origin, so I often use hosts file entries to go directly to the public IP of the AKS app gateway. But no matter what I do -- including as I mentioned, going directly to the pod in the cluster -- I get that blasted 404 error.

thurianknight commented 1 month ago

I should add that I upgraded cert-manager to 1.14.5 today, but that also did not help.

thurianknight commented 1 month ago

So... I appear to have it fixed. Going to wait until tomorrow to do some more testing, then will post my findings here for others to maybe benefit from.

thurianknight commented 1 month ago

OK, my situation was probably somewhat unique. But for posterity's sake, I'll document it here.

Before deploying cert-manager, we had been testing a simple nginx web server that served up ACME challenge pages for Let's Encrypt. Basically the same thing that cert-manager does, but manually managed. We had an ingress configured for /.well-known/acme-challenge/, so that any challenge token could be served from that path.

All we had to do was modify the ingress for whatever host/domain names we needed... and that happened to include the one that has been causing me problems for the last couple of days.

So this preexisting ingress configuration conflicted with the challenge ingress that cert-manager wanted to create. The end result was that the challenge pod was deployed and running, but it did not have the challenge token to serve up, which meant we were getting that 404 error from it when I did a GET request directly to the pod.

Simultaneously, any GET requests to the FQDN of the site were being directed to that old nginx web server (based on the ingress rule pattern match), and since the request was for a path/token that did not exist, the nginx server was also returning 404 to the requester.
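For illustration, the shadowing ingress looked roughly like this (the names and namespace here are invented; the important part is the /.well-known/acme-challenge/ path match sending challenge traffic to the old nginx service instead of cert-manager's solver):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
  name: legacy-acme-ingress       # hypothetical name
  namespace: legacy-acme          # hypothetical namespace
spec:
  rules:
  - host: py-yellowstone-api.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: legacy-acme-nginx   # the old manually managed nginx
            port:
              number: 80
        path: /.well-known/acme-challenge/
        pathType: Prefix
```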

The solution was to delete the namespace for that old nginx setup, which removed all of its resources including the old ingress. cert-manager immediately started working correctly, completed the challenge, and refreshed my SSL cert.