cert-manager / cert-manager

Automatically provision and manage TLS certificates in Kubernetes
https://cert-manager.io
Apache License 2.0

ACME HTTP01 Expired authorization on long running pods #1424

Closed · laurieodgers closed this issue 5 years ago

laurieodgers commented 5 years ago

Background: We have a unique use case: we use cert-manager to retrieve SSL certs for customer domains, as we offer wholesale services to our customers. We have had at least 50 ACME HTTP01 challenge pods running for ~26 days while we wait for our wholesale customers to change their DNS over from our old portal to the new one.

Describe the bug: While testing the DNS cutover internally, we ran into an authorization issue: ACME HTTP01 challenge pods don't seem to be updated with new authorizations from Let's Encrypt.

Expected behaviour: ACME challenge pods should pick up the correct authorization when Let's Encrypt updates its side, allowing the certificate to be issued.

Steps to reproduce the bug: Run an HTTP01 ACME challenge pod long enough for the authorization to expire.

Anything else we need to know?: This may well be solved by the fix for my previous bug report #1311 and the subsequent PR #1388, but that hasn't been released in an official version yet, so I've been unable to test it.

The workaround is to delete and recreate the relevant Certificate/Ingress Kubernetes API objects.
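For reference, a minimal sketch of that workaround against the resources named in the logs below (the manifest file name is hypothetical); deleting and re-applying the objects triggers a fresh ACME order:

$ kubectl -n web delete certificate whitelabel-customer1
$ kubectl -n web delete ingress whitelabel-customer1
$ kubectl -n web apply -f whitelabel-customer1.yaml   # re-apply the original manifests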

Logs:

I challenges controller: syncing item 'web/whitelabel-customer1-0'
I whitelabel-customer1-0: Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization
E challenges controller: Re-queuing item "web/whitelabel-customer1-0" due to error processing: acme: urn:ietf:params:acme:error:malformed: Expired authorization
I orders controller: syncing item 'web/whitelabel-customer1'
I challenges controller: syncing item 'web/whitelabel-customer1-0'
I Waiting for all challenges for order "whitelabel-customer1" to enter 'valid' state
I orders controller: Finished processing work item "web/whitelabel-customer1"
I whitelabel-customer1-0: Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization
E challenges controller: Re-queuing item "web/whitelabel-customer1-0" due to error processing: acme: urn:ietf:params:acme:error:malformed: Expired authorization
I challenges controller: syncing item 'web/whitelabel-customer1-0'
I whitelabel-customer1-0: Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization
E challenges controller: Re-queuing item "web/whitelabel-customer1-0" due to error processing: acme: urn:ietf:params:acme:error:malformed: Expired authorization
I challenges controller: syncing item 'web/whitelabel-customer1-0'
I whitelabel-customer1-0: Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization
E challenges controller: Re-queuing item "web/whitelabel-customer1-0" due to error processing: acme: urn:ietf:params:acme:error:malformed: Expired authorization
I challenges controller: syncing item 'web/whitelabel-customer1-0'
I whitelabel-customer1-0: Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization
E challenges controller: Re-queuing item "web/whitelabel-customer1-0" due to error processing: acme: urn:ietf:params:acme:error:malformed: Expired authorization
I challenges controller: syncing item 'web/whitelabel-customer1-0'
I ingress-shim controller: syncing item 'web/whitelabel-a5eqwj3xcxhtrhus5deoxlla36zqszqq'
I Not syncing ingress web/whitelabel-customer1 as it does not contain necessary annotations
I ingress-shim controller: Finished processing work item "web/whitelabel-customer1"

Environment details:

/kind bug

cyberaleks commented 5 years ago

+1 Same issue

munnerz commented 5 years ago

I don't think your issues will be resolved by that PR. This is actually a known issue, albeit only captured as a TODO here: https://github.com/jetstack/cert-manager/blob/master/pkg/controller/acmeorders/sync.go#L136-L137

We've erred on the side of caution here, so we don't over-query the ACME server for this information. We'll need to implement some kind of periodic resync of all pending challenge & order resources.

/remove-kind bug /kind feature /priority important-longterm /area acme
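Until such a resync exists, a quick way to spot challenges stuck on an expired authorization is to list and describe the pending resources (a sketch, using the v0.x certmanager.k8s.io resource names that appear elsewhere in this thread):

$ kubectl get orders.certmanager.k8s.io --all-namespaces
$ kubectl get challenges.certmanager.k8s.io --all-namespaces
$ kubectl describe challenges.certmanager.k8s.io whitelabel-customer1-0 -n web   # name taken from the logs above

A challenge hit by this problem shows "Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization" as the Reason in its status.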

aude commented 5 years ago

A couple of my HTTPS certs have also just expired, 90 days after creation:

$ kubectl get secret
NAME                             TYPE                                  DATA   AGE
...
something-tls                    kubernetes.io/tls                     3      89d
otherthing-tls                   kubernetes.io/tls                     2      109d
...
niftyapp-tls                     kubernetes.io/tls                     2      110d
soupmachine-tls                  kubernetes.io/tls                     2      90d
...

I fixed this by deleting the stale certificate secret, which triggered generation of a new certificate.

$ kubectl delete secret soupmachine-tls
secret "soupmachine-tls" deleted
$ # wait a bit... maybe check progress with `kubectl logs -n cert-manager cert-manager-*** -f`
$ kubectl get secret
NAME                             TYPE                                  DATA   AGE
...
something-tls                    kubernetes.io/tls                     3      89d
otherthing-tls                   kubernetes.io/tls                     2      109d
...
niftyapp-tls                     kubernetes.io/tls                     2      110d
soupmachine-tls                  kubernetes.io/tls                     2      3m
...

This is how I managed to fix the "Expired authorization" issue.


I installed cert-manager around 4 months ago.

Then, I upgraded cert-manager around 1 month ago. That didn't go well, because I had not read the upgrade guide. So I reinstalled cert-manager, but it seems I kept the old certs (judging by the age of the secrets shown above).

As I understand @munnerz's reply, what happened here is that my local cert-manager's view of when the cert needs to be renewed is out of sync with Let's Encrypt's view. The "timer" might have been reset when I reinstalled cert-manager.

Please correct me if my theory is wrong; I'd just like to understand the issue.
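One way to sanity-check that theory is to read the expiry date out of the served certificate itself rather than relying on the secret's age; a sketch, using the soupmachine-tls secret from the listing above:

$ kubectl get secret soupmachine-tls -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -enddate
notAfter=...   # if this date is in the past, the certificate really has expired and needs to be re-issued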

Edit

I also noticed I have almost no orders.certmanager.k8s.io resources, far fewer than certificates.certmanager.k8s.io. That could be relevant as well.

$ kubectl get orders.certmanager.k8s.io --no-headers --all-namespaces | wc -l
3
$ kubectl get certificates.certmanager.k8s.io --no-headers --all-namespaces | wc -l
7
munnerz commented 5 years ago

#1589 tracks another case of this and proposes a solution.

In your particular case, it would be better not to start the Order flow at all until you know the self check will pass (i.e. perform the self check with some dummy values ahead of time).

cert-manager is not currently well geared for this use case and would require some extra work to make it possible.

If you wanted to achieve this today, you might be able to create/inject fake 'Challenge' resources in order to trigger at least the self check stage of the authorization flow, but you may run into some weird issues.

laurieodgers commented 5 years ago

Understood. I've gone ahead and put checks in place to ensure that DNS has been cut over before creating the Certificate/Ingress objects in Kubernetes. This will help us save on resource usage as well.
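For illustration, a minimal sketch of such a pre-check; the domain, ingress IP and manifest file name are all hypothetical:

# only create the Certificate/Ingress once the customer's DNS points at our ingress
if dig +short customer1.example.com | grep -qx 203.0.113.10; then
  kubectl apply -f customer1-cert-and-ingress.yaml
else
  echo "DNS not cut over yet, skipping"
fi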

I'm satisfied with this outcome, so it's up to you whether to close this ticket or keep it open.

Thanks for the help and for the great piece of software!

eugenestarchenko commented 5 years ago

I got into almost the same situation as described above, but I did read the upgrade guide. I jumped from 0.7 to 0.7.2 hoping it would resolve this error. The one thing I don't clearly understand: cert-manager was redeployed into a new namespace (cert-manager) while the old certs and secrets had been created earlier in the "default" namespace. Can this cause the situation described above? Should I "re-order" new certs by deleting the old ones, or can cert-manager handle that automatically? It looks like I need to update to 0.8.0 because of https://github.com/jetstack/cert-manager/pull/1603

PS: I solved this by deleting the old certs and secrets.

➜  ~ kubectl get pods --all-namespaces
NAMESPACE      NAME                                             READY   STATUS      RESTARTS   AGE
cert-manager   cert-manager-6f5c7f9bff-w94ql                    1/1     Running     0          2h
cert-manager   cert-manager-cainjector-64c799c8f9-59p69         1/1     Running     0          2h
cert-manager   cert-manager-webhook-7646699c48-6kbhv            1/1     Running     0          2h
default        cm-acme-http-solver-25h87                        1/1     Running     0          2h
default        cm-acme-http-solver-9wqqs                        1/1     Running     0          2h
default        cm-acme-http-solver-hhq86                        1/1     Running     0          2h
default        cm-acme-http-solver-tpjt7                        1/1     Running     0          2h
➜  ~ kubectl describe challenge

Status:
  Presented:   true
  Processing:  true
  Reason:      Error accepting challenge: acme: urn:ietf:params:acme:error:malformed: Expired authorization
  State:       pending
➜  ~ kubectl get orders
NAME   STATE     AGE
XXX    pending   10d
XXX    pending   22d
XXX    pending   22d
XXX    pending   22d
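The cleanup described in the PS amounts to something like the following (hypothetical names); once the old objects are gone, cert-manager re-creates the Certificate from the Ingress annotations (or you re-apply it manually) and starts a fresh ACME order:

$ kubectl -n default delete certificate example-com        # hypothetical Certificate name
$ kubectl -n default delete secret example-com-tls         # hypothetical Secret name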
retest-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

retest-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle rotten /remove-lifecycle stale

retest-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to jetstack. /close

jetstack-bot commented 5 years ago

@retest-bot: Closing this issue.

In response to [this](https://github.com/jetstack/cert-manager/issues/1424#issuecomment-541421525):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
> Send feedback to [jetstack](https://github.com/jetstack).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
andreisaikouski commented 2 years ago

I see this issue was closed by a bot. Has anyone found a resolution or root cause?

eddiewebb commented 1 year ago

We had to delete expired challenges.
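For anyone landing here later, a sketch of that cleanup using the current resource names (placeholder name/namespace); deleting the stuck Challenge, and if necessary its parent Order or Certificate, lets cert-manager start a fresh authorization:

$ kubectl get challenges --all-namespaces
$ kubectl delete challenge <challenge-name> -n <namespace>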