cert-manager / cert-manager

Automatically provision and manage TLS certificates in Kubernetes
https://cert-manager.io
Apache License 2.0

RFC2136 challenge update queries fail silently if target nameserver listens on UDP but forces re-querying over TCP #6413

Closed Tristan971 closed 2 weeks ago

Tristan971 commented 10 months ago

Describe the bug: When using an RFC2136-type issuer, the record update query is sent over UDP and is never retried over TCP when the nameserver replies with TC=1.

Expected behaviour: Such a situation should lead to one of:

  1. Preferably a retry over TCP (assuming this is not subject to some DNS spec exception)
  2. At least an error message

Steps to reproduce the bug:

  1. Set up a nameserver that the RFC2136 issuer is to interact with
  2. Configure that nameserver to always force clients to use TCP (i.e. reply with TC=1 regardless of the input query, including update queries)
  3. Trigger a certificate issuance, then notice that no error message is logged and the process gets stuck waiting forever for an updated record that will never be published
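For anyone without a nameserver that can be configured this way, step 2 can be approximated with a tiny UDP responder. This is a minimal sketch using only the Go standard library; `truncatedReply` and `serveAlwaysTruncated` are hypothetical helpers, and the reply is header-only. A real nameserver would also echo the question section and handle TSIG, so a client using TSIG may react differently to this stand-in.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// truncatedReply builds a header-only DNS reply for a raw query: same ID,
// same opcode, QR=1 (response), TC=1 (truncated), all counts zero.
func truncatedReply(query []byte) ([]byte, bool) {
	if len(query) < 12 {
		return nil, false // not even a full DNS header
	}
	reply := make([]byte, 12)
	copy(reply[:2], query[:2])                 // transaction ID
	reply[2] = 0x80 | (query[2] & 0x78) | 0x02 // QR | opcode | TC
	return reply, true
}

// serveAlwaysTruncated answers every UDP datagram with TC=1 until the
// connection is closed.
func serveAlwaysTruncated(conn net.PacketConn) {
	buf := make([]byte, 512)
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil {
			return // listener closed
		}
		if reply, ok := truncatedReply(buf[:n]); ok {
			conn.WriteTo(reply, addr)
		}
	}
}

func main() {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer conn.Close()
	go serveAlwaysTruncated(conn)

	client, err := net.Dial("udp", conn.LocalAddr().String())
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer client.Close()
	client.SetDeadline(time.Now().Add(2 * time.Second))

	// Header-only message with opcode 5 (UPDATE): 5<<3 = 0x28.
	query := []byte{0x12, 0x34, 0x28, 0, 0, 0, 0, 0, 0, 0, 0, 0}
	if _, err := client.Write(query); err != nil {
		fmt.Println("write:", err)
		return
	}
	reply := make([]byte, 512)
	n, err := client.Read(reply)
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	fmt.Println("TC set in reply:", n >= 12 && reply[2]&0x02 != 0)
}
```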

Anything else we need to know?:

Here's an annotated excerpt of the associated logs at logLevel 6, which shows the issue better:

[...]

// presenting challenge
2023-10-13 01:18:15.251 I1013 00:18:15.251415       1 dns.go:88] "cert-manager/challenges/Present: presenting DNS01 challenge for domain" resource_name="mangadex-dev-1-199145890-3645171849" resource_namespace="cert-manager" resource_kind="Challenge" resource_version="v1" dnsName="mangadex.dev" type="DNS-01" resource_name="mangadex-dev-1-199145890-3645171849" resource_namespace="cert-manager" resource_kind="Challenge" resource_version="v1" domain="mangadex.dev"
2023-10-13 01:18:15.265 I1013 00:18:15.262292       1 rfc2136.go:57] "cert-manager: Creating RFC2136 Provider"
2023-10-13 01:18:15.265 I1013 00:18:15.262652       1 rfc2136.go:84] DNSProvider nameserver:       10.233.1.53:53
2023-10-13 01:18:15.265 I1013 00:18:15.262877       1 rfc2136.go:85]             tsigAlgorithm:    hmac-sha512.
2023-10-13 01:18:15.265 I1013 00:18:15.263015       1 rfc2136.go:86]             tsigKeyName:      key_xfr_certmanager
2023-10-13 01:18:15.265 I1013 00:18:15.263148       1 rfc2136.go:93]             tsigSecret:       [...]

// cert-manager is convinced that presentation was successful
2023-10-13 01:18:15.268 I1013 00:18:15.268141       1 dns.go:116] "cert-manager/challenges/Check: checking DNS propagation" [...]
2023-10-13 01:18:15.268 I1013 00:18:15.268532       1 logs.go:206] "cert-manager/controller: Event(v1.ObjectReference{...}): type: 'Normal' reason: 'Presented' Presented challenge using DNS-01 challenge mechanism"

// cert-manager moves on to checking on auth NS
// correctly retrying with TCP in this case, but the update will never come
[...]
2023-10-13 01:18:15.339 I1013 00:18:15.339459       1 wait.go:397] Returning authoritative nameservers [ns1.mangadex.dev., ns2.mangadex.dev.]
2023-10-13 01:18:15.386 I1013 00:18:15.385635       1 wait.go:202] UDP dns lookup failed, retrying with TCP: <nil>
2023-10-13 01:18:15.391 I1013 00:18:15.390739       1 wait.go:145] Looking up TXT records for "dns01.acme.dev.mangadex.tech."

An easy workaround for us was to allow UDP towards our hidden master (which sits inside our cluster, hence the issuer's nameserver IP), since we can then keep our TCP-only policy on our publicly exposed replicas.

The issue is twofold, and both parts are in this code:

https://github.com/cert-manager/cert-manager/blob/b53527eb787c508a2dc0a27853cd4eb4b138faf6/pkg/issuer/acme/dns/rfc2136/rfc2136.go#L137-L144

  1. A truncated response seemingly doesn't get caught as an error, since it is still a "correct" response (valid, and carrying no error code)
  2. There is no log for a successfully sent query, even at the max log level, which makes this quite a bit more difficult to debug than necessary imo, even if the status quo is meant to be kept
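Both points could be addressed by a reply check along these lines. This is a sketch under assumptions, not the code at the link above: `checkUpdateReply` is a hypothetical helper that works on the raw reply header, and the log wording is made up for illustration.

```go
package main

import (
	"fmt"
	"log"
)

// checkUpdateReply inspects the 12-byte header of a raw DNS reply.
// It returns an error for a truncated reply or a non-zero response code,
// and logs a line when the update was acknowledged.
func checkUpdateReply(reply []byte, nameserver string) error {
	if len(reply) < 12 {
		return fmt.Errorf("DNS update to %s: reply too short (%d bytes)", nameserver, len(reply))
	}
	// TC is bit 0x02 of the third header byte.
	if reply[2]&0x02 != 0 {
		return fmt.Errorf("DNS update to %s: reply truncated (TC=1), the update may not have been applied", nameserver)
	}
	// The response code is the low 4 bits of the fourth header byte.
	if rcode := reply[3] & 0x0F; rcode != 0 {
		return fmt.Errorf("DNS update to %s: server returned rcode %d", nameserver, rcode)
	}
	log.Printf("DNS update to %s acknowledged", nameserver)
	return nil
}

func main() {
	ns := "10.233.1.53:53" // nameserver address taken from the logs above

	truncated := []byte{0, 1, 0x80 | 0x28 | 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0}
	fmt.Println(checkUpdateReply(truncated, ns))

	ok := []byte{0, 1, 0x80 | 0x28, 0, 0, 0, 0, 0, 0, 0, 0, 0}
	fmt.Println(checkUpdateReply(ok, ns))
}
```

With a check like this, the situation described here would show up as an error on the challenge instead of a `Presented` event followed by an endless propagation wait.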

Environment details:

/kind bug

guerzon commented 10 months ago

Hey @inteon, I would like to try to work on this issue. Kindly assign to me, thanks!

jetstack-bot commented 7 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

arsenalzp commented 7 months ago

Hello! It is the omnipresent Oleksandr again. Are you planning to include this fix in a new release? If not, I would like to work on this.

inteon commented 7 months ago

@arsenalzp Yes, you can claim this issue. We will gladly accept a PR that fixes this issue 👍.

jetstack-bot commented 6 months ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten /remove-lifecycle stale

inteon commented 6 months ago

/remove-lifecycle rotten

cert-manager-bot commented 3 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

cert-manager-bot commented 2 months ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten /remove-lifecycle stale

erikgb commented 1 month ago

/priority backlog

cert-manager-bot commented 2 weeks ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. /close

cert-manager-prow[bot] commented 2 weeks ago

@cert-manager-bot: Closing this issue.

In response to [this](https://github.com/cert-manager/cert-manager/issues/6413#issuecomment-2290926132):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.