cert-manager / cert-manager

Automatically provision and manage TLS certificates in Kubernetes
https://cert-manager.io
Apache License 2.0

Route53 Provider Assume Role Error - Missing Region #7102

Open pchang388 opened 2 weeks ago

pchang388 commented 2 weeks ago

Problem: After upgrading from v1.14.4 to v1.15.0 (with the CRDs upgraded beforehand), I am no longer able to manually trigger a renewal via cmctl. When I attempt to do so, this message shows up in the cert-manager pod logs:

E0617 07:23:49.725504       1 controller.go:162] "re-queuing item due to error processing" err="error instantiating route53 challenge solver: unable to assume role: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region" logger="cert-manager.controller"

This worked previously, and my ClusterIssuer configuration hasn't been an issue before. The region field is specified, and the configuration looks like this:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-issuer
  namespace: cert-manager
spec:
  acme:
    email: <REDACTED>
    privateKeySecretRef:
      name: letsencrypt-prod-key
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - dns01:
          route53:
            region: us-east-2
            accessKeyID:  <REDACTED>
            secretAccessKeySecretRef:
              name: prd-route53-credentials-secret
              key: secret-access-key
            hostedZoneID:  <REDACTED>
            role: <REDACTED>

cmctl was used as mentioned and shows version:

$ cmctl version
Client Version: util.Version{GitVersion:"v1.14.2", GitCommit:"306e329365989f205185024a86de9b9d4bad10a5", GitTreeState:"", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}

Expected behaviour: Assume role works with AWS Route53 provider as it has in previous versions of cert-manager.

Steps to reproduce the bug:

Environment details:

/kind bug

pchang388 commented 2 weeks ago

I took a quick look at the changes made in v1.15.0 as mentioned in the release notes (#6878), but at a surface level (just the diffs) I didn't see anything that would cause this. I went ahead and downgraded back to v1.14.4 to see if the issue occurs there. It does not: cert-manager was able to renew the certificate that was left pending by the manual renewal on v1.15.0.

Downgrade steps I followed (I couldn't find whether this is supported or recommended):

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.crds.yaml

helm rollback cert-manager -n cert-manager

From what I can tell so far, this appears to be an issue introduced in v1.15.0. I pretty much followed the guide (excluding the cross-account access and IRSA parts) to set up the Route53 provider and ClusterIssuer, and didn't have problems before.
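For anyone trying to reproduce this, the manual renewal can be triggered and then inspected roughly as follows. This is only a sketch: the helper name `renew_and_status`, the certificate name, and the namespace are hypothetical placeholders, not values from this report. It relies on the `cmctl renew` and `cmctl status certificate` subcommands.

```shell
# Hypothetical helper: trigger a manual renewal, then print the
# certificate's status so a stuck Challenge becomes visible.
# $1 = certificate name (placeholder), $2 = namespace (defaults to "default").
renew_and_status() {
  cert_name="$1"
  cert_namespace="${2:-default}"
  # Only query the status if the renewal request itself was accepted.
  cmctl renew "$cert_name" -n "$cert_namespace" &&
    cmctl status certificate "$cert_name" -n "$cert_namespace"
}
```

Example call, with made-up names: `renew_and_status example-cert default`. On an affected version the status output would be expected to show the challenge not progressing, while the controller logs carry the "Missing Region" error.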

cwyl02 commented 1 week ago

Would it be a good idea to override AWS_REGION to be aws-global as a temporary workaround for v1.15.0? I think this could be OK, since the Route53 challenge is the only AWS API cert-manager uses?

hongbo-miao commented 6 days ago

Just adding more info that may help with debugging.

I hit the same error on v1.15.1 (the Helm chart and cert-manager.crds.yaml are both on that version):

error instantiating route53 challenge solver: unable to assume role: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region

Downgrading to v1.14.7 works for me as a temporary solution.
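To check whether a given install is hitting the same failure, the controller logs can be filtered for the message quoted above. A minimal sketch, where the function name `missing_region_errors` is made up for illustration:

```shell
# Keep only the log lines on stdin that contain the Route53
# "Missing Region" failure. grep -F matches the text literally, and the
# function returns non-zero when no line matches.
missing_region_errors() {
  grep -F 'Invalid Configuration: Missing Region'
}
```

Typical use would be `kubectl logs -n cert-manager deployment/cert-manager | missing_region_errors`, assuming the controller Deployment is named `cert-manager` in the `cert-manager` namespace (this depends on how the chart was installed).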

k11h-de commented 6 days ago

> would it be a good idea to override AWS_REGION to be aws-global to serve as a temporary workaround for v1.15.0? I think this could be OK since the route53 challenge is the only AWS API cert-manager uses?

@cwyl02 Setting this env var via Helm worked for me as a "workaround", but it looks like cleanup is no longer working, at least for me.

extraEnv:
  - name: AWS_REGION
    value: 'aws-global'

The cleanup error:

E0628 11:42:46.460550       1 sync.go:283] "error cleaning up challenge" err="failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: <REDACTED>, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.xxxxxxxxxx.', type='TXT', set-identifier='\"yyyyyyyyyyyy\"'] but it was not found]" logger="cert-manager.controller.finalizer" resource_name="aaaaaaaaaaaa" resource_namespace="bbbbbbb" resource_kind="Challenge" resource_version="v1" dnsName="foo.bar.com" type="DNS-01"
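When trying the extraEnv workaround, it may be worth confirming that the variable actually reached the controller pod spec before drawing conclusions from the logs. A sketch using kubectl's JSONPath output; the function name `controller_aws_region` is hypothetical, and the default Deployment name and namespace (`cert-manager` for both) are assumptions about the install:

```shell
# Print the AWS_REGION value set on the first container of the
# cert-manager controller Deployment (prints nothing if it is unset).
# $1 = namespace, $2 = Deployment name; both default to "cert-manager",
# which is an assumption about how the chart was installed.
controller_aws_region() {
  kubectl get deployment "${2:-cert-manager}" -n "${1:-cert-manager}" \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="AWS_REGION")].value}'
}
```

With the Helm values shown above applied, `controller_aws_region` would be expected to print `aws-global`.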

I guess it is a good idea to wait for https://github.com/cert-manager/cert-manager/pull/7108 to be merged (the PR is waiting for @pchang388 to confirm this solves the issue)

wallrj commented 2 days ago

> I guess it is a good idea to wait for #7108 to be merged (the PR is waiting for @pchang388 to confirm this solves the issue)

@hongbo-miao, @cwyl02, @k11h-de: we'd be grateful if any of you could test that PR.