mitom opened 3 years ago
I thought I had fixed this issue in https://github.com/jetstack/cert-manager/pull/3485... 😅
I am not sure of the best way to overcome this; maybe just keep the error status (e.g. AccessDenied) as the queue key?
I was thinking of something similar: when diffing the old and the new Route 53 error messages, e.g.:
# Old
AccessDenied: User: arn:aws:sts::****:assumed-role/k8s-development-cert-manager/1621944747142088655 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/****
# New
AccessDenied: User: arn:aws:sts::****:assumed-role/k8s-development-cert-manager/1621944747402370662 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/****
We could just compare the first segment, AccessDenied.
😅 The aws-go-sdk makes it really difficult to know whether two errors come from a similar issue 😥
/area acme/dns01 /priority important-soon
Seeing a similar issue, which ends up exhausting the R53 rate limit. Running v1.5.3.
.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Send feedback to jetstack.
/close
@jetstack-bot: Closing this issue.
Also saw this on 1.7.1:
...
E0314 21:52:39.738514 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759484156602 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:39.786142 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759550110356 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-2701034934"
E0314 21:52:39.923365 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759738805868 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.160166 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759923580447 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.375269 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760160448322 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.604273 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760375454212 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.837623 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760604472306 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:41.051045 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760837820211 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
...
seems like this is still an issue?
Also seeing this issue on 1.8.2
In my case, however, the job failed because the cert-manager pod was unable to assume a cross-account role using IRSA, and the exponential backoff did not apply:
Nov 29 15:59:59.162 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599249229417 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 0dbf1684-fe05-46e2-8a10-1fb082d46b08" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:59.106 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599207467915 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 34cb9108-70fc-4f95-8fd6-cbf1bca9c709" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:59.058 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599163773874 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: d9d50a1f-6570-4e7e-813f-754a0c7ae7d1" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:59.003 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599106849109 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 5f1b3e02-00d7-4b02-be65-59e550108b55" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:58.956 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599059306645 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 212e1b5f-08c7-4e06-9374-7c02237d663c" "key"="test/bar-test-tvmwl-41763899-4220912489"
/reopen /remove-lifecycle rotten
@jprenken: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen /remove-lifecycle rotten
@irbekrm: Reopened this issue.
Thanks everyone for investigating the root cause of this problem. I am not familiar with this part of cert-manager, but I have recently seen a similar problem in the Azure DNS01 driver.
If one of you has time to fix this and create a PR, please pester me in #cert-manager-dev on Slack for a review.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Send feedback to jetstack.
/close
@jetstack-bot: Closing this issue.
/reopen /remove-lifecycle rotten
@JacobAmar: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@wallrj: Reopened this issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
/lifecycle rotten
/remove-lifecycle stale
Describe the bug: When a certificate is requested with DNS verification against a domain that cert-manager can't edit (and it isn't meant to be able to), it fails (as expected) but re-queues the job with the error message. The error message contains the session name, which when using IRSA seems to have a suffix that changes on each attempt:
(note the /1621944747142088655 part in the error messages). This seems to cause the error queuing to treat these as distinct errors and not apply the backoff when retrying. Left alone, it eventually exhausts the R53 rate limit. It is the same effect as #3222, where @maelvls fixed the request ID part. Going back to that, I can see why they would not have been able to reproduce the issue above: their testing used an IAM user, which always shows up as the same principal in the error message, whereas IRSA (or maybe even any role?) exhibits this behaviour.
I am not sure of the best way to overcome this; maybe just keep the error status (e.g. AccessDenied) as the queue key?
Expected behaviour: When cert-manager can't fulfill a DNS verification request, it should back off rather than DDoS the provider.
Steps to reproduce the bug:
Anything else we need to know?:
Environment details:
/kind bug