mitom opened 3 years ago
I thought I had fixed this issue in https://github.com/jetstack/cert-manager/pull/3485... 😅
I am not sure of the best way to overcome this; maybe just keep the error status (e.g. AccessDenied) as the queue key?
I was thinking of something similar: when diffing the old and the new Route 53 error messages, e.g.:
# Old
AccessDenied: User: arn:aws:sts::****:assumed-role/k8s-development-cert-manager/1621944747142088655 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/****
# New
AccessDenied: User: arn:aws:sts::****:assumed-role/k8s-development-cert-manager/1621944747402370662 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/****
We could just compare the first segment, AccessDenied.
😅 The aws-go-sdk makes it really difficult to know whether two errors come from a similar issue 😥
/area acme/dns01 /priority important-soon
Seeing a similar issue, which ends up exhausting the R53 rate limit. Running v1.5.3.
.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Send feedback to jetstack.
/close
@jetstack-bot: Closing this issue.
Also saw this on 1.7.1:
...
E0314 21:52:39.738514 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759484156602 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:39.786142 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759550110356 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-2701034934"
E0314 21:52:39.923365 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759738805868 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.160166 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294759923580447 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.375269 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760160448322 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.604273 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760375454212 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:40.837623 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760604472306 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
E0314 21:52:41.051045 1 controller.go:163] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="failed to change Route 53 record set: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager.k8s.test.us-east-1.centrio.com/1647294760837820211 is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::hostedzone/*** because no identity-based policy allows the route53:ChangeResourceRecordSets action" "key"="default/foo-test-9q9pv-1595503603-3894136473"
...
seems like this is still an issue?
Also seeing this issue on 1.8.2
In my case, however, the job failed because the cert-manager pod was unable to assume a cross-account role using IRSA, and the exponential backoff did not apply:
Nov 29 15:59:59.162 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599249229417 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 0dbf1684-fe05-46e2-8a10-1fb082d46b08" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:59.106 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599207467915 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 34cb9108-70fc-4f95-8fd6-cbf1bca9c709" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:59.058 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599163773874 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: d9d50a1f-6570-4e7e-813f-754a0c7ae7d1" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:59.003 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599106849109 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 5f1b3e02-00d7-4b02-be65-59e550108b55" "key"="test/bar-test-tvmwl-41763899-4220912489"
Nov 29 15:59:58.956 cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="error instantiating route53 challenge solver: unable to assume role: AccessDenied: User: arn:aws:sts::***:assumed-role/cert-manager/1669737599059306645 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::***:role/dev-cert-manager
\tstatus code: 403, request id: 212e1b5f-08c7-4e06-9374-7c02237d663c" "key"="test/bar-test-tvmwl-41763899-4220912489"
/reopen /remove-lifecycle rotten
@jprenken: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen /remove-lifecycle rotten
@irbekrm: Reopened this issue.
Thanks everyone for investigating the root cause of this problem. I am not familiar with this part of cert-manager, but I have recently seen a similar problem in the Azure DNS01 driver.
If one of you has time to fix this and create a PR, please pester me in #cert-manager-dev on Slack for a review.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Send feedback to jetstack.
/close
@jetstack-bot: Closing this issue.
/reopen /remove-lifecycle rotten
@JacobAmar: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@wallrj: Reopened this issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to jetstack.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
/lifecycle rotten
/remove-lifecycle stale
Describe the bug: When a certificate is requested with DNS verification against a domain that cert-manager can't edit (and it isn't meant to be able to), it fails (as expected) but re-queues the job with the error message. The error message contains the session name, which when using IRSA seems to have a suffix that changes on each attempt:
(note the /1621944747142088655 part in the error messages). This seems to cause the error queuing to treat these as distinct errors and not apply the backoff when retrying. Left alone, it eventually exhausts the R53 rate limit. It is the same effect as #3222, where @maelvls fixed the request ID part. Going back to that, I can see why they would not have been able to reproduce the issue above: their testing used an IAM user, which always shows up as the same principal in the error message, whereas IRSA (or maybe even any role?) exhibits this behaviour.
I am not sure of the best way to overcome this; maybe just keep the error status (e.g. AccessDenied) as the queue key?
Expected behaviour: When cert-manager can't fulfill a DNS verification request, it should back off rather than DDoS the provider.
Steps to reproduce the bug:
Anything else we need to know?:
Environment details:
/kind bug