aws-controllers-k8s / community

AWS Controllers for Kubernetes (ACK) is a project enabling you to manage AWS services from Kubernetes
https://aws-controllers-k8s.github.io/community/
Apache License 2.0
2.36k stars 248 forks source link

Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

Open nampnguyen opened 4 months ago

nampnguyen commented 4 months ago

Describe the bug When a controller starts while using CARM, particularly if the pod has a CPU limit specified (~250m), the account config map can sometimes be read after reconciliation starts. When this happens, the controller no longer assumes the appropriate cross account role and results in AccessDenied errors because the IRSA role does not have permissions to resources in other accounts.

The problem seems to be getting worse over time and may be related to the number of namespaces.

Even after the ack-role-account-map is read, the controller does not use the correct cross account role.

Debug logs (with some redactions):

{"level":"info","ts":"2024-02-09T19:54:29.148Z","logger":"setup","msg":"initializing service controller","aws.service":"dynamodb"}
{"level":"debug","ts":"2024-02-09T19:54:29.148Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-2"}
...
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-39"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-40"}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Backup","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"GlobalTable","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Table","resync period seconds":36000}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"setup","msg":"starting manager","aws.service":"dynamodb"}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
...
{"level":"info","ts":"2024-02-09T19:54:29.171Z","msg":"Starting Controller","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table"}
{"level":"info","ts":"2024-02-09T19:54:29.273Z","msg":"Starting workers","controller":"adoptedresource","controllerGroup":"services.k8s.aws","controllerKind":"AdoptedResource","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"fieldexport","controllerGroup":"services.k8s.aws","controllerKind":"FieldExport","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"debug","ts":"2024-02-09T19:54:29.277Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"> r.Sync","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">>> rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.311Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<<< rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<< rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}

Steps to reproduce

  1. Configure CARM with a large number of namespaces (~30-40) and a CPU limit on the controller pod
  2. Create an ACK resource
  3. Restart the controller pod to trigger reloading of the config map and reconciliation

Expected outcome Resource resyncing to wait until after the ack-role-account-map ConfigMap is read or future resyncs pivots to the cross account role if the ConfigMap is read after resyncing starts.

Environment

a-hilaly commented 4 months ago

@nampnguyen Is this happening only during controller startup, or have you observed it persisting even after multiple reconciliations?

We can change a few things in the ACK runtime to mitigate this:

Quick note: We really need to start stress and load testing for ACK controllers, especially for components like CARM watchers.. i'm pretty sure there is a lot we can catch in there.

nampnguyen commented 4 months ago

@a-hilaly In our experience, if the reconciler starts before the ack-role-account-map ConfigMap is read, the issue persists for the life of the controller. The only option is to restart the controller and roll the dice again.

When we originally saw this issue when we had a CPU limit of 250m, but then increased it to 2000m. After the limit increase, the frequency of reconciler errors went down significantly. We implemented a cronjob to check for reconciler errors and restarted the controller if that happened which took care of the few times it popped back up. However, since that time the number of namespaces and accounts used by ACK has grown and we now see some controllers continue to experience reconciler errors even after dozens of restarts.

a-hilaly commented 4 months ago

@nampnguyen we starting rolling out a fix for all the controllers - i'll ping here once dynamodb controller is patched.

nampnguyen commented 4 months ago

@a-hilaly Saw the controllers PRs that included the 0.31.0 runtime get closed, is the fix being re-worked? Thanks!

a-hilaly commented 4 months ago

@nampnguyen Looks like we also needed https://github.com/aws-controllers-k8s/runtime/commit/861f7ed8ee62985c9a1ec7dd5e6a2a47de565e6c - we're rolling a new patch today :)

nampnguyen commented 3 months ago

@a-hilaly We tried the latest controllers using runtime 0.32.0, however are still seeing the issue. Debug logs from iam-controller 1.3.5:

{"level":"info","ts":"2024-03-11T13:47:53.770Z","logger":"setup","msg":"initializing service controller","aws.service":"iam"}
{"level":"debug","ts":"2024-03-11T13:47:53.770Z","logger":"cache.account","msg":"Starting shared informer for accounts cache","targetConfigMap":"ack-role-account-map"}
{"level":"debug","ts":"2024-03-11T13:47:53.770Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-03-11T13:47:53.788Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-03-11T13:47:53.788Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-13"}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"OpenIDConnectProvider","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Policy","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Role","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"User","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Group","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"InstanceProfile","resync period seconds":36000}
{"level":"info","ts":"2024-03-11T13:47:53.817Z","logger":"setup","msg":"starting manager","aws.service":"iam"}
...
{"level":"info","ts":"2024-03-11T13:47:53.928Z","msg":"Starting workers","controller":"policy","controllerGroup":"iam.services.k8s.aws","controllerKind":"Policy","worker count":1}
{"level":"debug","ts":"2024-03-11T13:47:53.936Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"> r.Sync","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> r.resetConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< r.resetConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.ResolveReferences","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< rm.ResolveReferences","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.EnsureTags","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< rm.EnsureTags","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.ReadOne","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">>> rm.sdkFind","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.944Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
// Config map read here
{"level":"debug","ts":"2024-03-11T13:47:53.972Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
// Role is empty
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<<< rm.sdkFind","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<< rm.ReadOne","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">> r.ensureConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">>> rm.IsSynced","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<<< rm.IsSynced","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<< r.ensureConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"< r.Sync","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"> r.patchResourceStatus","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">> kc.Patch (status)","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"patched resource status","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"json":"{\"metadata\":{\"resourceVersion\":\"462345426\"},\"spec\":{...},\"status\":{\"conditions\":[{\"message\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10\",\"status\":\"True\",\"type\":\"ACK.Recoverable\"},{\"lastTransitionTime\":\"2024-03-11T13:47:54Z\",\"message\":\"Unable to determine if desired resource state matches latest observed state\",\"reason\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10\",\"status\":\"Unknown\",\"type\":\"ACK.ResourceSynced\"}]}}"}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"<< kc.Patch (status)","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"< r.patchResourceStatus","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"error","ts":"2024-03-11T13:47:54.038Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"iam-role-1","namespace":"namespace-1"},"namespace":"namespace-1","name":"iam-role-1","reconcileID":"9d3c3474-c71e-4a60-a544-efe1bfe759c9","error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
// Role is empty
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"> r.Sync","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> r.resetConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< r.resetConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.ResolveReferences","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< rm.ResolveReferences","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.EnsureTags","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< rm.EnsureTags","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.ReadOne","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">>> rm.sdkFind","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<<< rm.sdkFind","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<< rm.ReadOne","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">> r.ensureConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">>> rm.IsSynced","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<<< rm.IsSynced","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<< r.ensureConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"< r.Sync","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"> r.patchResourceStatus","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">> kc.Patch (status)","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"patched resource status","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"json":"{\"metadata\":{\"resourceVersion\":\"462345427\"},\"spec\":{...},\"status\":{\"conditions\":[{\"message\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626\",\"status\":\"True\",\"type\":\"ACK.Recoverable\"},{\"lastTransitionTime\":\"2024-03-11T13:47:54Z\",\"message\":\"Unable to determine if desired resource state matches latest observed state\",\"reason\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626\",\"status\":\"Unknown\",\"type\":\"ACK.ResourceSynced\"}]}}"}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"<< kc.Patch (status)","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"< r.patchResourceStatus","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"error","ts":"2024-03-11T13:47:54.091Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"iam-role-2","namespace":"namespace-2"},"namespace":"namespace-2","name":"iam-role-2","reconcileID":"58482afd-89b1-45eb-8630-724deacf338c","error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

Even waiting several minutes, I'm not seeing the re-queuing behavior described in the PRs.

a-hilaly commented 3 months ago

@nampnguyen I will try to reproduce locally, maybe i'm missing something in here... qq: does the namespace have an owner-account-id annotation? the only case where controllers will skip the requeue is if the namespaces aren't annotated..

nampnguyen commented 3 months ago

@a-hilaly Yes, the namespaces do have the services.k8s.aws/owner-account-id annotation. Looking at the fix, I think the issue we're currently experiencing may have been introduced by https://github.com/aws-controllers-k8s/runtime/commit/24070d995cf503d77224b15771aa3c938fe6062c

If I'm reading this correctly, this line here assumes that if the resource's Account ID is the same as the controller's IRSA account ID, then CARM is not used.

https://github.com/aws-controllers-k8s/runtime/blob/20e1c0d714a90e806d885082922fc3cc8d5358c1/pkg/runtime/reconciler.go#L1029

In the most recent logs I shared, the resource's Account ID and Controller account are in fact the same, but we still need the CARM role pivot. I think this can be fixed by always checking for the namespace annotation and if present use CARM.

a-hilaly commented 3 months ago

@nampnguyen makes sense~ working on a fix! I think we might also want to enable CARM even in single-namespace watch mode.

If I'm reading this correctly, this line here assumes that if the resource's Account ID is the same as the controller's IRSA account ID, then CARM is not used.

Correct. Not sure, but this might have been the case even before.

mattzech commented 1 month ago

@a-hilaly @nampnguyen We are also facing this exact issue when using most recent version of iam-controller. Glad to hear a fix is being worked on. Thanks!

mumlawski commented 3 weeks ago

@a-hilaly we do still face this same issue even after your pull request from April. Just wondering if someone is looking at this problem still?

a-hilaly commented 3 weeks ago

@mattzech @mumlawski @nampnguyen Sorry folks I somehow got side tracked from this issue. This is a high priority issue and I i'll ship a fix for it ASAP.