Open tomitesh opened 1 year ago
This is very similar to https://github.com/aws-controllers-k8s/community/issues/1835 - we can definitely send a fix for this. WDYT @RedbackThomson
@a-hilaly I agree, these two are essentially the same issue.
@tomitesh We recommend that you use RoleRef instead of RoleARN if you are creating the role using the iam-controller. That way, the EKS controller will know to wait until the role has been created before it attempts to use it to create the cluster.
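For example, a Cluster can point at the Role resource by name via roleRef instead of pinning the ARN. The resource names below are illustrative, and the reference shape follows the usual ACK `from.name` wrapper; verify against your installed CRD:

```yaml
apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: production
spec:
  name: production
  # resolved only once the referenced Role reports ACK.ResourceSynced=True
  roleRef:
    from:
      name: eks-production-cluster-role
```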
Also, the resourceResyncPeriods don't affect resources that reach a terminal status. Terminal is designed to tell the controller to stop reconciling, because the controller believes it has hit an error condition it cannot recover from (without changes to the spec). The resync periods apply to resources that have reached the synced state (the AWS resource matches the spec), and control how long to wait before attempting to reconcile again.
@RedbackThomson Hi! I totally agree with you; it makes sense. But the issue is that I am facing some weird behaviour from the IAM controller. For instance, I have a role:
```yaml
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  resourceVersion: '30438909'
  name: eks-production-nodegroup-role
  uid: d3357f19-54ff-4393-9466-461e81ec3a53
  creationTimestamp: '2023-07-20T09:00:42Z'
  generation: 4
  managedFields:
    ...
  namespace: infra-production
  finalizers:
    - finalizers.iam.services.k8s.aws/Role
  labels:
    kustomize.toolkit.fluxcd.io/name: infra-management
    kustomize.toolkit.fluxcd.io/namespace: flux-system
spec:
  assumeRolePolicyDocument: >-
    {"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}
  inlinePolicies: {}
  maxSessionDuration: 3600
  name: eks-production-nodegroup-role
  path: /
  policies:
    - 'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly'
    - 'arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy'
    - 'arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy'
    - 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore'
status:
  ackResourceMetadata:
    arn: 'arn:aws:iam::966321756598:role/eks-production-nodegroup-role'
    ownerAccountID: '966321756598'
    region: eu-west-2
  conditions:
    - lastTransitionTime: '2023-07-24T07:16:22Z'
      message: 'Late initialization did not complete, requeuing with delay of 5 seconds'
      reason: Delayed Late Initialization
      status: 'False'
      type: ACK.LateInitialized
    - lastTransitionTime: '2023-07-24T07:16:22Z'
      status: 'False'
      type: ACK.ResourceSynced
  createDate: '2023-07-20T09:00:42Z'
  roleID: AROA6B7KD3G3M4Q367RIW
  roleLastUsed:
    lastUsedDate: '2023-07-24T06:49:54Z'
    region: eu-west-2
```
The role is created, but its status still shows ACK.ResourceSynced=False. The role can be used by ARN directly, but not via RoleRef.
Any suggestions? The logs of the IAM controller don't give any clue.
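For what it's worth, ACK resolves references against the referenced resource's ACK.ResourceSynced condition, which would explain why the raw ARN works while RoleRef does not for a role stuck at ResourceSynced=False. A minimal sketch of that gating check (my own hypothetical helper, not the controller's actual code):

```python
def is_synced(resource: dict) -> bool:
    """Return True only if the resource's ACK.ResourceSynced condition is 'True'."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "ACK.ResourceSynced":
            return cond.get("status") == "True"
    return False

# Status as pasted above: the role exists in AWS, but is not "synced",
# so a RoleRef pointing at it will not resolve yet.
role = {"status": {"conditions": [
    {"type": "ACK.LateInitialized", "status": "False"},
    {"type": "ACK.ResourceSynced", "status": "False"},
]}}
print(is_synced(role))  # → False
```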
@RedbackThomson: I used RoleARN and not RoleRef because RoleRef did not work for this scenario (disaster recovery / adopting an existing AWS resource).
After step 5, you will notice
Note: we are using GitOps and Rancher Fleet to achieve the desired state, so we can't use an AdoptedResource as it requires manual intervention.
Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale
/remove-lifecycle stale
Describe the bug
We created a cluster using the details below (please note: the role ARN is provided in the cluster definition).
If the role does not exist when the cluster is created (a race condition), the cluster status shows an ACK.Terminal condition that is never resolved, even though the role is created successfully within the next 1-2 seconds.
Both the eks and iam controllers are configured to reconcile every 10 to 20 seconds (configuration attached in the next section).
However, if I restart the eks controller by deleting its pod, it reconciles successfully and removes the ACK.Terminal condition. This workaround is not practical, as we cannot keep restarting the pod for every change in YAML.
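This matches how Terminal is described earlier in the thread: once set, the resource is not requeued until its spec changes, so the role appearing a second later goes unnoticed. A toy illustration of that gating (my own sketch, not the controller's actual code):

```python
def needs_requeue(conditions: dict, spec_changed: bool) -> bool:
    """A resource marked ACK.Terminal is only reconciled again after a
    spec change; external events (like the role appearing) don't count."""
    if conditions.get("ACK.Terminal") == "True":
        return spec_changed
    return True

# The race described above: Terminal was set because the role was missing,
# and the role being created moments later is not a spec change.
print(needs_requeue({"ACK.Terminal": "True"}, spec_changed=False))  # → False
```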
Steps to reproduce
Step 1: create the cluster first
Step 2: create the role
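The cluster manifest referenced above wasn't preserved in this thread; an illustrative ACK Cluster that pins the role by ARN would look roughly like the following (account ID, names, and subnets are hypothetical; verify field names against your installed CRD):

```yaml
apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: production
spec:
  name: production
  # if this role doesn't exist yet when the create call is made,
  # the cluster is marked ACK.Terminal
  roleARN: arn:aws:iam::111122223333:role/eks-production-cluster-role
  resourcesVPCConfig:
    subnetIDs:
      - subnet-0abc0abc0abc0abc0
      - subnet-0def0def0def0def0
```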
role definition
Both the eks and iam controllers are configured to reconcile every 10 to 20 seconds.
i.e. the eks helm chart values used when installing the controller:
and the iam helm chart values used when installing the controller:
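The values files themselves weren't attached here; in the ACK helm charts, resync periods are typically configured under the `reconcile` key. The exact key names below are an assumption from memory, so check the `values.yaml` of your chart version:

```yaml
reconcile:
  # default resync period in seconds, applied to all kinds (assumed key name)
  defaultResyncPeriod: 20
  # per-kind overrides in seconds (assumed key name)
  resourceResyncPeriods:
    Cluster: 10
```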
Expected outcome
As the eks controller is configured to reconcile every 20 seconds, it should automatically sync in the next reconcile loop after the role is available.
Environment
dev