aws-controllers-k8s / community

AWS Controllers for Kubernetes (ACK) is a project enabling you to manage AWS services from Kubernetes
https://aws-controllers-k8s.github.io/community/
Apache License 2.0

EKS stops reconciling after ACK.Terminal status condition #1844

Open tomitesh opened 1 year ago

tomitesh commented 1 year ago

Describe the bug

We created the cluster using the manifest below (note the role ARN provided in the cluster definition).

If the role does not exist yet when the cluster is created (a race condition), the cluster status shows an ACK.Terminal condition that never gets resolved, even though the role is created successfully 1-2 seconds later.

Both the eks and iam controllers are configured to reconcile every 10 to 20 seconds (configuration attached in the next section).

However, if I restart the eks controller by deleting its pod, it reconciles successfully and removes the ACK.Terminal condition. This workaround is not practical, as we cannot keep restarting the pod for every change in the YAML.
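For reference, the restart workaround amounts to the following (the namespace and deployment name are assumptions based on a default Helm installation; adjust for your release):

    # restarting the controller pod forces a fresh reconcile of all resources
    kubectl -n ack-system rollout restart deployment ack-eks-controller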

Steps to reproduce

Step 1: create the cluster first.
Step 2: create the role.

Cluster definition:

apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  annotations:
    services.k8s.aws/deletion-policy: delete
  finalizers:
  - finalizers.eks.services.k8s.aws/Cluster
  name: moon
  namespace: control
spec:
  kubernetesNetworkConfig:
    ipFamily: ipv4
    serviceIPv4CIDR: 172.20.0.0/16
  logging:
    clusterLogging:
    - enabled: true
      types:
      - api
      - audit
      - authenticator
      - controllerManager
      - scheduler
  name: moon
  resourcesVPCConfig:
    endpointPrivateAccess: true
    endpointPublicAccess: true
    publicAccessCIDRs:
    - 123.45.67.89/32
    securityGroupIDs:
    - sg-123
    subnetIDs:
    - subnet-123
    - subnet-456
    - subnet-789
  roleARN: arn:aws:iam::1234567890:role/moon-eks-cluster
  version: "1.25"
status:
  ackResourceMetadata:
    ownerAccountID: "1234567890"
    region: eu-central-1
  conditions:
  - message: |-
      InvalidParameterException: The provided role doesn't have the Amazon EKS Managed Policies associated with it. Please ensure the following policies [arn:aws:iam::aws:policy/AmazonEKSClusterPolicy] are attached
      {
        RespMetadata: {
          StatusCode: 400,
          RequestID: "aacb3dc6-6bdd-4031-a67e-ae6d461f7e4b"
        },
        ClusterName: "moon",
        Message_: "The provided role doesn't have the Amazon EKS Managed Policies associated with it. Please ensure the following policies [arn:aws:iam::aws:policy/AmazonEKSClusterPolicy] are attached"
      }
    status: "True"
    type: ACK.Terminal
  - lastTransitionTime: "2023-07-11T09:39:55Z"
    message: Resource not synced
    reason: resource is in terminal condition
    status: "False"
    type: ACK.ResourceSynced

Role definition:

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  annotations:
    services.k8s.aws/deletion-policy: delete
  finalizers:
  - finalizers.iam.services.k8s.aws/Role
  name: moon-eks-cluster
  namespace: control
spec:
  assumeRolePolicyDocument: |-
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "EKSClusterAssumeRole",
          "Effect": "Allow",
          "Principal": {
            "Service": "eks.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
  description: IAM role that is used by an eks cluster.
  inlinePolicies:
    cluster-elb-sl: |-
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Action": [
              "ec2:DescribeInternetGateways",
              "ec2:DescribeAddresses",
              "ec2:DescribeAccountAttributes"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": ""
          }
        ]
      }
  maxSessionDuration: 3600
  name: moon-eks-cluster
  path: /
  policies:
  - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
  - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
  - arn:aws:iam::aws:policy/AmazonEKSVPCResourceController

status:
  ackResourceMetadata:
    arn: arn:aws:iam::1234567890:role/moon-eks-cluster
    ownerAccountID: "1234567890"
    region: eu-central-1
  conditions:
  - lastTransitionTime: "2023-07-11T10:11:59Z"
    message: Late initialization successful
    reason: Late initialization successful
    status: "True"
    type: ACK.LateInitialized
  - lastTransitionTime: "2023-07-11T10:11:59Z"
    message: Resource synced successfully
    reason: ""
    status: "True"
    type: ACK.ResourceSynced
  createDate: "2023-07-11T09:39:54Z"
  roleID: XXXXXXXXXXXXXXXXX
  roleLastUsed: {}

Both the eks and iam controllers are configured to reconcile every 10 to 20 seconds.

eks controller Helm chart values:

    reconcile:
      resourceResyncPeriods: {
        Nodegroup: 10,
        Cluster: 20,
        Addon: 15
      }

iam controller Helm chart values:

    reconcile:
      resourceResyncPeriods: {
        Role: 10
      }
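For completeness, these values are passed at install time; a sketch of the install command (release name, namespace, and chart version are placeholders):

    helm upgrade --install ack-iam-controller \
      oci://public.ecr.aws/aws-controllers-k8s/iam-chart \
      --namespace ack-system --create-namespace \
      --version <chart-version> \
      -f values.yaml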

Expected outcome

Since the eks controller is configured to reconcile every 20 seconds, it should automatically sync in the next reconcile loop once the role is available.

Environment: dev

a-hilaly commented 1 year ago

This is very similar to https://github.com/aws-controllers-k8s/community/issues/1835 - we can definitely send a fix for this. WDYT @RedbackThomson

RedbackThomson commented 1 year ago

@a-hilaly I agree, these two are essentially the same issue.

@tomitesh We recommend that you use RoleRef instead of RoleARN if you are creating the role using the iam-controller. That way, the EKS controller will know to wait until the role has been created before it attempts to use it to create the cluster.
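For example, a minimal sketch of the same cluster using a reference instead of a literal ARN (ACK references resolve by name; the Role object is assumed to live in the same namespace):

apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: moon
  namespace: control
spec:
  # reference the Role resource instead of hard-coding roleARN;
  # the controller waits until the referenced Role exists and is synced
  roleRef:
    from:
      name: moon-eks-cluster
  # ...remaining fields (kubernetesNetworkConfig, logging, etc.) unchanged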

Also, the resourceResyncPeriods don't affect resources that reach terminal status. Terminal is designed to tell the controller to stop reconciling, since it believes it has hit an error condition it cannot recover from (without changes to the spec). The resync periods only control how long the controller waits before reconciling a resource again once it has reached the synced state (i.e., the AWS resource matches the spec).
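Given that, one way to nudge a terminally-stuck resource without restarting the controller is a harmless spec edit, so a new generation gets reconciled; a sketch, assuming the Cluster spec exposes a tags map (the tag key here is arbitrary):

    # bump spec.tags so metadata.generation changes and the
    # controller re-evaluates the terminal resource
    kubectl -n control patch clusters.eks.services.k8s.aws moon --type merge \
      -p '{"spec":{"tags":{"force-sync":"1"}}}'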

gecube commented 1 year ago

@RedbackThomson Hi! I totally agree with you, it makes sense. But I am seeing some weird behaviour from the IAM controller. For instance, I have this role:

apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  resourceVersion: '30438909'
  name: eks-production-nodegroup-role
  uid: d3357f19-54ff-4393-9466-461e81ec3a53
  creationTimestamp: '2023-07-20T09:00:42Z'
  generation: 4
  managedFields:
...
...
...
  namespace: infra-production
  finalizers:
    - finalizers.iam.services.k8s.aws/Role
  labels:
    kustomize.toolkit.fluxcd.io/name: infra-management
    kustomize.toolkit.fluxcd.io/namespace: flux-system
spec:
  assumeRolePolicyDocument: >-
    {"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}
  inlinePolicies: {}
  maxSessionDuration: 3600
  name: eks-production-nodegroup-role
  path: /
  policies:
    - 'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly'
    - 'arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy'
    - 'arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy'
    - 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore'
status:
  ackResourceMetadata:
    arn: 'arn:aws:iam::966321756598:role/eks-production-nodegroup-role'
    ownerAccountID: '966321756598'
    region: eu-west-2
  conditions:
    - lastTransitionTime: '2023-07-24T07:16:22Z'
      message: 'Late initialization did not complete, requeuing with delay of 5 seconds'
      reason: Delayed Late Initialization
      status: 'False'
      type: ACK.LateInitialized
    - lastTransitionTime: '2023-07-24T07:16:22Z'
      status: 'False'
      type: ACK.ResourceSynced
  createDate: '2023-07-20T09:00:42Z'
  roleID: AROA6B7KD3G3M4Q367RIW
  roleLastUsed:
    lastUsedDate: '2023-07-24T06:49:54Z'
    region: eu-west-2

It is created, but the status is stuck at ACK.ResourceSynced=False. The role can be used by its ARN directly, but not via RoleRef. Any suggestions? The IAM controller logs don't give any clue.
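For context, the reference that fails to resolve looks roughly like this in the Nodegroup spec (a sketch; the nodegroup name is illustrative, and nodeRoleRef is assumed to follow the usual ACK reference shape):

apiVersion: eks.services.k8s.aws/v1alpha1
kind: Nodegroup
metadata:
  name: production-nodegroup
  namespace: infra-production
spec:
  # resolves against the Role object shown above; stays unresolved
  # while that Role reports ACK.ResourceSynced=False
  nodeRoleRef:
    from:
      name: eks-production-nodegroup-role
  # ...remaining nodegroup fields omitted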

tomitesh commented 1 year ago

@RedbackThomson: I used RoleARN rather than RoleRef because RoleRef did not work for this scenario (disaster recovery / adopting an existing AWS resource):

  1. Create a role (moon-eks-cluster) with the annotation services.k8s.aws/deletion-policy: retain.
  2. Create a cluster (moon) with RoleRef and the same retain annotation.
  3. kubectl delete cluster moon -n control (this deletes the resource from k8s and keeps the AWS resource).
  4. kubectl delete role moon-eks-cluster -n control (this deletes the resource from k8s and keeps the AWS resource).
  5. Create the role and cluster again (same as steps 1 and 2).

After step 5, you will notice:

  1. The role syncs successfully (however, its status will not contain the role ARN).
  2. The cluster fails to sync because the role referenced via RoleRef has not had its status updated with the role ARN.

Note: we are using GitOps with Rancher Fleet to achieve the desired state, so we can't use an AdoptedResource, as it requires manual intervention.
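For reference, the adoption flow being ruled out here would look roughly like this (a sketch of ACK's AdoptedResource API; the names mirror the role above):

apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-moon-eks-cluster-role
  namespace: control
spec:
  aws:
    # name of the existing IAM role in AWS
    nameOrID: moon-eks-cluster
  kubernetes:
    group: iam.services.k8s.aws
    kind: Role
    metadata:
      name: moon-eks-cluster
      namespace: control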

ack-bot commented 6 months ago

Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

gecube commented 6 months ago

/remove-lifecycle stale

ack-bot commented 1 week ago

Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

gecube commented 1 week ago

/remove-lifecycle stale