[Bug] IAM Service Account creation retries fail because of cloudformation stack status

ndegory commented 2 years ago

What were you trying to accomplish?

An infrastructure provisioning workflow with two steps: first Terraform for IaaS resources, then eksctl for EKS-related resources. The Terraform job creates custom IAM policies that are used by the service accounts defined in the EKS cluster config. When the configuration is inconsistent between these two steps, the EKS job may fail. Recovering should only require fixing the configuration and running the pipeline again.

What happened?

Creation of the IAM resources and the Kubernetes service account with the eksctl create cluster or eksctl create iamserviceaccount command fails when pre-requisites are missing (for instance, an IAM policy). Fixing the Terraform configuration is enough for the Terraform job to create the pre-requisites, but when the EKS job runs again, it fails to recover.

This is caused by:

- the iamserviceaccount create command fails to notice that the IAM role has not been created, and exits without error as if the state described in the config file had been reached
- the first pipeline execution left the CloudFormation stack for the IAM role in the ROLLBACK_COMPLETE status, which prevents the creation of a new stack

So far the workaround is to run the delete iamserviceaccount command for the affected service account, and then run the create iamserviceaccount command again with the cluster config file, but this is not compatible with a declarative approach.
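For a pipeline, the workaround could be scripted roughly as follows. This is a minimal sketch, not part of eksctl: the helper names `stack_name` and `recovery_commands` are hypothetical, the stack naming pattern is inferred from the logs in this issue, and the `aws cloudformation wait stack-delete-complete` step is an added assumption, since the logs show the stack deletion running asynchronously.

```python
from typing import List, Optional


def stack_name(cluster: str, namespace: str, name: str) -> str:
    """Stack name for an IAM service account (pattern inferred from the logs in this issue)."""
    return f"eksctl-{cluster}-addon-iamserviceaccount-{namespace}-{name}"


def recovery_commands(status: Optional[str], region: str, cluster: str,
                      namespace: str, name: str, config_file: str) -> List[List[str]]:
    """Return the commands to run, in order, given the current stack status.

    `status` is None when no stack exists for the service account.
    """
    create = ["eksctl", "create", "iamserviceaccount",
              "--config-file", config_file,
              "--override-existing-serviceaccounts", "--approve"]
    if status == "CREATE_COMPLETE":
        return []  # the role already exists, nothing to do
    if status in (None, "DELETE_COMPLETE"):
        return [create]  # no live stack in the way
    if status == "ROLLBACK_COMPLETE":
        delete = ["eksctl", "delete", "iamserviceaccount",
                  "--region", region, "--cluster", cluster,
                  "--namespace", namespace, "--name", name]
        # Assumption: wait for the asynchronous deletion before re-creating.
        wait = ["aws", "cloudformation", "wait", "stack-delete-complete",
                "--region", region,
                "--stack-name", stack_name(cluster, namespace, name)]
        return [delete, wait, create]
    # In-progress or other failure states are deliberately not handled here.
    raise ValueError(f"unhandled stack status: {status}")
```

Each returned command is an argument list that could be passed to `subprocess.run(cmd, check=True)`. Keeping the decision separate from the execution makes the logic easy to test without AWS access.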

How to reproduce it?

Creation of a cluster, with a config file including an IAM service account referring to an IAM policy not yet created (the missing pre-requisite):

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cluster-1
  region: us-west-2
nodeGroups:
  - name: ng-1
    instanceType: m5a.large
    desiredCapacity: 1
iam:
  withOIDC: true
  serviceAccounts:
   - metadata:
       name: external-dns
       namespace: kube-system
     wellKnownPolicies:
       externalDNS: true
   - metadata:
       name: some-app
       namespace: default
     attachPolicyARNs:
       - "arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess"
   - metadata:
       name: another-app
       namespace: default
     attachPolicyARNs:
       - "arn:aws:iam::<ACCOUNT_ID>:policy/missing-policy-for-app"

Note the last service account (another-app): its policy has deliberately not been created.

Logs

➜ eksctl create cluster --config-file=./cluster.yaml
2022-03-19 13:43:21 [ℹ]  eksctl version 0.88.0
2022-03-19 13:43:21 [ℹ]  using region us-west-2
2022-03-19 13:43:21 [ℹ]  setting availability zones to [us-west-2b us-west-2d us-west-2a]
2022-03-19 13:43:21 [ℹ]  subnets for us-west-2b - public:192.168.0.0/19 private:192.168.96.0/19
2022-03-19 13:43:21 [ℹ]  subnets for us-west-2d - public:192.168.32.0/19 private:192.168.128.0/19
2022-03-19 13:43:21 [ℹ]  subnets for us-west-2a - public:192.168.64.0/19 private:192.168.160.0/19
2022-03-19 13:43:21 [ℹ]  nodegroup "ng-1" will use "ami-085e8e02353a59de5" [AmazonLinux2/1.21]
2022-03-19 13:43:21 [ℹ]  using Kubernetes version 1.21
2022-03-19 13:43:21 [ℹ]  creating EKS cluster "cluster-1" in "us-west-2" region with un-managed nodes
2022-03-19 13:43:21 [ℹ]  1 nodegroup (ng-1) was included (based on the include/exclude rules)
2022-03-19 13:43:21 [ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
2022-03-19 13:43:21 [ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2022-03-19 13:43:21 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=cluster-1'
2022-03-19 13:43:21 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "cluster-1" in "us-west-2"
2022-03-19 13:43:21 [ℹ]  CloudWatch logging will not be enabled for cluster "cluster-1" in "us-west-2"
2022-03-19 13:43:21 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=cluster-1'
2022-03-19 13:43:21 [ℹ]
2 sequential tasks: { create cluster control plane "cluster-1",
    2 sequential sub-tasks: {
        4 sequential sub-tasks: {
            wait for control plane to become ready,
            associate IAM OIDC provider,
            4 parallel sub-tasks: {
                2 sequential sub-tasks: {
                    create IAM role for serviceaccount "kube-system/external-dns",
                    create serviceaccount "kube-system/external-dns",
                },
                2 sequential sub-tasks: {
                    create IAM role for serviceaccount "default/some-app",
                    create serviceaccount "default/some-app",
                },
                2 sequential sub-tasks: {
                    create IAM role for serviceaccount "default/another-app",
                    create serviceaccount "default/another-app",
                },
                2 sequential sub-tasks: {
                    create IAM role for serviceaccount "kube-system/aws-node",
                    create serviceaccount "kube-system/aws-node",
                },
            },
            restart daemonset "kube-system/aws-node",
        },
        create nodegroup "ng-1",
    }
}
2022-03-19 13:43:21 [ℹ]  building cluster stack "eksctl-cluster-1-cluster"
2022-03-19 13:43:22 [ℹ]  deploying stack "eksctl-cluster-1-cluster"
2022-03-19 13:43:52 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:44:22 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:45:22 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:46:22 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:47:23 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:48:23 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:49:23 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:50:23 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:51:24 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:52:24 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:53:24 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:54:24 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-cluster"
2022-03-19 13:56:27 [ℹ]  building iamserviceaccount stack "eksctl-cluster-1-addon-iamserviceaccount-default-some-app"
2022-03-19 13:56:27 [ℹ]  building iamserviceaccount stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-external-dns"
2022-03-19 13:56:27 [ℹ]  building iamserviceaccount stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 13:56:27 [ℹ]  building iamserviceaccount stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-aws-node"
2022-03-19 13:56:27 [ℹ]  deploying stack "eksctl-cluster-1-addon-iamserviceaccount-default-some-app"
2022-03-19 13:56:27 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-some-app"
2022-03-19 13:56:27 [ℹ]  deploying stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-external-dns"
2022-03-19 13:56:27 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-external-dns"
2022-03-19 13:56:27 [ℹ]  deploying stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-aws-node"
2022-03-19 13:56:27 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-aws-node"
2022-03-19 13:56:27 [ℹ]  deploying stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 13:56:27 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 13:56:42 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-kube-system-aws-node"
2022-03-19 13:56:45 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 13:56:46 [✖]  unexpected status "ROLLBACK_COMPLETE" while waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 13:56:46 [ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
2022-03-19 13:56:46 [!]  AWS::IAM::Role/Role1: DELETE_IN_PROGRESS
2022-03-19 13:56:46 [✖]  AWS::IAM::Role/Role1: CREATE_FAILED – "Policy arn:aws:iam::<ACCOUNT_ID>:policy/missing-policy-for-app does not exist or is not attachable. (Service: AmazonIdentityManagement; Status Code: 404; Error Code: NoSuchEntity; Request ID: b6d80cfd-49a1-412a-a2d8-da3bba472374; Proxy: null)"
2022-03-19 13:56:46 [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2022-03-19 13:56:46 [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=cluster-1'
2022-03-19 13:56:46 [✖]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app": ResourceNotReady: failed waiting for successful resource state
Error: failed to create cluster "cluster-1"

## Cloudformation stack status

➜ aws cloudformation --region us-west-2 list-stacks | jq -r '.StackSummaries[] | select(.StackName == "eksctl-cluster-1-addon-iamserviceaccount-default-another-app")'
{
  "StackId": "arn:aws:cloudformation:us-west-2:<ACCOUNT_ID>:stack/eksctl-cluster-1-addon-iamserviceaccount-default-another-app/0ba361b0-a7c7-11ec-b389-06d4d5494b3b",
  "StackName": "eksctl-cluster-1-addon-iamserviceaccount-default-another-app",
  "TemplateDescription": "IAM role for serviceaccount \"default/another-app\" [created and managed by eksctl]",
  "CreationTime": "2022-03-19T20:56:27.761000+00:00",
  "DeletionTime": "2022-03-19T20:56:34.173000+00:00",
  "StackStatus": "ROLLBACK_COMPLETE",
  "DriftInformation": {
    "StackDriftStatus": "NOT_CHECKED"
  }
}

##  try to create the IAM role again:

➜ eksctl create iamserviceaccount --config-file cluster.yaml --override-existing-serviceaccounts
2022-03-19 14:06:11 [ℹ]  eksctl version 0.88.0
2022-03-19 14:06:11 [ℹ]  using region us-west-2
2022-03-19 14:06:12 [ℹ]  4 existing iamserviceaccount(s) (default/another-app,default/some-app,kube-system/aws-node,kube-system/external-dns) will be excluded
2022-03-19 14:06:12 [ℹ]  3 iamserviceaccounts (default/another-app, default/some-app, kube-system/external-dns) were excluded (based on the include/exclude rules)
2022-03-19 14:06:12 [!]  metadata of serviceaccounts that exist in Kubernetes will be updated, as --override-existing-serviceaccounts was set
2022-03-19 14:06:12 [ℹ]  no tasks

# Requires deleting it first

➜ eksctl delete iamserviceaccount --region us-west-2 --cluster cluster-1 --namespace default --name another-app
2022-03-19 14:11:35 [ℹ]  eksctl version 0.88.0
2022-03-19 14:11:35 [ℹ]  using region us-west-2
2022-03-19 14:11:35 [ℹ]  1 iamserviceaccount (default/another-app) was included (based on the include/exclude rules)
2022-03-19 14:11:36 [ℹ]  1 task: {
    2 sequential sub-tasks: {
        delete IAM role for serviceaccount "default/another-app" [async],
        delete serviceaccount "default/another-app",
    } }2022-03-19 14:11:36 [ℹ]  will delete stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 14:11:36 [ℹ]  serviceaccount "default/another-app" was already deleted

# Cloudformation stack is now in DELETE_COMPLETE status

➜ aws cloudformation --region us-west-2 list-stacks | jq -r '.StackSummaries[] | select(.StackName == "eksctl-cluster-1-addon-iamserviceaccount-default-another-app")'
{
  "StackId": "arn:aws:cloudformation:us-west-2:<ACCOUNT_ID>:stack/eksctl-cluster-1-addon-iamserviceaccount-default-another-app/0ba361b0-a7c7-11ec-b389-06d4d5494b3b",
  "StackName": "eksctl-cluster-1-addon-iamserviceaccount-default-another-app",
  "TemplateDescription": "IAM role for serviceaccount \"default/another-app\" [created and managed by eksctl]",
  "CreationTime": "2022-03-19T20:56:27.761000+00:00",
  "DeletionTime": "2022-03-19T21:11:36.669000+00:00",
  "StackStatus": "DELETE_COMPLETE",
  "DriftInformation": {
    "StackDriftStatus": "NOT_CHECKED"
  }
}

# upgrade cluster won't create the resources

➜ eksctl upgrade cluster --config-file cluster.yaml
2022-03-19 14:13:10 [ℹ]  eksctl version 0.88.0
2022-03-19 14:13:10 [ℹ]  using region us-west-2
2022-03-19 14:13:10 [!]  NOTE: cluster VPC (subnets, routing & NAT Gateway) configuration changes are not yet implemented
2022-03-19 14:13:12 [ℹ]  no cluster version update required
2022-03-19 14:13:12 [ℹ]  re-building cluster stack "eksctl-cluster-1-cluster"
2022-03-19 14:13:12 [✔]  all resources in cluster stack "eksctl-cluster-1-cluster" are up-to-date

# but create iamserviceaccount will

➜ eksctl create iamserviceaccount --config-file cluster.yaml --override-existing-serviceaccounts --approve
2022-03-19 14:13:49 [ℹ]  eksctl version 0.88.0
2022-03-19 14:13:49 [ℹ]  using region us-west-2
2022-03-19 14:13:50 [ℹ]  3 existing iamserviceaccount(s) (default/some-app,kube-system/aws-node,kube-system/external-dns) will be excluded
2022-03-19 14:13:50 [ℹ]  1 iamserviceaccount (default/another-app) was included (based on the include/exclude rules)
2022-03-19 14:13:50 [ℹ]  2 iamserviceaccounts (default/some-app, kube-system/external-dns) were excluded (based on the include/exclude rules)
2022-03-19 14:13:50 [!]  metadata of serviceaccounts that exist in Kubernetes will be updated, as --override-existing-serviceaccounts was set
2022-03-19 14:13:50 [ℹ]  1 task: {
    2 sequential sub-tasks: {
        create IAM role for serviceaccount "default/another-app",
        create serviceaccount "default/another-app",
    } }2022-03-19 14:13:50 [ℹ]  building iamserviceaccount stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 14:13:50 [ℹ]  deploying stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 14:13:50 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 14:14:09 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 14:14:27 [ℹ]  waiting for CloudFormation stack "eksctl-cluster-1-addon-iamserviceaccount-default-another-app"
2022-03-19 14:14:28 [ℹ]  created serviceaccount "default/another-app"

Versions

➜ eksctl info
eksctl version: 0.88.0
kubectl version: v1.23.5
OS: darwin

I'll proceed with a PR that implements a more reliable workflow.

Himangini commented 2 years ago

@ndegory Thanks for opening the detailed issue and a follow-up PR ⭐ We will review the issue and PR soon. 👍🏻

Skarlso commented 2 years ago

Hi @ndegory.

You are correct. eksctl is not declarative. It's imperative. Meaning, you have to run a delete and then a create if something went wrong. It won't detect existing things.

ndegory commented 2 years ago

@Skarlso, fair enough, but right now, under certain conditions, the iamserviceaccount command lets you think the cluster state matches the specification in the cluster config YAML file (because there are no actions and no errors), although some resources are not in the expected state. This action is not atomic, which is problematic for an imperative command. The PR I submitted for this ticket may go too far (I understand there may be concerns about deleting a previously failed stack). If that's the case, please tell me, and I'll reduce the scope to only adding more checks for a better output in that particular use case.

Skarlso commented 2 years ago

@ndegory We decided to pull this into planning and will think of a nice solution that still leaves the command consistent with other commands. :) There was a similar issue previously that deals with the nature of iamserviceaccount commands here: https://github.com/weaveworks/eksctl/issues/4941. The problem itself is different, but it is similar in that existing or non-existing resources throw off the create command and leave things in an inconsistent state.

Maybe we can still do something here that will not result in a problematic environment or is more user friendly. We'll discuss this with the team.

Himangini commented 2 years ago

We will reproduce this on our side and help with the PR for adding the validations

Skarlso commented 2 years ago

@ndegory Ok, so, the decision is as follows... create will still not be very aware of the circumstances and the infrastructure you run it on. If there are things that you created outside of eksctl, they won't really matter to eksctl. Again, it's not a declarative tool.

That said! We can certainly improve upon this part:

"Creation of a cluster, with a config file including an IAM service account referring to an IAM policy not yet created (the missing pre-requisite):"

Mainly this: (the missing pre-requisite).

Are you willing to adjust your PR to check that this resource exists and only proceed if it does? :)

ndegory commented 2 years ago

@Skarlso, yes, I can give it a try: checking that the IAM policy exists when the spec specifies an IAM policy attachment.
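To make the proposed check concrete, here is a rough sketch of a pre-flight validation done outside eksctl. It is not the code from the PR. It assumes the cluster config has already been parsed into a dictionary, the function names are hypothetical, and existence is probed with `aws iam get-policy`, whose non-zero exit is treated as "missing" (a permissions error would look the same, so this is only an approximation).

```python
import subprocess
from typing import Callable, Dict, List


def attached_policy_arns(config: dict) -> Dict[str, List[str]]:
    """Map "namespace/name" to the attachPolicyARNs of each service account that has any."""
    result: Dict[str, List[str]] = {}
    for sa in config.get("iam", {}).get("serviceAccounts", []):
        meta = sa.get("metadata", {})
        arns = sa.get("attachPolicyARNs", [])
        if arns:
            result[f"{meta.get('namespace')}/{meta.get('name')}"] = list(arns)
    return result


def policy_exists(arn: str) -> bool:
    """Probe IAM for the policy; any failure of the CLI call counts as missing."""
    proc = subprocess.run(["aws", "iam", "get-policy", "--policy-arn", arn],
                          capture_output=True)
    return proc.returncode == 0


def missing_policies(config: dict,
                     exists: Callable[[str], bool] = policy_exists) -> Dict[str, List[str]]:
    """Return, per service account, the policy ARNs that could not be found."""
    missing: Dict[str, List[str]] = {}
    for account, arns in attached_policy_arns(config).items():
        absent = [arn for arn in arns if not exists(arn)]
        if absent:
            missing[account] = absent
    return missing
```

A pipeline could call `missing_policies` before `eksctl create iamserviceaccount` and stop early when the result is non-empty. The `exists` parameter is injectable so the logic can be exercised without AWS credentials.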

I would also like to add one more thing, which was partly covered by the current PR: react differently when there's an existing CloudFormation stack for that IAM service account, and exit with an error when that existing stack is in the ROLLBACK_COMPLETE status, instead of the current behavior, which ignores it and considers that all is well and there's nothing to do. Would that be ok for you?

Skarlso commented 2 years ago

@ndegory Sadly, that would be going too far. I mean, that could result in detecting a stack which isn't the stack you want, or which happens to be there because it has the same name. If a stack had been created during the create and had failed, the create would have failed, right?

Or are you saying that the create just happily chugged on even if the stack failed to reach CREATE_COMPLETE? If there was a stack created separately from iamserviceaccount create, that would be outside the scope of the create command.

ndegory commented 2 years ago

@Skarlso, correct, a stack created by a previous call to iamserviceaccount create. This is the Cloud, so we know issues can arise (networking, API errors, etc.), resulting in a failure that could be resolved if retried later. The current problem with this command is that once it fails, retrying it without first deleting the rolled-back stack doesn't complain about the state of the stack; it considers that no action has to be performed. My suggestion is that the iamserviceaccount create command should detect that something's wrong and alert the user that something has to be done (such as a manual deletion of the rolled-back stack).

Skarlso commented 2 years ago

"My suggestion is that the iamserviceaccount create command should detect that something's wrong and alert the user that something has to be done (such as a manual deletion of the rolled back stack)."

Yeah, okay, that is a fair point. I agree to that. Thanks for the explanation!

What I was trying to convey is that it should warn the user and not attempt to remedy the situation. Is that okay?
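As an illustration of the warn-only behavior agreed on here, the following sketch scans the JSON printed by `aws cloudformation list-stacks` and reports rolled-back iamserviceaccount stacks without touching them. It is a hypothetical external helper, not eksctl code; the function name and the warning wording are made up, and the stack name prefix is inferred from the logs in this issue.

```python
import json
from typing import List


def rolled_back_warnings(list_stacks_json: str, cluster: str) -> List[str]:
    """Return one warning per iamserviceaccount stack left in ROLLBACK_COMPLETE.

    Only reports; it never deletes or modifies a stack.
    """
    prefix = f"eksctl-{cluster}-addon-iamserviceaccount-"
    warnings: List[str] = []
    for summary in json.loads(list_stacks_json).get("StackSummaries", []):
        name = summary.get("StackName", "")
        if name.startswith(prefix) and summary.get("StackStatus") == "ROLLBACK_COMPLETE":
            warnings.append(
                f'stack "{name}" is in ROLLBACK_COMPLETE; '
                "delete it before retrying the create"
            )
    return warnings
```

A wrapper script could print these warnings and exit non-zero, so a pipeline run does not silently report success when a role was never created.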

ndegory commented 2 years ago

we're aligned!

Skarlso commented 2 years ago

Excellent! :)

Himangini commented 2 years ago

The PR in question is in draft; moving this ticket to the Blocked column until the PR is ready for review.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ndegory commented 2 years ago

still topical

Himangini commented 2 years ago

@ndegory Are you still working on this? You had a PR open, but it seems to be closed.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.
