eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

Unable to launch eks nodegroups from inside docker container using eksctl #1969

Closed · vineet-krishna closed this issue 4 years ago

vineet-krishna commented 4 years ago

What happened? I am trying to launch multiple nodegroups in parallel right after creating an EKS cluster.

#!/usr/bin/env bash

set -ex

# Create the control plane first; the nodegroups depend on it.
eksctl create cluster -f cluster.yml

# Launch all nodegroup creations in parallel as background jobs.
eksctl create nodegroup -f ng-app.yml &
eksctl create nodegroup -f ng-default.yml &
eksctl create nodegroup -f ng-kiam.yml &
eksctl create nodegroup -f ng-monitoring.yml &
eksctl create nodegroup -f ng-orch-compute.yml &

# Block until every background job has finished.
wait

set +ex

When this script runs as the Docker entrypoint, only one of the nodegroups gets created and the container keeps running indefinitely.

The same script runs fine when I run it directly in my terminal.

How to reproduce it? Use the following Dockerfile, add the entrypoint script shown above, and include the required YAML files: cluster.yml, ng-app.yml, ng-default.yml, ng-kiam.yml, ng-monitoring.yml, ng-orch-compute.yml.

FROM ubuntu:latest

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \
  apt-get -y upgrade

RUN apt-get install -y --no-install-recommends \
  git \
  curl \
  openssh-server \
  ssh-client \
  awscli

# Install kubectl binary
RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.15.0/bin/linux/amd64/kubectl \
  && chmod +x ./kubectl \
  && mv ./kubectl /usr/local/bin/kubectl

# Install eksctl
RUN curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" \
  | tar xz -C /tmp \
  && mv /tmp/eksctl /usr/local/bin

# Copy the build context (entrypoint script and cluster/nodegroup YAML files) into the image root
COPY . /.

RUN chmod +x /docker-entrypoint.sh

# Shell-form ENTRYPOINT: the script is invoked via /bin/sh -c
ENTRYPOINT /docker-entrypoint.sh
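As an aside, the "Invalid choice: 'eks'" error in the logs below most likely comes from the Ubuntu-packaged awscli predating the eks subcommand that the generated kubeconfig invokes. A minimal sketch of a fix, assuming a pip-installed awscli is acceptable (and dropping awscli from the apt-get install list above):

# The apt-packaged awscli can predate the 'eks' subcommand; a pip
# install pulls a recent release that supports 'aws eks get-token'.
RUN apt-get install -y --no-install-recommends python3-pip \
  && pip3 install --upgrade awscli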

Versions

$ eksctl version
0.15.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Logs

+ eksctl create cluster -f nodegroups/dr/cluster.yml
[ℹ]  eksctl version 0.15.0
[ℹ]  using region ap-southeast-1
[✔]  using existing VPC (vpc-2586eb42) and subnets (private:[subnet-0b710a88e8b97b5c4 subnet-0c82eec08a0470b9d] public:[subnet-032c5c35cf3d4bef4 subnet-0eb22060623eca6d8])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  using Kubernetes version 1.14
[ℹ]  creating EKS cluster "dr-eks" in "ap-southeast-1" region with
[ℹ]  will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=ap-southeast-1 --cluster=dr-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "dr-eks" in "ap-southeast-1"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=ap-southeast-1 --cluster=dr-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "dr-eks" in "ap-southeast-1"
[ℹ]  1 task: { create cluster control plane "dr-eks" }
[ℹ]  building cluster stack "eksctl-dr-eks-cluster"
[ℹ]  deploying stack "eksctl-dr-eks-cluster"
[✔]  all EKS cluster resources for "dr-eks" have been created
[✔]  saved kubeconfig as "/root/.kube/config"
[✖]  unable to use kubectl with the EKS cluster (check 'kubectl version'): usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: argument command: Invalid choice, valid choices are:

acm                                      | alexaforbusiness
apigateway                               | application-autoscaling
appstream                                | appsync
athena                                   | autoscaling
autoscaling-plans                        | batch
budgets                                  | ce
cloud9                                   | clouddirectory
cloudformation                           | cloudfront
cloudhsm                                 | cloudhsmv2
cloudsearch                              | cloudsearchdomain
cloudtrail                               | cloudwatch
codebuild                                | codecommit
codepipeline                             | codestar
cognito-identity                         | cognito-idp
cognito-sync                             | comprehend
cur                                      | datapipeline
dax                                      | devicefarm
directconnect                            | discovery
dms                                      | ds
dynamodb                                 | dynamodbstreams
ec2                                      | ecr
ecs                                      | efs
elasticache                              | elasticbeanstalk
elastictranscoder                        | elb
elbv2                                    | emr
es                                       | events
firehose                                 | gamelift
glacier                                  | glue
greengrass                               | guardduty
health                                   | iam
importexport                             | inspector
iot                                      | iot-data
iot-jobs-data                            | kinesis
kinesis-video-archived-media             | kinesis-video-media
kinesisanalytics                         | kinesisvideo
kms                                      | lambda
lex-models                               | lex-runtime
lightsail                                | logs
machinelearning                          | marketplace-entitlement
marketplacecommerceanalytics             | mediaconvert
medialive                                | mediapackage
mediastore                               | mediastore-data
meteringmarketplace                      | mgh
mobile                                   | mq
mturk                                    | opsworks
opsworkscm                               | organizations
pinpoint                                 | polly
pricing                                  | rds
redshift                                 | rekognition
resource-groups                          | resourcegroupstaggingapi
route53                                  | route53domains
sagemaker                                | sagemaker-runtime
sdb                                      | serverlessrepo
servicecatalog                           | servicediscovery
ses                                      | shield
sms                                      | snowball
sns                                      | sqs
ssm                                      | stepfunctions
storagegateway                           | sts
support                                  | swf
transcribe                               | translate
waf                                      | waf-regional
workdocs                                 | workmail
workspaces                               | xray
s3api                                    | s3
configure                                | deploy
configservice                            | opsworks-cm
runtime.sagemaker                        | history
help

Invalid choice: 'eks', maybe you meant:

  * es
Unable to connect to the server: getting credentials: exec: exit status 2

[ℹ]  cluster should be functional despite missing (or misconfigured) client binaries
[✔]  EKS cluster "dr-eks" in "ap-southeast-1" region is ready
+ date
Tue Mar 24 06:28:27 UTC 2020
++ ls nodegroups/dr/ng-app.yml nodegroups/dr/ng-default.yml nodegroups/dr/ng-kiam.yml nodegroups/dr/ng-monitoring.yml nodegroups/dr/ng-orch-compute.yml
+ for ng in $(ls nodegroups/dr/ng-*.yml)
+ for ng in $(ls nodegroups/dr/ng-*.yml)
+ eksctl create nodegroup -f nodegroups/dr/ng-app.yml
+ for ng in $(ls nodegroups/dr/ng-*.yml)
+ eksctl create nodegroup -f nodegroups/dr/ng-default.yml
+ for ng in $(ls nodegroups/dr/ng-*.yml)
+ eksctl create nodegroup -f nodegroups/dr/ng-kiam.yml
+ for ng in $(ls nodegroups/dr/ng-*.yml)
+ wait
+ eksctl create nodegroup -f nodegroups/dr/ng-monitoring.yml
+ eksctl create nodegroup -f nodegroups/dr/ng-orch-compute.yml
[ℹ]  eksctl version 0.15.0
[ℹ]  using region ap-southeast-1
[ℹ]  eksctl version 0.15.0
[ℹ]  using region ap-southeast-1
[ℹ]  eksctl version 0.15.0
[ℹ]  using region ap-southeast-1
[ℹ]  eksctl version 0.15.0
[ℹ]  using region ap-southeast-1
[ℹ]  eksctl version 0.15.0
[ℹ]  using region ap-southeast-1
[ℹ]  will use version 1.14 for new nodegroup(s) based on control plane version
[ℹ]  will use version 1.14 for new nodegroup(s) based on control plane version
[ℹ]  will use version 1.14 for new nodegroup(s) based on control plane version
[ℹ]  will use version 1.14 for new nodegroup(s) based on control plane version
[ℹ]  will use version 1.14 for new nodegroup(s) based on control plane version
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: ef93d7ba-19e3-4fd0-88fe-04f2f20930f2) from cloudformation/ListStacks - will retry after delay of 595.387637ms
[ℹ]  nodegroup "ng-default-v1" will use "ami-024ce53b56277d5d5" [AmazonLinux2/1.14]
[ℹ]  nodegroup "ng-kiam-v1" will use "ami-024ce53b56277d5d5" [AmazonLinux2/1.14]
[ℹ]  nodegroup "ng-monitoring-v1" will use "ami-024ce53b56277d5d5" [AmazonLinux2/1.14]
[ℹ]  nodegroup "ng-orch-compute-v1" will use "ami-024ce53b56277d5d5" [AmazonLinux2/1.14]
[ℹ]  using EC2 key pair "kubernetes.prod.vpc.mindtickle.com-b9:75:5f:af:aa:21:30:e7:28:ed:fe:ba:3a:ae:a2:76"
[ℹ]  using EC2 key pair "kubernetes.prod.vpc.mindtickle.com-b9:75:5f:af:aa:21:30:e7:28:ed:fe:ba:3a:ae:a2:76"
[ℹ]  using EC2 key pair "kubernetes.prod.vpc.mindtickle.com-b9:75:5f:af:aa:21:30:e7:28:ed:fe:ba:3a:ae:a2:76"
[ℹ]  using EC2 key pair "kubernetes.prod.vpc.mindtickle.com-b9:75:5f:af:aa:21:30:e7:28:ed:fe:ba:3a:ae:a2:76"
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: 7f63744c-b3e6-446e-937f-1356f8dbd04d) from cloudformation/ListStacks - will retry after delay of 963.209508ms
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: a5025356-3250-441a-89de-193f13c1d791) from cloudformation/ListStacks - will retry after delay of 588.882978ms
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: d6da1d46-6ff5-4aaf-a212-93f6c9b39c9d) from cloudformation/ListStacks - will retry after delay of 530.494431ms
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: a200ffd8-90a1-445f-a831-8e37c0a5d5ad) from cloudformation/ListStacks - will retry after delay of 519.513734ms
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: 7566d076-8459-4572-99fb-e5777c5d3b92) from cloudformation/ListStacks - will retry after delay of 1.173720824s
[ℹ]  1 nodegroup (ng-monitoring-v1) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "dr-eks"
[ℹ]  2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "ng-monitoring-v1" } } }
[ℹ]  checking cluster stack for missing resources
[ℹ]  1 nodegroup (ng-kiam-v1) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "dr-eks"
[ℹ]  2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "ng-kiam-v1" } } }
[ℹ]  checking cluster stack for missing resources
[ℹ]  1 nodegroup (ng-default-v1) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "dr-eks"
[ℹ]  2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "ng-default-v1" } } }
[ℹ]  checking cluster stack for missing resources
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: d3b6952e-96be-46af-8344-ebfab654af74) from cloudformation/ListStacks - will retry after delay of 1.580097402s
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  re-building cluster stack "eksctl-dr-eks-cluster"
[ℹ]  updating stack to add new resources [PolicyCloudWatchMetrics PolicyNLB ServiceRole] and outputs []
[ℹ]  re-building cluster stack "eksctl-dr-eks-cluster"
[ℹ]  updating stack to add new resources [PolicyCloudWatchMetrics PolicyNLB ServiceRole] and outputs []
[ℹ]  re-building cluster stack "eksctl-dr-eks-cluster"
[ℹ]  updating stack to add new resources [PolicyCloudWatchMetrics PolicyNLB ServiceRole] and outputs []
[!]  retryable error (Throttling: Rate exceeded
    status code: 400, request id: 97d9c4b0-90ac-4b8d-914e-5a679d552249) from cloudformation/ListStacks - will retry after delay of 2.254277568s
[ℹ]  1 nodegroup (ng-orch-compute-v1) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "dr-eks"
[ℹ]  2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "ng-orch-compute-v1" } } }
[ℹ]  checking cluster stack for missing resources
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  re-building cluster stack "eksctl-dr-eks-cluster"
[ℹ]  updating stack to add new resources [PolicyCloudWatchMetrics PolicyNLB ServiceRole] and outputs []
[ℹ]  nodegroup "ng-app-v2" will use "ami-024ce53b56277d5d5" [AmazonLinux2/1.14]
[ℹ]  using EC2 key pair "kubernetes.prod.vpc.mindtickle.com-b9:75:5f:af:aa:21:30:e7:28:ed:fe:ba:3a:ae:a2:76"
[ℹ]  1 nodegroup (ng-app-v2) was included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for each of 1 nodegroups in cluster "dr-eks"
[ℹ]  2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create nodegroup "ng-app-v2" } } }
[ℹ]  checking cluster stack for missing resources
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack
[ℹ]  re-building cluster stack "eksctl-dr-eks-cluster"
[ℹ]  updating stack to add new resources [PolicyCloudWatchMetrics PolicyNLB ServiceRole] and outputs []
[!]  error executing Cloudformation changeSet eksctl-update-cluster-1585031312 in stack eksctl-dr-eks-cluster. Check the Cloudformation console for further details
[ℹ]  1 error(s) occurred and nodegroups haven't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete nodegroup --region=ap-southeast-1 --cluster=dr-eks --name=<name>' for each of the failed nodegroup
[✖]  executing CloudFormation ChangeSet "eksctl-update-cluster-1585031312" for stack "eksctl-dr-eks-cluster": ChangeSetNotFound: ChangeSet [eksctl-update-cluster-1585031312] does not exist
    status code: 404, request id: cb56d0d5-a96f-4075-9510-dab311f7a76f
Error: failed to create nodegroups for cluster "dr-eks"
Error: failed to create nodegroups for cluster "dr-eks"
[!]  error executing Cloudformation changeSet eksctl-update-cluster-1585031312 in stack eksctl-dr-eks-cluster. Check the Cloudformation console for further details
[ℹ]  1 error(s) occurred and nodegroups haven't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete nodegroup --region=ap-southeast-1 --cluster=dr-eks --name=<name>' for each of the failed nodegroup
[✖]  executing CloudFormation ChangeSet "eksctl-update-cluster-1585031312" for stack "eksctl-dr-eks-cluster": ChangeSetNotFound: ChangeSet [eksctl-update-cluster-1585031312] does not exist
    status code: 404, request id: 9eba20af-474d-4685-a4f9-03222558db21
[!]  retryable error (RequestError: send request failed
caused by: Post https://cloudformation.ap-southeast-1.amazonaws.com/: EOF) from cloudformation/DescribeChangeSet - will retry after delay of 40.890236ms
[ℹ]  building nodegroup stack "eksctl-dr-eks-nodegroup-ng-monitoring-v1"
[ℹ]  deploying stack "eksctl-dr-eks-nodegroup-ng-monitoring-v1"
[ℹ]  adding identity "arn:aws:iam::191195949309:role/devops-eks-prod-node" to auth ConfigMap
[ℹ]  nodegroup "ng-monitoring-v1" has 0 node(s)
[ℹ]  waiting for at least 2 node(s) to become ready in "ng-monitoring-v1"
[ℹ]  nodegroup "ng-monitoring-v1" has 2 node(s)
[ℹ]  node "ip-10-10-163-45.ap-southeast-1.compute.internal" is ready
[ℹ]  node "ip-10-10-187-228.ap-southeast-1.compute.internal" is ready
[✔]  created 1 nodegroup(s) in cluster "dr-eks"
[✔]  created 0 managed nodegroup(s) in cluster "dr-eks"
[ℹ]  checking security group configuration for all nodegroups
[ℹ]  all nodegroups have up-to-date configuration
Error: failed to create nodegroups for cluster "dr-eks"
[ℹ]  1 error(s) occurred and nodegroups haven't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete nodegroup --region=ap-southeast-1 --cluster=dr-eks --name=<name>' for each of the failed nodegroup
[✖]  waiting for CloudFormation changeset "eksctl-update-cluster-1585031314" for stack "eksctl-dr-eks-cluster": RequestCanceled: waiter context canceled
caused by: context deadline exceeded
Error: failed to create nodegroups for cluster "dr-eks"
[ℹ]  1 error(s) occurred and nodegroups haven't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete nodegroup --region=ap-southeast-1 --cluster=dr-eks --name=<name>' for each of the failed nodegroup
[✖]  waiting for CloudFormation changeset "eksctl-update-cluster-1585031316" for stack "eksctl-dr-eks-cluster": RequestCanceled: waiter context canceled
caused by: context deadline exceeded
+ date
Tue Mar 24 06:53:35 UTC 2020
+ set +ex
vineet-krishna commented 4 years ago

Well, this issue does not seem to occur when I use eksctl version 0.13.0 in my Docker container. But this forces me to downgrade my EKS version to 1.14, because the --allow-privileged=true kubelet flag no longer exists in EKS 1.15, which causes the nodegroups to fail while trying to join the cluster.
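Incidentally, the Dockerfile above downloads latest_release, so builds are not pinned to a version. A sketch of pinning the download to a specific tag, assuming the release tag matches the plain version number (as it did for these releases):

# Pin eksctl to a known release (0.13.0 here) instead of latest_release.
RUN curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/0.13.0/eksctl_$(uname -s)_amd64.tar.gz" \
  | tar xz -C /tmp \
  && mv /tmp/eksctl /usr/local/bin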

cPu1 commented 4 years ago
[ℹ]  cluster stack is missing resources for Fargate
[ℹ]  adding missing resources to cluster stack

The error suggests that each of the create nodegroup commands is trying to modify the cluster stack because the stack is missing resources for Fargate[1]. Since multiple processes are attempting to update the stack at the same time, the stack update (more specifically, the execution of the changeset) is failing.

[✖]  executing CloudFormation ChangeSet "eksctl-update-cluster-1585031312" for stack "eksctl-dr-eks-cluster": ChangeSetNotFound: ChangeSet [eksctl-update-cluster-1585031312] does not exist

The execution of the changeset fails because after a stack update is initiated, all previous changesets are invalidated.
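This can also be confirmed from the CLI; a sketch using the standard AWS CLI, with the stack name and region taken from the logs above:

# List the changesets recorded against the cluster stack; changesets
# superseded by an executed stack update are no longer executable.
aws cloudformation list-change-sets \
  --stack-name eksctl-dr-eks-cluster \
  --region ap-southeast-1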

Not all commands in eksctl are safe for concurrent use (there is no shared lock), and eksctl create nodegroup is one of them in this case because it tries to update the cluster stack.
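If the nodegroup configs had to stay in separate files, a workaround consistent with this diagnosis would be to drop the & and run the creations one at a time, so only a single eksctl process ever updates the cluster stack; a sketch:

#!/usr/bin/env bash
set -e

# Sequential creation: one eksctl process at a time, so concurrent
# updates to the cluster stack (and the changeset conflicts above)
# cannot occur. Slower, but safe.
for ng in ng-app.yml ng-default.yml ng-kiam.yml ng-monitoring.yml ng-orch-compute.yml; do
  eksctl create nodegroup -f "$ng"
done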

What you're trying to achieve – creating nodegroups in parallel – is already built into eksctl create nodegroup -f config.yaml as it parallelises creation of all new nodegroups in the specified config file. To fix the error, you need to move all nodegroups to a single ClusterConfig file and run eksctl create nodegroup -f config-with-all-nodegroups.yaml only once.
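A minimal sketch of such a consolidated config; the cluster name, region, and nodegroup names are taken from the logs above, while the instance types and sizes are placeholders that would come from the original ng-*.yml files:

# config-with-all-nodegroups.yaml (illustrative)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dr-eks
  region: ap-southeast-1

nodeGroups:
  - name: ng-app-v2
    instanceType: m5.large     # placeholder; copy from ng-app.yml
    desiredCapacity: 2
  - name: ng-default-v1
    instanceType: m5.large     # placeholder; copy from ng-default.yml
    desiredCapacity: 2
  - name: ng-kiam-v1
    instanceType: m5.large     # placeholder; copy from ng-kiam.yml
    desiredCapacity: 2
  - name: ng-monitoring-v1
    instanceType: m5.large     # placeholder; copy from ng-monitoring.yml
    desiredCapacity: 2
  - name: ng-orch-compute-v1
    instanceType: m5.large     # placeholder; copy from ng-orch-compute.yml
    desiredCapacity: 2

A single run of eksctl create nodegroup -f config-with-all-nodegroups.yaml then creates all five nodegroup stacks in parallel, with only one eksctl process touching the cluster stack.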

[1]: This is actually a minor (harmless) bug: eksctl isn't actually adding any resources missing for Fargate, and it shouldn't attempt the stack update at all when Fargate profiles aren't being used.

martina-if commented 4 years ago

I am closing this issue but please feel free to reopen it if you think there's something we haven't addressed.