eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

eksctl fails trying to provision resources in AZ's with insufficient capacity, such as us-east-1e #3816

Closed: morancj closed this issue 3 years ago

morancj commented 3 years ago

What were you trying to accomplish? Create a new simple cluster in us-east-1

What happened? eksctl create cluster fails: the EC2 resources report Resource creation cancelled, and the EKS control plane fails with Cannot create cluster 'test-cluster' because us-east-1e, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f:

2021-06-07 15:31:13 [✖]  AWS::EC2::Route/PublicSubnetRoute: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 15:31:13 [✖]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPublicUSEAST1E: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 15:31:13 [✖]  AWS::EC2::NatGateway/NATGateway: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 15:31:13 [✖]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPrivateUSEAST1E: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 15:31:13 [✖]  AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Cannot create cluster 'test-cluster' because us-east-1e, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f (Service: AmazonEKS; Status Code: 400; Error Code: UnsupportedAvailabilityZoneException; Request ID: 6a95c4d0-6b0d-49e2-9a96-56a414ba592f; Proxy: null)"

and then instructs me to delete the cluster.

I note Creating and managing clusters advises use of the --zones flag, but this doesn't work with -f:

➜ eksctl create cluster -f cluster.yaml --zones=us-east-1a,us-east-1d
Error: cannot use --zones when --config-file/-f is set

Other related information

I've been advised to use these resource quota limits:

VPCs per region: default 5, suggested minimum 30.
EC2 VPC Elastic IPs: default 5, suggested minimum 30.
EC2 Classic Elastic IPs: default 5, suggested minimum 30.
NAT gateways per AZ: default 5, suggested minimum 30.
Internet gateways per region: default 5, suggested minimum 30.

Many regions do not offer the ability to request increases for some services, such as EC2 Classic Elastic IPs. Per AWS' docs, Using Service Quotas request templates:

A request template can include up to 10 quota increases.

source

There is also a limit of one request template, so with these 5 quota increases per region, a single template covers at most two regions; requesting the increases for any further regions means submitting individual limit increase requests.

eu-west-1 is the only European region which allows requesting a limit increase for EC2 Classic Elastic IPs, further limiting region choice.

These all compound the effect of this issue. There are 5 closed issues around this AZ. Skipping us-east-1e (or better, handling insufficient capacity for any AZ) would be helpful.

How to reproduce it?

Start with a basic cluster.yaml like this:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-east-1

nodeGroups:
  - name: ng-1
    instanceType: t3a.large
    desiredCapacity: 3
  - name: ng-2
    instanceType: t3a.large
    desiredCapacity: 2

eksctl will use random AZs. If it chooses us-east-1e, creation will fail. In that case, running eksctl create cluster --dry-run -f cluster.yaml > test-cluster.yaml will generate a ClusterConfig like this:

test-cluster.yaml

```yaml
apiVersion: eksctl.io/v1alpha5
availabilityZones:
- us-east-1e
- us-east-1c
iam:
  vpcResourceControllerPolicy: true
  withOIDC: false
kind: ClusterConfig
metadata:
  name: test-cluster
  region: us-east-1
  version: "1.19"
nodeGroups:
- amiFamily: AmazonLinux2
  desiredCapacity: 3
  disableIMDSv1: false
  disablePodIMDS: false
  iam:
    withAddonPolicies:
      albIngress: false
      appMesh: null
      appMeshPreview: null
      autoScaler: false
      certManager: false
      cloudWatch: false
      ebs: false
      efs: false
      externalDNS: false
      fsx: false
      imageBuilder: false
      xRay: false
  instanceSelector: {}
  instanceType: t3a.large
  labels:
    alpha.eksctl.io/cluster-name: test-cluster
    alpha.eksctl.io/nodegroup-name: ng-1
  name: ng-1
  privateNetworking: false
  securityGroups:
    withLocal: true
    withShared: true
  ssh:
    allow: false
  volumeIOPS: 3000
  volumeSize: 80
  volumeThroughput: 125
  volumeType: gp3
- amiFamily: AmazonLinux2
  desiredCapacity: 2
  disableIMDSv1: false
  disablePodIMDS: false
  iam:
    withAddonPolicies:
      albIngress: false
      appMesh: null
      appMeshPreview: null
      autoScaler: false
      certManager: false
      cloudWatch: false
      ebs: false
      efs: false
      externalDNS: false
      fsx: false
      imageBuilder: false
      xRay: false
  instanceSelector: {}
  instanceType: t3a.large
  labels:
    alpha.eksctl.io/cluster-name: test-cluster
    alpha.eksctl.io/nodegroup-name: ng-2
  name: ng-2
  privateNetworking: false
  securityGroups:
    withLocal: true
    withShared: true
  ssh:
    allow: false
  volumeIOPS: 3000
  volumeSize: 80
  volumeThroughput: 125
  volumeType: gp3
privateCluster:
  enabled: false
vpc:
  autoAllocateIPv6: false
  cidr: 192.168.0.0/16
  clusterEndpoints:
    privateAccess: false
    publicAccess: true
  manageSharedNodeSecurityGroupRules: true
  nat:
    gateway: Single
```

If I modify test-cluster.yaml to replace us-east-1e, eksctl create cluster -f test-cluster.yaml succeeds. I realise this might not happen every time. :slightly_smiling_face:

Logs

Failure logs

```shell
➜ eksctl create cluster -f cluster.yaml
2021-06-07 14:57:11 [ℹ]  eksctl version 0.51.0
2021-06-07 14:57:11 [ℹ]  using region us-east-1
2021-06-07 14:57:12 [ℹ]  setting availability zones to [us-east-1e us-east-1d]
2021-06-07 14:57:12 [ℹ]  subnets for us-east-1e - public:192.168.0.0/19 private:192.168.64.0/19
2021-06-07 14:57:12 [ℹ]  subnets for us-east-1d - public:192.168.32.0/19 private:192.168.96.0/19
2021-06-07 14:57:12 [ℹ]  nodegroup "ng-1" will use "ami-0ef0c69399dbb5f3f" [AmazonLinux2/1.19]
2021-06-07 14:57:12 [ℹ]  nodegroup "ng-2" will use "ami-0ef0c69399dbb5f3f" [AmazonLinux2/1.19]
2021-06-07 14:57:12 [ℹ]  using Kubernetes version 1.19
2021-06-07 14:57:12 [ℹ]  creating EKS cluster "test-cluster" in "us-east-1" region with un-managed nodes
2021-06-07 14:57:12 [ℹ]  2 nodegroups (ng-1, ng-2) were included (based on the include/exclude rules)
2021-06-07 14:57:12 [ℹ]  will create a CloudFormation stack for cluster itself and 2 nodegroup stack(s)
2021-06-07 14:57:12 [ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2021-06-07 14:57:12 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=test-cluster'
2021-06-07 14:57:12 [ℹ]  CloudWatch logging will not be enabled for cluster "test-cluster" in "us-east-1"
2021-06-07 14:57:12 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-east-1 --cluster=test-cluster'
2021-06-07 14:57:12 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "test-cluster" in "us-east-1"
2021-06-07 14:57:12 [ℹ]  2 sequential tasks: { create cluster control plane "test-cluster", 3 sequential sub-tasks: { wait for control plane to become ready, create addons, 2 parallel sub-tasks: { create nodegroup "ng-1", create nodegroup "ng-2" } } }
2021-06-07 14:57:12 [ℹ]  building cluster stack "eksctl-test-cluster-cluster"
2021-06-07 14:57:13 [ℹ]  deploying stack "eksctl-test-cluster-cluster"
2021-06-07 14:57:43 [ℹ]  waiting for CloudFormation stack "eksctl-test-cluster-cluster"
2021-06-07 14:58:14 [ℹ]  waiting for CloudFormation stack "eksctl-test-cluster-cluster"
2021-06-07 14:59:14 [ℹ]  waiting for CloudFormation stack "eksctl-test-cluster-cluster"
2021-06-07 14:59:14 [✖]  unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-test-cluster-cluster"
2021-06-07 14:59:14 [ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
2021-06-07 14:59:15 [!]  AWS::EC2::Subnet/SubnetPublicUSEAST1D: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::RouteTable/PrivateRouteTableUSEAST1D: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::Subnet/SubnetPrivateUSEAST1D: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::RouteTable/PublicRouteTable: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::VPCGatewayAttachment/VPCGatewayAttachment: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::Subnet/SubnetPrivateUSEAST1E: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::RouteTable/PrivateRouteTableUSEAST1E: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SecurityGroup/ClusterSharedNodeSecurityGroup: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::IAM::Role/ServiceRole: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SecurityGroup/ControlPlaneSecurityGroup: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::Route/PublicSubnetRoute: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPublicUSEAST1D: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::NatGateway/NATGateway: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPrivateUSEAST1E: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPublicUSEAST1E: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPrivateUSEAST1D: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::IAM::Policy/PolicyELBPermissions: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::IAM::Policy/PolicyCloudWatchMetrics: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [!]  AWS::EC2::SecurityGroupIngress/IngressInterNodeGroupSG: DELETE_IN_PROGRESS
2021-06-07 14:59:15 [✖]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPrivateUSEAST1E: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 14:59:15 [✖]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPrivateUSEAST1D: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 14:59:15 [✖]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPublicUSEAST1E: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 14:59:15 [✖]  AWS::EC2::NatGateway/NATGateway: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 14:59:15 [✖]  AWS::EC2::Route/PublicSubnetRoute: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 14:59:15 [✖]  AWS::EC2::SubnetRouteTableAssociation/RouteTableAssociationPublicUSEAST1D: CREATE_FAILED – "Resource creation cancelled"
2021-06-07 14:59:15 [✖]  AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Cannot create cluster 'test-cluster' because us-east-1e, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f (Service: AmazonEKS; Status Code: 400; Error Code: UnsupportedAvailabilityZoneException; Request ID: 1215c702-8018-4c6a-b922-b2dafe7249d4; Proxy: null)"
2021-06-07 14:59:15 [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2021-06-07 14:59:15 [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-east-1 --name=test-cluster'
2021-06-07 14:59:15 [✖]  ResourceNotReady: failed waiting for successful resource state
Error: failed to create cluster "test-cluster"
```

Anything else we need to know?

What OS are you using? Ubuntu 20.04.1 LTS:

/etc/os-release

```
➜ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS (fossa-charmander X68)"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
```

Are you using a downloaded binary or did you compile eksctl?

What type of AWS credentials are you using (i.e. default/named profile, MFA)? - please don't include actual credentials though!

Versions

➜ eksctl version
0.51.0
➜ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Callisto13 commented 3 years ago

I note Creating and managing clusters advises use of the --zones flag

There should be an equivalent top-level availabilityZones: [] field, and/or per-nodegroup nodeGroup.availabilityZones: [] / managedNodeGroup.availabilityZones: [] fields in the config file. Could you try those?
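As a rough illustration, a minimal sketch of what that could look like (zone names are illustrative, not a tested config for this issue):

```yaml
# Sketch only: pin the cluster/VPC AZs at the top level, and optionally pin an
# individual nodegroup more narrowly with its own availabilityZones list.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-cluster
  region: us-east-1
availabilityZones: ["us-east-1a", "us-east-1d"]   # zones chosen for illustration
nodeGroups:
  - name: ng-1
    instanceType: t3a.large
    desiredCapacity: 3
    availabilityZones: ["us-east-1a"]             # per-nodegroup pinning (optional)
```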

morancj commented 3 years ago

Thanks. I found that, and setting the AZs there worked around this issue. I thought I'd added it to my massive missive above; apparently not, apologies! Here it is (with .txt appended for GitHub): cluster.yaml.txt

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
availabilityZones:
  - us-east-1a
  - us-east-1d

metadata:
  name: test-cluster
  region: us-east-1

nodeGroups:
  - name: ng-1
    instanceType: t3a.large
    desiredCapacity: 3
  - name: ng-2
    instanceType: t3a.large
    desiredCapacity: 2
Callisto13 commented 3 years ago

Okay, so just to TL;DR: the problem is that even if you specify the AZs in the config (knowing which ones have space), eksctl will still create things wherever it wants?

morancj commented 3 years ago

The "problem" is that unless told otherwise, eksctl can select an AZ which is unable to fulfil the user's request.

IMO, either eksctl should check in advance if the AZ can fulfil the request, or, if that's not possible, avoid the us-east-1e AZ as it is often unable to provision these resources.

Callisto13 commented 3 years ago

🤔 I don't think we want eksctl to have opinions like that. Once we start checking for capacity in one thing, we open ourselves up to making decisions about capacity for everything else, not to mention the whole mess of API calls that would require. Eksctl is designed on the premise that 'you know what you have, you tell us what to use'.

morancj commented 3 years ago

If https://github.com/weaveworks/eksctl/issues/118#issuecomment-406597480 were implemented, the UX would be much improved.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 5 days with no activity.

oxr463 commented 3 years ago

I received this error:

2021-10-26 18:53:38 [✖]  AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Cannot create cluster 'dev' because us-east-1e, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f (Service: AmazonEKS; Status Code: 400; Error Code: UnsupportedAvailabilityZoneException; Request ID: [ . . . REDACTED . . . ]; Proxy: null)"
2021-10-26 18:53:38 [!]  1 error(s) occurred and cluster hasn't been created 

I used the existing VPC example as a base, (See: https://github.com/weaveworks/eksctl/blob/main/examples/04-existing-vpc.yaml).

When I added the top-level availabilityZones it gave me this error:

Error: vpc.subnets and availabilityZones cannot be set at the same time
Callisto13 commented 3 years ago

@oxr463 thank you for asking. That error is intended. The availabilityZones setting is for ensuring new VPC resources are created where you want them to be. If you already have a VPC and are therefore setting subnets which have already been created in AZs, then there is nothing for the availabilityZones setting to do. Eksctl therefore errors rather than quietly ignoring that config.
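For illustration, a hedged sketch of the existing-VPC form (loosely based on examples/04-existing-vpc.yaml; VPC/subnet IDs and names here are placeholders, not taken from this issue). The AZs are implied by the keys under vpc.subnets, so the top-level availabilityZones field isn't used at all:

```yaml
# Sketch only: with an existing VPC, each subnet entry is keyed by its AZ,
# so eksctl already knows where resources will land.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: dev
  region: us-east-1
vpc:
  subnets:
    private:
      us-east-1a: { id: subnet-0aaaaaaaaaaaaaaaa }  # AZ comes from this key
      us-east-1d: { id: subnet-0bbbbbbbbbbbbbbbb }
    public:
      us-east-1a: { id: subnet-0cccccccccccccccc }
      us-east-1d: { id: subnet-0dddddddddddddddd }
nodeGroups:
  - name: ng-1
    instanceType: t3a.large
    desiredCapacity: 2
```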

jglick commented 2 years ago

Similar to #905? Is there any workaround for this? I am just looking for a reliable, scriptable way to create a cluster in a given region, using whatever AZs are on offer. Note that I am using a YAML config.

Callisto13 commented 2 years ago

cc @Himangini and the rest of the team since I am no longer working on this project

matthewbordas commented 2 years ago

I just ran into this issue today. I had to delete the CFN stack separately.