eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

Creating cluster timeout error #2201

Closed ArturChe closed 4 years ago

ArturChe commented 4 years ago

What happened? Cluster creation error output:

Create cluster...
[ℹ]  eksctl version 0.20.0-rc.0
[ℹ]  using region us-east-1
[✔]  using existing VPC (vpc-) and subnets (private:[subnet- subnet- ] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "linux-ng" will use "ami-06d4f570358b1b626" [AmazonLinux2/1.15]
[ℹ]  nodegroup "windows-ng" will use "ami-0c80b2e9538f07e08" [WindowsServer2019FullContainer/1.15]
[ℹ]  using Kubernetes version 1.15
[ℹ]  creating EKS cluster "dev-workers" in "us-east-1" region with managed nodes and un-managed nodes
[ℹ]  3 nodegroups (linux-mng, linux-ng, windows-ng) were included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 2 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=dev-workers'
[ℹ]  CloudWatch logging will not be enabled for cluster "dev-workers" in "us-east-1"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-east-1 --cluster=dev-workers'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "dev-workers" in "us-east-1"
[ℹ]  2 sequential tasks: { create cluster control plane "dev-workers", 2 parallel sub-tasks: { install Windows VPC controller, 3 parallel sub-tasks: { create nodegroup "linux-ng", create nodegroup "windows-ng", create managed nodegroup "linux-mng" } } }
[ℹ]  building cluster stack "eksctl-dev-workers-cluster"
[ℹ]  deploying stack "eksctl-dev-workers-cluster"
[ℹ]  building managed nodegroup stack "eksctl-dev-workers-nodegroup-linux-mng"
[ℹ]  building nodegroup stack "eksctl-dev-workers-nodegroup-linux-ng"
[ℹ]  building nodegroup stack "eksctl-dev-workers-nodegroup-windows-ng"
[ℹ]  --nodes-max=1 was set automatically for nodegroup windows-ng
[ℹ]  --nodes-max=1 was set automatically for nodegroup linux-ng
[ℹ]  deploying stack "eksctl-dev-workers-nodegroup-linux-mng"
[ℹ]  deploying stack "eksctl-dev-workers-nodegroup-windows-ng"
[ℹ]  deploying stack "eksctl-dev-workers-nodegroup-linux-ng"
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-east-1 --name=dev-workers'
[✖]  getting list of API resources for raw REST client: Get "https://FFFFFF.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp 34.195.116.116:443: i/o timeout

What you expected to happen? Output showing successful cluster creation.

How to reproduce it? Create a cluster with Windows support from a config file.

Anything else we need to know? In CloudFormation I can see that all cluster stacks created successfully. All EC2 instances are running (including windows node). It cannot be reproduced on eksctl version 0.19.0.

Versions

$ eksctl version
0.20.0-rc.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:40:13Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
sayboras commented 4 years ago

Just curious whether you saw any issues in CloudFormation (CF). Additionally, can you share the config file that you used to create the above cluster?

ArturChe commented 4 years ago

@sayboras the file is:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: cluster
  region: us-east-1
  version: '1.16'

vpc:
  subnets:
    private:
      us-east-1a: { id: subnet-ffffff}
      us-east-1b: { id: subnet-ffffff}

iam:
  withOIDC: true

managedNodeGroups:
  - name: linux-mng
    instanceType: t2.medium
    minSize: 2
    privateNetworking: true

nodeGroups:
  - name: linux-ng
    instanceType: t2.small
    minSize: 1
    privateNetworking: true
  - name: windows-ng
    instanceType: t3.large
    minSize: 1
    volumeSize: 100
    amiFamily: WindowsServer2019FullContainer
    privateNetworking: true

Sometimes it does not create the CF stack for the Windows node group and sometimes it does, but it always fails with a timeout error. I have downgraded the tool to 0.19.0 and it creates the cluster without any problems.

martina-if commented 4 years ago

Hi @ArturChe, this looks like a transient error on AWS. Can you retry and see if you can reproduce this with eksctl 0.20.0?

ArturChe commented 4 years ago

@martina-if I have tried with eksctl 0.20.0 and the error cannot be reproduced. I will stay on this version. I think the issue can be closed now.

dave-meier commented 4 years ago

This is happening every time for me on 0.20.0. How can I get past it?

PS D:\dave> eksctl create cluster -f .\eks-cluster-spec.yaml --install-vpc-controllers
[ℹ]  eksctl version 0.20.0
[ℹ]  using region us-west-2
[✔]  using existing VPC (vpc-xxx) and subnets (private:[subnet-xxx subnet-yyy] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "windows-ng" will use "ami-0ee42fc568b2881e1" [WindowsServer2019CoreContainer/1.16]
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using Kubernetes version 1.16
[ℹ]  creating EKS cluster "dave-eks" in "us-west-2" region with managed nodes and un-managed nodes
[ℹ]  2 nodegroups (linux-ng, windows-ng) were included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=dave-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "dave-eks" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=dave-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "dave-eks" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "dave-eks", 2 parallel sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } }
[ℹ]  building cluster stack "eksctl-dave-eks-cluster"
[ℹ]  deploying stack "eksctl-dave-eks-cluster"
[ℹ]  building managed nodegroup stack "eksctl-dave-eks-nodegroup-linux-ng"
[ℹ]  building nodegroup stack "eksctl-dave-eks-nodegroup-windows-ng"
[ℹ]  --nodes-min=1 was set automatically for nodegroup windows-ng
[ℹ]  --nodes-max=1 was set automatically for nodegroup windows-ng
[ℹ]  deploying stack "eksctl-dave-eks-nodegroup-linux-ng"
[ℹ]  deploying stack "eksctl-dave-eks-nodegroup-windows-ng"
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
[✖]  getting list of API resources for raw REST client: Get "https://4C0128CED50557CE2B2B3DEA032A0597.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 52.12.217.181:443: i/o timeout
Error: failed to create cluster "dave-eks"


eks-cluster-spec.yaml:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dave-eks
  region: us-west-2
  version: '1.16'

vpc:
  id: "vpc-05d662efb65a29dac"
  cidr: "10.12.0.0/16"
  subnets:
    private:
      us-west-2b:
          id: "subnet-xxxx"
          cidr: "10.12.0.128/25"
      us-west-2a:
          id: "subnet-yyyy"
          cidr: "10.12.10.0/25"      
managedNodeGroups:
  - name: linux-ng
    instanceType: t2.large
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true

nodeGroups:
  - name: windows-ng
    instanceType: m5.large
    desiredCapacity: 1
    privateNetworking: true # if only 'Private' subnets are given, this must be enabled
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-xxx']
    ssh:
      publicKeyName: 'Development-01.2020-Windows'
      allow: true
    volumeSize: 100
    amiFamily: WindowsServer2019CoreContainer
dave-meier commented 4 years ago

After retrying the same thing 5 times, it finally worked. Perhaps the timeout is set too low?

In the run that worked, the Linux and Windows EC2 instances are in the same subnet. Not sure if that has any bearing, but in one of the failed attempts to create the cluster, I noticed that the instances were not in the same subnet. Of course, at least 2 subnets must be specified, so I wouldn't expect the instances to be required to share a subnet.

dave-meier commented 4 years ago

I'm now using 2 private and 2 public subnets, and that worked the first time. The master node is on one of the 2 public subnets, and the worker node is on the other public subnet. That being said, I still see the timeout quite often.

martina-if commented 4 years ago

Hi @dave-meier, thanks for reporting this. Since it ended up working, I am not sure why these timeouts happened; it could be that the Windows nodegroups take longer to bootstrap. The workaround for this is to increase the timeout using eksctl create cluster --timeout ....

dave-meier commented 4 years ago

Thanks @martina-if

I tried the timeout param, but still got the problem. On the actual REST call that times out, there is a 32s timeout specified. Is there a way to increase that timeout? Perhaps that is the problem. I checked the cluster and only the linux node is present after this failure.

PS D:\dave> eksctl create cluster -f .\eks-cluster-spec-my-vpc-with-private-subnets.yaml --install-vpc-controllers --timeout 40m
[ℹ]  eksctl version 0.20.0
[ℹ]  using region us-west-2
[✔]  using existing VPC (vpc-xxx) and subnets (private:[subnet-yyy subnet-zzz] public:[])
[!]  custom VPC/subnets will be used; if resulting cluster doesn't function as expected, make sure to review the configuration of VPC/subnets
[ℹ]  nodegroup "windows-ng" will use "ami-0ee42fc568b2881e1" [WindowsServer2019CoreContainer/1.16]
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using EC2 key pair "Development-01.2020-Windows"
[ℹ]  using Kubernetes version 1.16
[ℹ]  creating EKS cluster "dave-eks" in "us-west-2" region with managed nodes and un-managed nodes
[ℹ]  2 nodegroups (linux-ng, windows-ng) were included (based on the include/exclude rules)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=dave-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "dave-eks" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=dave-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "dave-eks" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "dave-eks", 2 parallel sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } }
[ℹ]  building cluster stack "eksctl-dave-eks-cluster"
[ℹ]  deploying stack "eksctl-dave-eks-cluster"
[ℹ]  building managed nodegroup stack "eksctl-dave-eks-nodegroup-linux-ng"
[ℹ]  building nodegroup stack "eksctl-dave-eks-nodegroup-windows-ng"
[ℹ]  --nodes-min=1 was set automatically for nodegroup windows-ng
[ℹ]  --nodes-max=1 was set automatically for nodegroup windows-ng
[ℹ]  deploying stack "eksctl-dave-eks-nodegroup-linux-ng"
[ℹ]  deploying stack "eksctl-dave-eks-nodegroup-windows-ng"
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
[✖]  getting list of API resources for raw REST client: Get "https://2EF037DA76941528DE07C60106AD9925.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 44.231.153.96:443: i/o timeout
Error: failed to create cluster "dave-eks"

PS D:\dave> aws eks --region us-west-2 update-kubeconfig --name dave-eks
Added new context arn:aws:eks:us-west-2:719681605826:cluster/dave-eks to C:\Users\dmeier\.kube\config

PS D:\dave> kubectl get no -o wide
NAME                                         STATUS   ROLES    AGE   VERSION              INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-12-40-219.us-west-2.compute.internal   Ready    <none>   50m   v1.16.8-eks-e16311   10.12.40.219   <none>        Amazon Linux 2   4.14.181-140.257.amzn2.x86_64   docker://19.3.6
ArturChe commented 4 years ago

@martina-if @dave-meier Does --timeout work with the -f parameter?

dave-meier commented 4 years ago

@ArturChe - the help indicates that --timeout can work alongside -f, yes.

Still getting the problem and ran with --verbose 5. At the end this is what I see:

2020-06-25T23:00:06Z [▶]  completed task: create cluster control plane "dave-eks2"
2020-06-25T23:00:06Z [▶]  started task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } }
2020-06-25T23:00:06Z [▶]  started task: install Windows VPC controller
2020-06-25T23:00:06Z [▶]  started task: install Windows VPC controller
2020-06-25T23:00:36Z [▶]  failed task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-06-25T23:00:36Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-06-25T23:00:36Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks2'
2020-06-25T23:00:36Z [✖]  getting list of API resources for raw REST client: Get "https://858FA053B71A5400107BB41AD70CB563.yl4.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 52.10.237.158:443: i/o timeout
Error: failed to create cluster "dave-eks2"
martina-if commented 4 years ago

Hi @dave-meier,

There is no way to increase that specific timeout AFAIK. I see two things you could check: the first is the VPC and the second is the CloudFormation logs. Can you post the error for the resources that failed to create?
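
As one way to check the network path independently of eksctl, here is a minimal Python sketch that retries a plain TCP connection to the cluster's API endpoint on port 443, the connection that fails with "i/o timeout" in the logs above. The function name `wait_for_tcp` and its default values are assumptions chosen for illustration; this is not how eksctl performs the check internally.

```python
import socket
import time


def wait_for_tcp(host: str, port: int = 443, attempts: int = 10,
                 connect_timeout: float = 5.0, delay: float = 3.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if every attempt fails. All names and defaults here are
    hypothetical; they are not taken from eksctl.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection resolves the name and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError as exc:  # covers timeouts, refused connections, DNS errors
            print(f"attempt {attempt}/{attempts}: {host}:{port} not reachable ({exc})")
            if attempt < attempts:
                time.sleep(delay)
    return False


# Example call (replace the host with the endpoint from your own error message):
# wait_for_tcp("FFFFFF.gr7.us-east-1.eks.amazonaws.com")
```

If this probe also times out from the machine running eksctl, the problem is likely in the network path or name resolution between that machine and the public endpoint rather than in the nodegroup stacks.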

dave-meier commented 4 years ago

Hi @martina-if

I have the correct labels on my private subnets and I have tested an EC2 instance with one of the subnets assigned. NAT / Internet access works fine from the test instance. When it fails I am left with the CF "eksctl-dave-eks-cluster" stack which appears to be fully successful. There are no CF stacks for the node groups. There is nothing in CloudWatch - is there a setting to enable that?

Output of "eksctl create cluster -f .\eks-cluster-spec-my-vpc-with-private-subnets.yaml --install-vpc-controllers --timeout 40m --verbose=4":


2020-07-02T17:43:11Z [▶]  waiting for CloudFormation stack "eksctl-dave-eks-cluster"
2020-07-02T17:43:11Z [▶]  done after 10m44.9668461s of waiting for CloudFormation stack "eksctl-dave-eks-cluster"
2020-07-02T17:43:11Z [▶]  processing stack outputs
2020-07-02T17:43:11Z [▶]  completed task: create cluster control plane "dave-eks"
2020-07-02T17:43:11Z [▶]  started task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } }
2020-07-02T17:43:11Z [▶]  started task: install Windows VPC controller
2020-07-02T17:43:11Z [▶]  started task: install Windows VPC controller
2020-07-02T17:43:41Z [▶]  failed task: install Windows VPC controller (will not run other sequential tasks)
2020-07-02T17:43:41Z [▶]  failed task: install Windows VPC controller (will not run other sequential tasks)
2020-07-02T17:43:41Z [▶]  failed task: 2 sequential sub-tasks: { install Windows VPC controller, 2 parallel sub-tasks: { create nodegroup "windows-ng", create managed nodegroup "linux-ng" } } (will not run other sequential tasks)
2020-07-02T17:43:41Z [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2020-07-02T17:43:41Z [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=dave-eks'
2020-07-02T17:43:41Z [✖]  getting list of API resources for raw REST client: Get "https://4B76B532F4C4AAB28E1F6E69F1BE7421.gr7.us-west-2.eks.amazonaws.com/api?timeout=32s": dial tcp 52.12.139.20:443: i/o timeout
eksctl : Error: failed to create cluster "dave-eks"
At line:1 char:1
+ eksctl create cluster -f .\eks-cluster-spec-my-vpc-with-private-subne ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Error: failed t...ster "dave-eks":String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
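
The timestamps in the verbose log above show how quickly the "install Windows VPC controller" task gives up. A small Python sketch of the arithmetic follows; the helper name `parse_ts` is made up for illustration, and the two log lines are copied from the output above.

```python
from datetime import datetime, timezone


def parse_ts(line: str) -> datetime:
    """Parse the leading UTC timestamp of a verbose eksctl log line (hypothetical helper)."""
    stamp = line.split(maxsplit=1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


started = '2020-07-02T17:43:11Z [▶]  started task: install Windows VPC controller'
failed = ('2020-07-02T17:43:41Z [▶]  failed task: install Windows VPC controller '
          '(will not run other sequential tasks)')

# Difference between the "failed task" and "started task" timestamps, in seconds.
elapsed = (parse_ts(failed) - parse_ts(started)).total_seconds()
print(f"task failed after {elapsed:.0f}s")  # prints "task failed after 30s"
```

The task fails about 30 seconds after it starts, even though --timeout 40m was passed. That is consistent with a short per-connection timeout on this step and suggests, though does not prove, that --timeout does not govern it.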
brycahta commented 4 years ago

+1, getting the exact same behavior as @dave-meier running on eksctl version 0.23.0. No CF stacks created for node groups and CF shows cluster creation as complete/successful so no error logs or events.

Using the Cluster with Linux and Windows workloads configuration from the AWS user guide:


---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: windows-prod
  region: us-west-2
  version: '1.17'  
managedNodeGroups:
  - name: linux-ng
    instanceType: t2.large
    minSize: 2

nodeGroups:
  - name: windows-ng
    instanceType: m5.large
    minSize: 2
    volumeSize: 100
    amiFamily: WindowsServer2019FullContainer
dave-meier commented 4 years ago

Hi @brycahta - I added my latest notes to #2382

michaelbeaumont commented 4 years ago

@brycahta Is the error reproducible or is it transient? Is that the exact config you're using? Which command are you running? I haven't been able to reproduce it with that config.

brycahta commented 4 years ago

@michaelbeaumont the error is transient and I could reproduce ~30% of the time.

I copied @dave-meier's latest suggestion in #2382 and stopped seeing the issue. However, I'm not testing this on a consistent basis, so I don't have any repro rates.

pkit commented 2 years ago

Still the same error on 0.105.0

    2022-09-15 16:04:14 [ℹ]  eksctl version 0.105.0
    .....
    2022-09-15 16:15:56 [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
    2022-09-15 16:15:56 [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-east-1 --name=dev'
    2022-09-15 16:15:56 [✖]  error creating Clientset: getting list of API resources for raw REST client: Get "https://3646D4BED90672B70F53263E4BC4CD4B.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp 52.44.60.175:443: i/o timeout

The CF stack was created successfully, though. But obviously there are no nodegroups and no way to recover...