Retry logic around describe-cluster doesn't handle rate-limiting

awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI

https://awslabs.github.io/amazon-eks-ami/

MIT No Attribution


Retry logic around describe-cluster doesn't handle rate-limiting #999

Open cartermckinnon opened 1 year ago

cartermckinnon commented 1 year ago

(relayed from an internal ticket)

What happened:

aws eks wait cluster-active may get rate-limited (TooManyRequestsException) and cause the bootstrap script to terminate, instead of falling back to the retry logic around aws eks describe-cluster.

What you expected to happen:

The describe-cluster call should be retried the desired number of times, despite rate-limiting errors.
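For illustration, the expected behaviour could be sketched as a generic retry wrapper with exponential backoff. This is only a sketch under stated assumptions, not the patch from any PR and not the code in bootstrap.sh: the function name `retry_with_backoff` and the variables `RETRY_ATTEMPTS` and `RETRY_BASE_DELAY` are hypothetical. It retries on any non-zero exit code, so a rate-limiting error or a connect timeout would be treated the same as any other transient failure.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command, retrying with exponential backoff.
# RETRY_ATTEMPTS and RETRY_BASE_DELAY are illustrative names, not bootstrap.sh settings.
retry_with_backoff() {
  local attempts="${RETRY_ATTEMPTS:-5}"
  local delay="${RETRY_BASE_DELAY:-1}"
  local n=1
  local rc=0

  while true; do
    # Running the command as an `if` condition keeps `set -e` from
    # terminating the script when the command fails.
    if "$@"; then
      return 0
    else
      rc=$?
    fi

    if (( n >= attempts )); then
      echo "giving up after ${n} attempts (last exit code ${rc})" >&2
      return "$rc"
    fi

    echo "attempt ${n} failed (exit code ${rc}), retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    n=$(( n + 1 ))
  done
}

# Example usage (needs the AWS CLI and credentials; CLUSTER_NAME is a placeholder):
#   retry_with_backoff aws eks wait cluster-active --name "$CLUSTER_NAME"
#   retry_with_backoff aws eks describe-cluster --name "$CLUSTER_NAME" \
#     --query 'cluster.endpoint' --output text
```

Wrapping the wait call as well as the describe call means a failure in either one is retried instead of ending the script.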

orirawlings commented 3 months ago

@cartermckinnon Whatever happened with #1004?

We're facing a similar issue where aws eks wait cluster-active fails due to a transient timeout with the AWS API and then a node gets stuck without joining the cluster (which has other knock-on effects, wedging cluster-autoscaler).

2024-03-20T15:00:46+0000 [eks-bootstrap] INFO: --b64-cluster-ca or --apiserver-endpoint is not defined, describing cluster...

Connect timeout on endpoint URL: "https://eks.us-west-2.amazonaws.com/clusters/eks-prod-us-west-2"
Exited with error on line 358

It seems like the patch in #1004 would fix our problem, but it appears it was closed after sitting for a long time.

cartermckinnon commented 2 weeks ago

The best thing to do here is to pass --apiserver-endpoint and --b64-cluster-ca and avoid the DescribeCluster call entirely. This fallback mechanism has been removed in our AL2023 AMIs.

I'll see if we can reboot the PR, in any case.
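As a rough sketch of the suggestion above, node user data could pass both flags so the bootstrap script never needs to describe the cluster. Everything here is an assumption for illustration: the cluster name, endpoint, and CA values are placeholders, and the script path /etc/eks/bootstrap.sh should be checked against the AMI in use. The real values can be read ahead of time from the `cluster.endpoint` and `cluster.certificateAuthority.data` fields of an `aws eks describe-cluster` response and templated in by whatever provisions the node.

```shell
#!/usr/bin/env bash

# Placeholder values (assumptions): substitute the real ones for your cluster.
CLUSTER_NAME="my-cluster"
APISERVER_ENDPOINT="https://EXAMPLE.us-west-2.eks.amazonaws.com"
B64_CLUSTER_CA="BASE64_ENCODED_CA_PLACEHOLDER"

# Both flags are supplied, so the "describing cluster..." fallback is not needed.
bootstrap_args=(
  "$CLUSTER_NAME"
  --apiserver-endpoint "$APISERVER_ENDPOINT"
  --b64-cluster-ca "$B64_CLUSTER_CA"
)

# Assumed script location; override BOOTSTRAP_SCRIPT if your AMI differs.
BOOTSTRAP_SCRIPT="${BOOTSTRAP_SCRIPT:-/etc/eks/bootstrap.sh}"

if [[ -x "$BOOTSTRAP_SCRIPT" ]]; then
  "$BOOTSTRAP_SCRIPT" "${bootstrap_args[@]}"
else
  # Off-node dry run: print the command instead of executing it.
  echo "would run: $BOOTSTRAP_SCRIPT ${bootstrap_args[*]}"
fi
```

Building the arguments as an array keeps each value intact as a single argument, even if it contains unusual characters.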