Open cartermckinnon opened 1 year ago
@cartermckinnon What ever happened with #1004?
We're facing a similar issue where aws eks wait cluster-active fails due to a transient timeout with the AWS API, and then a node gets stuck without joining the cluster (which has other knock-on effects, wedging cluster-autoscaler).
2024-03-20T15:00:46+0000 [eks-bootstrap] INFO: --b64-cluster-ca or --apiserver-endpoint is not defined, describing cluster...
Connect timeout on endpoint URL: "https://eks.us-west-2.amazonaws.com/clusters/eks-prod-us-west-2"
Exited with error on line 358
It seems like the patch in #1004 would fix our problem, but it appears it was closed after sitting for a long time.
The best thing to do here is to pass --apiserver-endpoint and --b64-cluster-ca and avoid the DescribeCluster call entirely. This fallback mechanism has been removed in our AL2023 AMIs.
I'll see if we can reboot the PR, in any case.
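For anyone landing here, a minimal sketch of the suggestion above: supply both values up front so bootstrap never needs DescribeCluster. The helper name build_bootstrap_cmd and the argument values are hypothetical, and the /etc/eks/bootstrap.sh path is assumed from the AL2 AMI layout; only the two flags come from this thread.

```shell
#!/usr/bin/env bash
# Sketch only: build_bootstrap_cmd is a hypothetical helper, not part of the AMI.
# It prints the bootstrap invocation with both flags set, so the script has no
# reason to fall back to describing the cluster.
build_bootstrap_cmd() {
  local cluster_name="$1" endpoint="$2" ca="$3"
  # Both values are needed; if either is missing, bootstrap would describe
  # the cluster, which is exactly the call we want to avoid.
  if [ -z "$endpoint" ] || [ -z "$ca" ]; then
    echo "both the API server endpoint and the base64 cluster CA are required" >&2
    return 1
  fi
  # Path assumed from the AL2 AMI layout.
  printf '%s %s --apiserver-endpoint %s --b64-cluster-ca %s\n' \
    /etc/eks/bootstrap.sh "$cluster_name" "$endpoint" "$ca"
}
```

In user data you would call it with values injected at launch time (for example by your provisioning tooling) and run the printed command, rather than looking them up on the node.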
(relayed from an internal ticket)
What happened:
aws eks wait cluster-active may get rate-limited (TooManyRequestsException) and cause the bootstrap script to terminate, instead of falling back to the retry logic around aws eks describe-cluster.
What you expected to happen:
The describe-cluster call should be retried the desired number of times, despite rate-limiting errors.
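The expected behaviour can be sketched as a generic retry wrapper with exponential backoff. This is not the bootstrap script's actual retry logic or the patch from #1004; the function name retry and its argument order are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: retry is a hypothetical helper, not taken from bootstrap.sh.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  # Re-run the command until it succeeds or the attempts are used up.
  # Any non-zero exit (throttling, connect timeout, ...) counts as a failure.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2)) # double the wait after each failure
  done
}

# Example call (not executed here); the variables are placeholders:
#   retry 5 2 aws eks describe-cluster --name "$CLUSTER_NAME" --region "$AWS_REGION"
```

Wrapping the wait and describe calls this way would mean a single TooManyRequestsException or connect timeout no longer terminates the script, though adding random jitter to the delay is probably worth considering when many nodes boot at once.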