aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Allow for optional Auto Bootstrapping on AL2023 similar to what is provided on AL2 - decouple cluster and nodegroups #2455

Open rns350 opened 1 week ago

rns350 commented 1 week ago


Tell us about your request On AL2, a script at /etc/eks/bootstrap.sh could be called on node startup. Given the name of the cluster, it used the describe-cluster API to fetch the cluster information needed to connect to the API server. In AL2023, this script was removed because of API throttling observed when many nodes tried to join the cluster at the same time, all calling the describe-cluster operation. Now, for self-managed node groups, this information must be provided in a NodeConfig manifest in the user data.

This is perfectly reasonable and a more efficient use of resources, especially for large clusters; however, for our cluster running only a few nodes, it adds another step to the bootstrapping process. With AL2023, we end up making a describe-cluster call ourselves before deploying our node group CloudFormation template, so that we can gather this information and embed it in the user data using sed commands. With AL2, we could deploy the nodegroup and cluster fresh alongside each other, since all we needed to know about the cluster to deploy the nodegroups was its name, which can be predicted in advance. With AL2023, one of the required parameters is the API endpoint, which includes a random ID and cannot be predicted, so the nodegroup and cluster deployments are now coupled.
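A minimal sketch of this kind of pre-deployment step, written in Python rather than sed and not taken from our actual pipeline. It assumes the NodeConfig fields used by the AL2023 AMI (`name`, `apiServerEndpoint`, `certificateAuthority`, `cidr` under `spec.cluster`, with apiVersion `node.eks.aws/v1alpha1`) and an IPv4 cluster. The function names `node_config_from_cluster` and `fetch_cluster` are hypothetical, and the sample values in any usage are placeholders.

```python
def node_config_from_cluster(cluster: dict) -> str:
    """Render a NodeConfig document from the "cluster" object of a
    describe-cluster response.

    Assumption: an IPv4 cluster, so the service CIDR is found under
    kubernetesNetworkConfig.serviceIpv4Cidr.
    """
    name = cluster["name"]
    endpoint = cluster["endpoint"]  # contains the random ID, so it cannot be predicted
    ca_data = cluster["certificateAuthority"]["data"]  # already base64-encoded
    cidr = cluster["kubernetesNetworkConfig"]["serviceIpv4Cidr"]
    lines = [
        "apiVersion: node.eks.aws/v1alpha1",
        "kind: NodeConfig",
        "spec:",
        "  cluster:",
        f"    name: {name}",
        f"    apiServerEndpoint: {endpoint}",
        f"    certificateAuthority: {ca_data}",
        f"    cidr: {cidr}",
    ]
    return "\n".join(lines) + "\n"


def fetch_cluster(cluster_name: str, region: str) -> dict:
    """Call describe-cluster once, from the deployment pipeline.

    boto3 is a third-party package; it is imported lazily so the
    rendering function above can be used without it.
    """
    import boto3

    client = boto3.client("eks", region_name=region)
    return client.describe_cluster(name=cluster_name)["cluster"]
```

The rendered text is what would be embedded in the user data before the nodegroup stack is deployed, which is exactly why the cluster must exist first.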

We'd like to have the option to leave this work to the node, so that a bootstrap script can gather the details from the describe-cluster API and embed them into the NodeConfig YAML. This could be an opt-in feature for those who want it, and would once again decouple the cluster and self-managed nodegroup deployments.
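One way such an opt-in, node-side lookup could soften the throttling concern is to retry describe-cluster with exponential backoff and jitter. The sketch below is only an illustration of that idea, not an existing feature of the AMI; `fetch_with_backoff` and its parameters are hypothetical, and the `describe` callable stands in for whatever actually performs the describe-cluster call.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    describe: Callable[[], dict],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> dict:
    """Call describe() until it succeeds or attempts are exhausted.

    Between failures, wait a random fraction of an exponentially
    growing cap ("full jitter") so that many nodes starting at once
    do not retry in lockstep.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return describe()
        except Exception as exc:  # simplification: a real script would catch only throttling errors
            last_error = exc
            if attempt == attempts - 1:
                break
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    raise RuntimeError(
        f"describe-cluster failed after {attempts} attempts"
    ) from last_error
```

On a node, `describe` would wrap the real API call for the cluster name passed in the user data; the `sleep` and `rng` parameters exist only so the retry schedule can be exercised without waiting.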

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? With AL2, we needed only to provide the cluster name to bootstrap the nodes to the EKS cluster. This was a really nice feature because we could always infer what our cluster would be called even if it hadn’t been created yet. There is a random ID in the API server endpoint, so we don’t have a way to predict it if the cluster needs to be stood up fresh again. This means that with AL2023, the cluster and node group deployments have become coupled: the cluster must finish deploying before we can discern the properties of it that are required in the nodegroup user data. For smaller clusters, this coupling costs more than the removal of the bootstrap script gains.

I no longer see a method for decoupling the deployment of the cluster and self-managed node groups, since the nodes no longer fetch the cluster information themselves. Having the option to opt into using the bootstrap script would solve this problem, but we are open to alternative solutions. Such an option would speed up the initial deployment process when we need to stand up a new environment or cycle resources.

Are you currently working around this issue? Yes - we just updated to AL2023 and now gather the data via a describe-cluster call. We then embed the details into the CloudFormation template before deploying. While this works, it means that the node group deployment is now dependent on the cluster deployment, whereas it wasn't before. We can predict the cluster name and provide it before the cluster has been created; we can't do this with the API endpoint. Without an alternative, we will need to couple these two deployment pipelines.

Additional context We have our dev cluster running on the AL2023 images already - other than this one hitch, the feature improvements on the new AMI are great.

dims commented 1 week ago

@rns350 would be good to surface this in https://github.com/awslabs/amazon-eks-ami/issues as well.

rns350 commented 1 week ago

Hey @dims, thank you for the advice. I surfaced the feature request here - https://github.com/awslabs/amazon-eks-ami/issues/2029