awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution
2.46k stars 1.15k forks source link

feat(nodeadm): init with only cluster name #2029

Open rns350 opened 3 weeks ago

rns350 commented 3 weeks ago

What would you like to be added: On AL2 there was a script provided at /etc/eks/bootstrap.sh that could be called on node startup. Given the name of the cluster, it would fetch info about the cluster via the describe-cluster API that is needed to connect to the API server. In AL2023, this script was removed due to observed API throttling when many nodes tried to join the cluster at the same time, all calling the describe-cluster operation. Now, for self-managed node groups, this info needs to be provided in a NodeConfig manifest in the user data.

This is perfectly reasonable and a more efficient use of resources, especially for large clusters; however, for our cluster running only a few nodes, it adds another step to the bootstrapping process. In the current state with AL2023, we just end up making a describe cluster call ourselves before running out our node group cloud formation template to gather this information and embed it in the user data using sed commands. With AL2, we could deploy the nodegroup and cluster fresh with eachother, since all we needed to know about the cluster to deploy the nodegroups was its name - this can be predicted in advance. In the current state with AL2023, one of the required parameters is the API endpoint, which includes a random ID and cannot be predicted, so the nodegroup and cluster deployments are now coupled.

We'd like to have the option to leave this work to the Node, so that a bootstrap script can gather the details from the describe-cluster API and embed them into the NodeConfig yaml. This could be an opt-in feature for those who want it, and would once again decouple the cluster and self-managed nodegroup deployments.

Why is this needed: With AL2, we need only provide the cluster name to bootstrap the nodes to the eks cluster. This was a really nice feature because we could always infer what our cluster would be called even if it hadn’t been created yet. There is a random ID in the api-server endpoint, so we don’t have a way to predict It if the cluster needs to be stood up fresh again. This means that with AL2023, the cluster and node group deployments have become coupled - the cluster must finish deploying before we can discern properties of it required in the nodegroup user data. For smaller clusters, this is a larger detriment than the benefit given by the removal of bootstrapping.

I no longer see a method for decoupling the deployment of the cluster and self-managed node groups, since the nodes no longer fetch the cluster information themselves. Having the option to opt into using the bootstrap script would solve this problem, but we are open to alternative solutions. Having this option speeds up the initial deployment process if we need to stand up a new environment or cycle resources. We have our dev cluster running on the AL2023 images already - other than this one hitch, the feature improvements on the AL2023 AMI are great.

cartermckinnon commented 3 weeks ago

How much of a difference does this make in your deployment time?

This behavior was intentionally removed, because we don't believe it should be used in a production environment (and because many IaC setups can easily wire the cluster details to the nodes' user data -- you can do this with the return values of AWS::EKS::Cluster in a CFN template, for example). I'm not totally opposed to adding a feature gate for it, with a scary sounding name 😃 But I wouldn't expect this to be a significant optimization of deployment time for most users. We do not parallelize the control plane and node creation in our own tests, FWIW.

I'll chat with the team about this 👍

rns350 commented 3 weeks ago

Hey Carter! I understand that it won’t make a huge difference in deployment times. It would just fit better into our current deployment architecture. A couple other developers on our teams have asked for it. I think a feature gate would be a great idea and agree that it should not be default functionality.

Thanks for helping us with our request and let me know how that conversation goes.

rns350 commented 1 week ago

Hey @cartermckinnon, Just Checking back in on this to see if there's any more interest in having this added back from any conversations. Thank you again for the support.