[FEATURE]EKS NodeGroups scalability

srinivasreddych commented 4 months ago

Is your feature request related to a problem? Please describe. Related to modules/compute/eks

Describe the solution you'd like The current manifests deploy EKS managed node groups with a desired count of at least 1. Test whether the workloads can scale with 0 as the starting capacity, so that we can save costs for customers.

srinivasreddych commented 4 months ago

@a13zen Feel free to add more context about the ask here

a13zen commented 4 months ago

When I test with the desired/minimum count set to 0, the ASG terminates the nodes but then deploys 2 nodes again. This could be due to the base workload deployed by the EKS module.

srinivasreddych commented 4 months ago

Hey @a13zen I was able to test the workflow and here is an update:

When a node group (NG), for example a GPU one, is requested via the EKS module manifest, the user is expected to declare the labels, e.g. usage: gpu. The EKS module adds those labels to the NG and also adds them as tags as described here. Cluster Autoscaler (CA) requires these tags to scale from zero:

k8s.io/cluster-autoscaler/node-template/label/usage: gpu
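To make the label-to-tag mapping concrete, here is a minimal sketch in Python. It is not the EKS module's actual code; the function name `node_template_tags` is hypothetical, and only the tag prefix comes from the tag shown above.

```python
# Tag prefix taken from the tag shown above; Cluster Autoscaler reads tags with
# this prefix to learn which labels a node from an empty node group would carry.
LABEL_TAG_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def node_template_tags(labels: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: turn declared NG labels into node-template tags."""
    return {f"{LABEL_TAG_PREFIX}{name}": value for name, value in labels.items()}


if __name__ == "__main__":
    # The user declares `usage: gpu` in the manifest.
    for key, value in node_template_tags({"usage": "gpu"}).items():
        print(f"{key}: {value}")
```

Each declared label produces exactly one tag, so a NG with several labels would simply carry several such tags.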

Expectation: when a user launches a GPU pod/job (in this context), the CA will query the tags on the GPU NG and scale out accordingly so that the GPU pod/job can run. When the GPU NG is launched, it is expected behavior that aws-cni, kube-proxy and nvidia-device-plugin will be launched on it. Once the GPU pod has finished running, the CA will terminate the EC2 instance.
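The expectation above can be sketched roughly as follows. This is a simplified illustration only: the real CA also considers resources, taints and other constraints, and the names `template_labels`, `candidate_groups`, `gpu-ng` and `system-ng` are made up for this example.

```python
LABEL_TAG_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def template_labels(asg_tags: dict[str, str]) -> dict[str, str]:
    """Recover the labels a new node would have from the node-template tags."""
    return {
        key[len(LABEL_TAG_PREFIX):]: value
        for key, value in asg_tags.items()
        if key.startswith(LABEL_TAG_PREFIX)
    }


def candidate_groups(node_selector: dict[str, str],
                     groups: dict[str, dict[str, str]]) -> list[str]:
    """Return the node groups whose template labels satisfy the pod's nodeSelector."""
    return [
        name
        for name, tags in groups.items()
        if all(template_labels(tags).get(k) == v for k, v in node_selector.items())
    ]


if __name__ == "__main__":
    # Hypothetical node groups, both currently at zero nodes.
    groups = {
        "gpu-ng": {LABEL_TAG_PREFIX + "usage": "gpu"},
        "system-ng": {LABEL_TAG_PREFIX + "usage": "system"},
    }
    # A GPU pod that selects nodes labelled usage=gpu.
    print(candidate_groups({"usage": "gpu"}, groups))
```

Without the tags, an empty NG exposes no labels to match against, which is why the tags matter for scaling from zero.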

Having said the above, I am thinking of a design where I would refactor the EKS module to always launch a system NG with the m5.large instance type to accommodate drivers, system pods, etc., and let the user declare any further NGs as required. Thoughts?

a13zen commented 4 months ago

Yes, having a simple system NG with small instances could be a good middle ground for sure. Do we know if m5.large would be sufficient for the default services deployed by the EKS module?

srinivasreddych commented 4 months ago

From my understanding, m5.large (2 vCPU, 8 GiB RAM) should be sufficient, but depending on the number of plugins/drivers that we or the user deploy, the instance count should be >1. So starting with an instance count of 2 should be a safe bet. Thoughts?
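As a rough sanity check on the sizing, here is a back-of-the-envelope sketch of the pod limit per node. It assumes the default VPC CNI behavior (no prefix delegation) and the commonly cited ENI limits for m5.large (3 ENIs with 10 IPv4 addresses each); these figures are not from this thread and are worth verifying against the AWS documentation.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Pod limit per node under the default VPC CNI mode.

    Each ENI reserves one IP for itself; the +2 accounts for host-network pods.
    """
    return enis * (ips_per_eni - 1) + 2


# Assumed m5.large limits (please double-check): 3 ENIs, 10 IPv4 addresses per ENI.
M5_LARGE_ENIS = 3
M5_LARGE_IPS_PER_ENI = 10

if __name__ == "__main__":
    per_node = max_pods(M5_LARGE_ENIS, M5_LARGE_IPS_PER_ENI)
    instance_count = 2  # the starting count proposed above
    print(f"pods per m5.large: {per_node}")
    print(f"pods across {instance_count} nodes: {per_node * instance_count}")
```

Under these assumptions, the pod-count ceiling of the system NG is set by IP addresses as well as by CPU and memory, so it is worth comparing against the number of system pods the module deploys.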

awslabs / idf-modules

[FEATURE]EKS NodeGroups scalability #197