awslabs / idf-modules

Industry Data Framework (IDF) IAC modules repository
Apache License 2.0

[FEATURE] EKS NodeGroups scalability #197

Open srinivasreddych opened 4 months ago

srinivasreddych commented 4 months ago

Is your feature request related to a problem? Please describe.

Related to modules/compute/eks

Describe the solution you'd like

The current manifests deploy EKS managed node groups with a desired count of at least 1. Test whether workloads can scale out from a starting capacity of 0, so that customers do not pay for idle nodes.

srinivasreddych commented 4 months ago

@a13zen Feel free to add more context about the ask here

a13zen commented 4 months ago

Testing with the desired/minimum size set to 0 shows the ASG terminating its nodes but then deploying 2 nodes again. This could be due to the base workloads deployed by the EKS module.

srinivasreddych commented 4 months ago

Hey @a13zen I was able to test the workflow and here is an update:

k8s.io/cluster-autoscaler/node-template/label/usage: gpu

Expectation: when a user launches a GPU pod/job (in this context), the Cluster Autoscaler (CA) queries the tags on the GPU node group (NG) and scales it out so that the GPU pod/job can run. When a GPU NG node launches, it is expected behavior that aws-cni, kube-proxy, and nvidia-device-plugin are launched on it. Once the GPU pod has finished, CA terminates the EC2 instance.
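As a rough illustration of the tag convention quoted above, the sketch below builds the node group tags that CA can read to learn a node's labels while the group has zero instances. The helper name `node_template_tags` is hypothetical and is not part of the IDF EKS module; only the tag prefix and the `usage: gpu` label come from this thread.

```python
# Tag prefix taken from the tag quoted in this thread.
NODE_TEMPLATE_LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def node_template_tags(labels: dict) -> dict:
    """Map Kubernetes node labels to node-template tags (hypothetical helper).

    Each label key is appended to the prefix; the label value becomes
    the tag value.
    """
    return {f"{NODE_TEMPLATE_LABEL_PREFIX}{key}": value for key, value in labels.items()}


# Example: the GPU node group discussed above.
gpu_tags = node_template_tags({"usage": "gpu"})
for tag_key, tag_value in gpu_tags.items():
    print(f"{tag_key}: {tag_value}")
```

A pod that should trigger the scale-out would then select on the same label (for example, a node selector of `usage: gpu`), assuming the node group applies that label to its nodes as well.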

Having said that, I am thinking of a design where I would refactor the EKS module to always launch a system NG with the m5.large instance type to accommodate drivers, system pods, etc., and let users declare the NGs they need. Thoughts?

a13zen commented 4 months ago

Yes, having a simple system NG with small instances could be a good middle ground for sure. Do we know if m5.large would be sufficient for the default services deployed by the EKS module?

srinivasreddych commented 4 months ago

From my understanding, m5.large (2 vCPU, 8 GiB RAM) should be sufficient, but depending on the number of plugins/drivers that we or the user deploy, the instance count should be >1. So starting with an instance count of 2 should be a safe bet. Thoughts?
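The layout discussed in this thread could be modeled roughly as follows. This is only a sketch under assumptions: the `NodeGroup` class, the `build_layout` function, the system NG maximum size, and the `g4dn.xlarge` GPU instance type are all hypothetical and do not reflect the actual module manifest schema. Only the m5.large system NG with 2 instances and the scale-from-zero user NGs come from the discussion.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroup:
    """Hypothetical node group description, not the module's real schema."""
    name: str
    instance_type: str
    min_size: int
    desired_size: int
    max_size: int
    labels: dict = field(default_factory=dict)

    def validate(self) -> None:
        # Sizes must satisfy 0 <= min <= desired <= max.
        if not (0 <= self.min_size <= self.desired_size <= self.max_size):
            raise ValueError(f"invalid sizes for node group {self.name!r}")


def build_layout(user_groups: list) -> list:
    """Prepend an always-on system NG to the user-declared NGs."""
    system = NodeGroup(
        name="system",
        instance_type="m5.large",  # from the thread
        min_size=2,                # from the thread: start with 2 instances
        desired_size=2,
        max_size=3,                # assumption: small headroom, not from the thread
    )
    layout = [system, *user_groups]
    for group in layout:
        group.validate()
    return layout


# Example: a GPU NG that starts at zero capacity (instance type is illustrative).
gpu = NodeGroup("gpu", "g4dn.xlarge", min_size=0, desired_size=0, max_size=2,
                labels={"usage": "gpu"})
for ng in build_layout([gpu]):
    print(ng.name, ng.instance_type, ng.min_size, ng.desired_size, ng.max_size)
```

Keeping the system NG separate means drivers and system pods always have somewhere to run, while user NGs such as the GPU group only cost money when CA scales them out.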