Open srinivasreddych opened 4 months ago
@a13zen Feel free to add more context about the ask here
Testing by setting the desired/minimum capacity to 0 shows the ASG terminating its nodes but then deploying 2 nodes again. This could be due to the base workload deployed by the EKS module.
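For reference, a minimal sketch of the kind of setting tested, assuming an eksctl-style managed node group definition (the actual EKS module manifest schema may differ, and the cluster/node group names are hypothetical):

```yaml
# Sketch only: an eksctl-style managed node group allowed to scale to zero.
# The real module manifest schema may differ; names are illustrative.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster      # hypothetical cluster name
  region: us-east-1          # hypothetical region
managedNodeGroups:
  - name: gpu-ng             # hypothetical GPU node group name
    instanceType: g4dn.xlarge
    minSize: 0               # allow Cluster Autoscaler to scale the ASG to zero
    desiredCapacity: 0       # start with no nodes
    maxSize: 2
```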
Hey @a13zen I was able to test the workflow and here is an update:
The GPU NG is declared with a label such as `usage: gpu`. The EKS module will add those labels to the NG and also add them as tags as described here, which is required by Cluster Autoscaler (CA) to scale from zero: `k8s.io/cluster-autoscaler/node-template/label/usage: gpu`.
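A sketch of that label/tag pairing, again in eksctl-style syntax (the node group name is illustrative):

```yaml
# Sketch only: the label on the GPU node group plus the matching ASG tag
# that Cluster Autoscaler needs in order to scale the group up from zero.
managedNodeGroups:
  - name: gpu-ng                    # hypothetical node group name
    labels:
      usage: gpu
    tags:
      k8s.io/cluster-autoscaler/node-template/label/usage: gpu
```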
Expectation: when a user launches a GPU pod/job (for this context), the CA will query the tags on the GPU NG and scale out appropriately, thereby running the GPU pod/job. When the GPU NG is launched, it is expected behavior that `aws-cni`, `kube-proxy`, and `nvidia-device-plugin` will be launched. Once the GPU pod has finished running, the EC2 instance will be terminated by the CA.
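As an illustration of that expectation, a pod along the lines of the sketch below (image and names are hypothetical) would stay Pending until the CA scales the GPU NG out from zero:

```yaml
# Sketch only: a GPU pod that should trigger Cluster Autoscaler to scale
# the GPU node group out from zero. Image and names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    usage: gpu                # matches the NG label / CA node-template tag
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # served by nvidia-device-plugin once the node is up
```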
Having said the above, I am thinking of a design where I would refactor the EKS module to always launch a system NG with the `m5.large` instance type to accommodate drivers, system pods, etc., and let the user declare the required NGs per their requirements. Thoughts?
Yes, having a simple system NG with small instances could be a good middle ground for sure. Do we know if m5.large would be sufficient for the default services deployed by the EKS module?
From my understanding, `m5.large` (2 vCPU, 8 GiB RAM) should be sufficient, but depending on the number of plugins/drivers we or the user deploy, the instance count should be >1. So starting with an instance count of 2 should be a safe bet. Thoughts?
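A sketch of what that always-on system NG could look like, assuming the same eksctl-style syntax as above (the node group name is hypothetical):

```yaml
# Sketch only: an always-on system node group for drivers/system pods,
# alongside user-declared node groups that can scale from zero.
managedNodeGroups:
  - name: system-ng        # hypothetical name for the always-on system NG
    instanceType: m5.large
    minSize: 2             # keep >1 instance for the default add-ons
    desiredCapacity: 2
    maxSize: 3
```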
**Is your feature request related to a problem? Please describe.**
Related to `modules/compute/eks`.

**Describe the solution you'd like**
The current manifests deploy EKS Managed node groups with a desired count of at least 1. Test whether the workloads can scale with 0 as the starting capacity, so we can save $ for customers.