cloudposse / terraform-aws-eks-node-group

Terraform module to provision a fully managed AWS EKS Node Group

Autoscaling does not work when scaling from zero #109

Closed. ib-ak closed this issue 3 months ago.

ib-ak commented 2 years ago

Describe the Bug

The ASG created by the EKS managed node group is missing the cluster-autoscaler tags. Without these tags, autoscaling does not work when scaling from zero.

Expected Behavior

The ASG should be able to scale up from node count = 0.

Steps to Reproduce

Steps to reproduce the behavior:

  1. Create a node group with min count = desired count = 0, plus node labels and taints (a sketch follows this list).
  2. Schedule a pod with a matching node selector and tolerations for the taints.
  3. The pod remains unschedulable because autoscaling fails to add a node.
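For reference, here is a minimal sketch of the configuration in step 1, written against the underlying `aws_eks_node_group` resource rather than this module; the cluster, IAM role, and subnet references, and the label/taint values, are placeholders:

```hcl
# Minimal repro sketch (placeholder names throughout): a managed node group
# that starts at zero nodes and carries a label plus a taint that a workload
# has to match before it can be scheduled there.
resource "aws_eks_node_group" "repro" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "scale-from-zero-repro"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    min_size     = 0
    desired_size = 0
    max_size     = 2
  }

  labels = {
    workload = "batch"
  }

  taint {
    key    = "dedicated"
    value  = "batch"
    effect = "NO_SCHEDULE"
  }
}
```

A pod that selects `workload=batch` and tolerates `dedicated=batch:NoSchedule` then stays Pending: with zero nodes in the group and no node-template tags on the ASG, cluster-autoscaler has nothing to tell it that scaling this group up would satisfy the pod.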

Environment (please complete the following information):

NA

Additional Context

Possible solution: tag the ASG owned by the EKS node group after the fact.

Nuru commented 2 years ago

Closing because this functionality is not supported by AWS. See https://github.com/aws/containers-roadmap/issues/608

KhannaAbhinav commented 2 years ago

@Nuru would you consider using this as a workaround? https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_group_tag

You can take the tags_all published by the managed node group and apply them to the autoscaling group as tags.
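For illustration, a minimal sketch of that workaround, assuming a hypothetical `var.kubernetes_labels` input and a node group resource named `aws_eks_node_group.default`; cluster-autoscaler reads `k8s.io/cluster-autoscaler/node-template/...` tags on the ASG when deciding whether scaling a group up from zero would help a pending pod:

```hcl
locals {
  # Node-template tags that cluster-autoscaler uses for scale-from-zero
  # decisions; taints follow the same pattern under .../node-template/taint/.
  autoscaler_label_tags = {
    for key, value in var.kubernetes_labels :
    "k8s.io/cluster-autoscaler/node-template/label/${key}" => value
  }
}

resource "aws_autoscaling_group_tag" "cluster_autoscaler" {
  # Each aws_autoscaling_group_tag resource manages exactly one tag, so we
  # need one instance per tag.
  for_each = local.autoscaler_label_tags

  # EKS exposes the name of the ASG it created on the node group resource.
  autoscaling_group_name = aws_eks_node_group.default.resources[0].autoscaling_groups[0].name

  tag {
    key                 = each.key
    value               = each.value
    propagate_at_launch = false
  }
}
```

Note that `for_each` requires the tag keys to be known at plan time, which is exactly the limitation discussed in the next comment.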

Nuru commented 2 years ago

Reopening to consider implementing via aws_autoscaling_group_tag as described in https://github.com/aws/containers-roadmap/issues/608#issuecomment-1091628621

Nuru commented 1 year ago

@KhannaAbhinav I tried using the autoscaling_group_tag, but could not get it to work. The problem is that each resource manages only one tag, so you have to use count or for_each to create all the tags, which means the number of tags has to be known at "plan" time. No matter what tricks I tried, I could not get the number to be known at plan time when running our test suite. With that error, the entire module stops working, so at the moment it is not worth the risk.

I might later try something like always deploying 5 or 10 tags, or some computed number that is clearly going to be available at plan time, but it will probably be a long time before I get to it.
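A rough sketch of that idea, purely illustrative and with hypothetical names: reserve a fixed `count` of aws_autoscaling_group_tag resources so the number is always known at plan time, and let unused slots apply harmless placeholder tags.

```hcl
variable "autoscaler_tag_slots" {
  type        = number
  default     = 10
  description = "Fixed number of tag slots, known at plan time"
}

locals {
  wanted_tags = [
    for key, value in var.node_labels :
    { key = "k8s.io/cluster-autoscaler/node-template/label/${key}", value = value }
  ]

  # Pad the list with placeholders so every slot always has something to apply.
  padded_tags = concat(local.wanted_tags, [
    for i in range(var.autoscaler_tag_slots) :
    { key = "unused-tag-slot-${i}", value = "unused" }
  ])
}

resource "aws_autoscaling_group_tag" "slot" {
  # count is a constant, so the plan never depends on how many real tags exist.
  count = var.autoscaler_tag_slots

  autoscaling_group_name = aws_eks_node_group.default.resources[0].autoscaling_groups[0].name

  tag {
    key                 = local.padded_tags[count.index].key
    value               = local.padded_tags[count.index].value
    propagate_at_launch = false
  }
}
```

The obvious trade-offs: any tags beyond the slot count are silently dropped, and the ASG ends up carrying a few junk placeholder tags.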