cloudbees-oss / terraform-aws-cloudbees-ci-eks-addon

CloudBees CI Add-on for AWS EKS
https://registry.terraform.io/modules/cloudbees/cloudbees-ci-eks-addon/aws
MIT License

[Blueprints, 02-at-scale]: Team-b volume node affinity conflict after recovering from Hibernation #195

Open carlosrodlop opened 3 months ago

carlosrodlop commented 3 months ago

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration. The reproduction MUST be executable by running terraform init && terraform apply without any further changes.

If your request is for a new feature, please use the Feature request template.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if your state is stored remotely, which is hopefully the best practice you are already following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Reproduction Code [Required]

Steps to reproduce the behavior:

The behavior is random and does not occur on every recovery.

After recovering from Hibernation and re-provisioning team-b, the following error can be read from the Kubernetes events:

Normal   NotTriggerScaleUp  46s (x47 over 78m)    cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {dedicated: build-linux-l}, 1 node(s) had untolerated taint {dedicated: build-linux-xl}, 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {dedicated: build-windows}

It is not related to the issue explained in the article Autoscaling issue when provisioning controllers in Multi AZ Environment, because the storage class is already using WaitForFirstConsumer as its volume binding mode.
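For reference, a minimal sketch of a gp3 StorageClass with delayed volume binding, assuming it is managed through the Terraform Kubernetes provider (the resource name and parameters are illustrative, not the blueprint's actual definition):

```hcl
# Illustrative sketch only: a gp3 StorageClass with WaitForFirstConsumer binding.
# Resource name and parameters are assumptions, not taken from the blueprint.
resource "kubernetes_storage_class_v1" "gp3" {
  metadata {
    name = "gp3"
  }

  storage_provisioner = "ebs.csi.aws.com" # EBS CSI driver
  reclaim_policy      = "Delete"
  volume_binding_mode = "WaitForFirstConsumer"

  parameters = {
    type = "gp3"
  }
}
```

Note that WaitForFirstConsumer only delays binding for new volumes; once the controller's PV exists, it is pinned to a single AZ, so after Hibernation the pod can only be scheduled in that AZ, and if the autoscaler cannot add a node there, the volume node affinity conflict above appears.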

Expected behavior

Team-b recovers successfully from Hibernation

Actual behavior

Team-b does not recover successfully from Hibernation

Terminal Output Screenshot(s)

Additional context

Explore the option of using allowed topologies, as in https://github.com/jenkins-infra/aws/blob/09548bf41176b32fb91f1a3c915829032e4e8ec1/eks-public-cluster.tf#L247-L282, which is aligned with:
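For illustration, a minimal sketch of what an allowed-topologies gp3 StorageClass could look like, again assuming the Terraform Kubernetes provider; the zone value and resource name are assumptions, not part of the blueprint:

```hcl
# Illustrative sketch: restrict gp3 volumes to one AZ via allowedTopologies.
# The zone value below is an assumption; the blueprint would use its own AZ.
resource "kubernetes_storage_class_v1" "gp3_single_az" {
  metadata {
    name = "gp3-single-az"
  }

  storage_provisioner = "ebs.csi.aws.com"
  volume_binding_mode = "WaitForFirstConsumer"

  parameters = {
    type = "gp3"
  }

  allowed_topologies {
    match_label_expressions {
      key    = "topology.ebs.csi.aws.com/zone"
      values = ["us-east-1a"]
    }
  }
}
```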

carlosrodlop commented 1 month ago

It addresses https://github.com/cloudbees/terraform-aws-cloudbees-ci-eks-addon/issues/51

carlosrodlop commented 1 month ago

#51 alone is not enough; it also requires ensuring that node groups using gp3 deploy at least one node in the same AZ as defined in the topology constraint for the SC.

carlosrodlop commented 1 month ago

Idea: for the node group using gp3 as its storage class, split it into 2 different node groups, setting for one of them subnet_ids matching the gp3 SC topology. For example cb_apps = cb_apps_aza + cb_azbc (see the sketch below). Ref: https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group?tab=inputs
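A minimal sketch of that split, assuming the blueprint passes an eks_managed_node_groups map to the EKS module; subnet references, sizes, and instance types are placeholders:

```hcl
# Illustrative sketch of splitting cb_apps by AZ; all values are assumptions.
eks_managed_node_groups = {
  # Pinned to the single AZ allowed by the gp3 StorageClass topology constraint.
  cb_apps_aza = {
    subnet_ids     = [module.vpc.private_subnets[0]]
    min_size       = 1
    desired_size   = 1
    max_size       = 3
    instance_types = ["m5.xlarge"]
  }

  # Covers the remaining AZs for workloads that do not need gp3 volumes.
  cb_azbc = {
    subnet_ids     = slice(module.vpc.private_subnets, 1, 3)
    min_size       = 0
    desired_size   = 1
    max_size       = 3
    instance_types = ["m5.xlarge"]
  }
}
```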

carlosrodlop commented 2 weeks ago

[Node Group, single AZ] seems possible: https://tanmay-bhat.medium.com/how-to-migrate-a-node-group-from-multi-az-to-single-az-in-aws-eks-73b0dc553ed. But it would be more interesting to ensure the autoscaler does not delete nodes from a particular AZ (the AZ you constrain your EBS-backed controllers to) and to share the node pools between the EBS- and EFS-backed controllers: https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group?tab=inputs ==> placement_group_az
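One simple way to keep the autoscaler from removing the last node in the EBS-constrained AZ, sketched under the same assumptions as the split above (this relies on a non-zero min_size rather than the placement_group_az input referenced in the comment):

```hcl
# Illustrative sketch: a non-zero min_size means cluster-autoscaler never scales
# this group to zero, so a node is always available in the AZ the gp3 volumes live in.
cb_apps_aza = {
  subnet_ids   = [module.vpc.private_subnets[0]] # assumption: subnet in the gp3 AZ
  min_size     = 1
  desired_size = 1
  max_size     = 3
}
```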