carlosrodlop commented 3 months ago

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration. The reproduction MUST be executable by running terraform init && terraform apply without any further changes.

If your request is for a new feature, please use the Feature request template.

[X] I have searched the open/closed issues in this repository and my issue is not listed.
[X] I have checked that local tests are passing.
[X] If the issue is related to an AWS EKS add-on, I have searched the open/closed issues in the upstream aws-ia/terraform-aws-eks-blueprints and my issue is not listed.
[X] If the issue is related to an AWS EKS add-on, I have checked that upstream tests for the eks terraform blueprints add-on are passing.

⚠️ Note

Before you submit an issue, please perform the following first:

Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully a best practice you are already following!): rm -rf .terraform/
Re-initialize the project root to pull down modules: terraform init
Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Module version [Required]:
Terraform version:
Provider version(s):

Reproduction Code [Required]

Steps to reproduce the behaviour:

The behaviour is intermittent.

After recovering from Hibernation and re-provisioning team-b, the following event can be read from the Kubernetes events:

Normal   NotTriggerScaleUp  46s (x47 over 78m)    cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {dedicated: build-linux-l}, 1 node(s) had untolerated taint {dedicated: build-linux-xl}, 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {dedicated: build-windows}

It is not related to the issue explained in the article "Autoscaling issue when provisioning controllers in Multi AZ Environment", because the storage class already uses the `WaitForFirstConsumer` volume binding mode.

Expected behavior

Team-b recovers successfully from Hibernation

Actual behavior

Team-b does not recover successfully from Hibernation

Terminal Output Screenshot(s)

Additional context

Explore the option of using allowed topologies, as in https://github.com/jenkins-infra/aws/blob/09548bf41176b32fb91f1a3c915829032e4e8ec1/eks-public-cluster.tf#L247-L282, which is aligned with:

carlosrodlop commented 1 month ago

It addresses https://github.com/cloudbees/terraform-aws-cloudbees-ci-eks-addon/issues/51

carlosrodlop commented 1 month ago

#51 alone is not enough: node groups using `gp3` must also deploy at least one node in the same AZ as the one defined in the topology constraint for the storage class (SC).
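
For context, the topology constraint on the SC could be expressed as in the following minimal sketch. It assumes the `kubernetes` Terraform provider and the EBS CSI driver; the resource name `gp3_aza`, the class name `gp3-aza` and the zone `us-east-1a` are placeholders, not values taken from this blueprint.

```hcl
# Hypothetical gp3 storage class pinned to a single AZ.
# Resource name, class name and zone are assumptions for illustration only.
resource "kubernetes_storage_class_v1" "gp3_aza" {
  metadata {
    name = "gp3-aza"
  }

  storage_provisioner    = "ebs.csi.aws.com"
  reclaim_policy         = "Delete"
  allow_volume_expansion = true

  # Bind the volume only once a pod using it has been scheduled.
  volume_binding_mode = "WaitForFirstConsumer"

  parameters = {
    type = "gp3"
  }

  # New volumes may only be provisioned in the listed zone(s).
  allowed_topologies {
    match_label_expressions {
      key    = "topology.ebs.csi.aws.com/zone"
      values = ["us-east-1a"]
    }
  }
}
```

With a class like this, every new `gp3` volume lands in the listed zone, so a pod that mounts it can only be scheduled on a node in that zone. That is why the node groups also need to be able to place a node there.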

carlosrodlop commented 1 month ago

Idea: split each node group that uses `gp3` as its storage class into two node groups, one of which sets `subnet_ids` to the subnets matching the `gp3` SC topology. For example, cb_apps = cb_apps_aza + cb_azbc. Ref: https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group?tab=inputs
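
A rough sketch of that split, assuming the `eks_managed_node_groups` input of the `terraform-aws-modules/eks` module. The variables `subnet_id_aza` and `subnet_ids_azbc` and all sizes are hypothetical and would need to be adapted to the blueprint.

```hcl
# Hypothetical inputs: one private subnet in the SC's AZ, the rest in the other AZs.
variable "subnet_id_aza" {
  type        = string
  description = "Private subnet in the AZ used by the gp3 SC topology (assumption)."
}

variable "subnet_ids_azbc" {
  type        = list(string)
  description = "Private subnets in the remaining AZs (assumption)."
}

locals {
  # Intended to be passed to the eks_managed_node_groups input of the EKS module.
  eks_managed_node_groups = {
    # Pinned to the SC's AZ; min_size = 1 keeps at least one node there.
    cb_apps_aza = {
      subnet_ids   = [var.subnet_id_aza]
      min_size     = 1
      max_size     = 3
      desired_size = 1
    }
    # Covers the other AZs and may scale down freely.
    cb_azbc = {
      subnet_ids   = var.subnet_ids_azbc
      min_size     = 0
      max_size     = 3
      desired_size = 1
    }
  }
}
```

Keeping `min_size = 1` on the single-AZ group is one way to ensure the autoscaler always has a node group it can scale in the zone where the EBS volumes live.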

carlosrodlop commented 2 weeks ago

A single-AZ node group seems possible: https://tanmay-bhat.medium.com/how-to-migrate-a-node-group-from-multi-az-to-single-az-in-aws-eks-73b0dc553ed. However, it would be more interesting to ensure that the autoscaler does not delete nodes from a particular AZ (the AZ to which the EBS controllers are constrained) and to share the node pools between EBS and EFS controllers: https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group?tab=inputs ==> placement_group_az

cloudbees-oss / terraform-aws-cloudbees-ci-eks-addon

[Blueprints, 02-at-scale]: Team-b volume node affinity conflict after recovering from Hibernation #195
