cds-snc / notification-planning-core

Project planning for GC Notify Core Team

AWS Private Subnets are too small for EKS cluster #322

Open ben851 opened 2 months ago

ben851 commented 2 months ago

Description

As a developer of Notify, I would like our system to be able to accommodate scaling up in the future so that we can grow without having to rearchitect our infrastructure.

Currently the private subnets for the EKS nodes are /24s, which gives only 256 addresses each (251 usable, since AWS reserves five per subnet). We have received a warning (below) from AWS stating that we are running out of IPs in production and that there may be service interruptions when they apply patches.
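
To see how close each subnet is to exhaustion, something like the AWS CLI call below works; the Name tag filter is an assumption about how our private subnets are tagged, so adjust as needed. The FreeIPs column is the subnet's AvailableIpAddressCount, which is what the warning below is about.

$ aws ec2 describe-subnets --filters "Name=tag:Name,Values=*private*" --query 'Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' --output table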

WHY are we building?

We received a warning from AWS that we are running out of IPs in production

WHAT are we building?

  1. Create 3 new subnets with a /19 prefix, giving 8,192 addresses each (8,187 usable after AWS's five reserved addresses)
  2. Create a new EKS node group that deploys to these new subnets
  3. kubectl cordon/drain the old nodes (see the sketch after this list)
  4. Delete the old nodes
  5. Delete the old subnets
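
A rough sketch of steps 3 and 4, assuming EKS managed node groups; the node group and node names are placeholders, and the drain flags may differ slightly by kubectl version.

$ kubectl get nodes -l eks.amazonaws.com/nodegroup=<old-node-group-name>
$ kubectl cordon <old-node-name>
$ kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data

cordon stops new pods from being scheduled on a node, and drain evicts the pods already running there so they reschedule onto the new node group before the old nodes and subnets are deleted (via notification-terraform in our case).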

VALUE created by our solution

Increased reliability when AWS patches the cluster, and headroom to scale the system further.

Acceptance Criteria

QA Steps

Appendix

Amazon EKS detected cluster health issues in your AWS account 296255494825.

The following is a list of affected clusters with their cluster ARNs, cluster health status and corresponding cluster health issue(s): arn:aws:eks:ca-central-1:296255494825:cluster/notification-canada-ca-production-eks-cluster : IMPAIRED : Not Enough Free IP Addresses In Subnet.

The health of an EKS cluster is a shared responsibility between AWS and customers. You must resolve these issues to maintain operational stability for your EKS cluster(s). Cluster health issues can prevent Amazon EKS from patching your clusters or prevent you from upgrading to newer Kubernetes versions.

Starting on 2024-04-15, Amazon EKS will patch clusters to the latest supported platform version [1]. Clusters that are unstable due to outstanding health issues may experience loss in connectivity between the Kubernetes control plane instances and worker nodes where your workload runs. To avoid this, we recommend that you resolve outstanding cluster health issues [2] before this date.

You can also view your affected clusters in the 'Affected resources' tab in your AWS Health Dashboard or by using the DescribeCluster API [3].

[1] https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html
[2] https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#cluster-health-status
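
For reference, the reported health issue can also be pulled from the CLI; this sketch assumes the health field in the DescribeCluster response and uses our production cluster name from the notice above.

$ aws eks describe-cluster --name notification-canada-ca-production-eks-cluster --query 'cluster.health.issues'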

ben851 commented 2 months ago

PR is ready

https://github.com/cds-snc/notification-terraform/pull/1240

ben851 commented 2 months ago

Merged to staging

ben851 commented 2 months ago

Merged to prod and migrated the workloads to the new nodes. Need to submit a PR for old node removal.

ben851 commented 2 months ago

New PR created https://github.com/cds-snc/notification-terraform/pull/1245

ben851 commented 2 months ago

New nodes were created in production.

Old nodes were deleted in staging this morning; we'll do a release to complete this in prod today.

sastels commented 2 months ago

node groups as expected:

$ aws eks list-nodegroups --cluster-name notification-canada-ca-production-eks-cluster

{
    "nodegroups": [
        "notification-canada-ca-production-eks-primary-node-group-k8s"
    ]
}

sastels commented 2 months ago

Subnets verified with aws ec2 describe-subnets.
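
Something along these lines (the VPC ID is a placeholder) should confirm the new /19 CIDRs and their free IP counts:

$ aws ec2 describe-subnets --filters "Name=vpc-id,Values=<production-vpc-id>" --query 'Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' --output table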