cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Improve node auto-scaling for Kubernetes #184

Open jimleroyer opened 1 year ago

jimleroyer commented 1 year ago

Description

As a developer/operator of GC Notify, I would like the system to be able to scale Kubernetes nodes based on load, so that we are not constantly running the maximum number of nodes while they are mostly idle.

WHY are we building?

We are pushing changes to Notify that will increase our sending rate to meet OKRs. To accommodate this, we must increase the number of nodes available in Kubernetes. Since these nodes are only used during peak periods, they sit mostly idle, which increases costs with no additional benefit. It would be good to be able to scale these nodes on demand to maximize cost efficiency.

WHAT are we building?

There are two main ways to autoscale nodes in EKS: the built-in Kubernetes approach (the Cluster Autoscaler) or Karpenter. Karpenter is more flexible and lets us take advantage of Spot pricing on AWS to further improve cost efficiency.

We will install and configure Karpenter in Notify's EKS cluster.
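
For illustration only, the kind of configuration involved looks roughly like the sketch below: a Karpenter NodePool that allows Spot capacity with an on-demand fallback. This assumes the karpenter.sh/v1beta1 NodePool API; the names and limits are placeholders, not Notify's actual manifests.

```yaml
# Sketch only: illustrative Karpenter NodePool allowing Spot capacity.
# Assumes the karpenter.sh/v1beta1 API; names and limits are placeholders.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # allow Spot, fall back to on-demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: default                     # assumes an EC2NodeClass named "default"
  limits:
    cpu: "64"                             # cap the total CPU Karpenter may provision
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 24h                      # recycle nodes regularly
```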

VALUE created by our solution

Acceptance Criteria

QA Steps

sastels commented 1 year ago

Have a preliminary PR to go into staging

ben851 commented 1 year ago

Deployed Karpenter in staging; had some issues with the initial deploy with Kustomize. Need to go back to the scratch account to improve the install experience.

ben851 commented 1 year ago

The following command must be run in prod before merging karpenter to production: aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

ben851 commented 1 year ago

Above is no longer true, I added the correct TF resources into common.

sastels commented 1 year ago

deployed in staging and working! will continue to test with the scaling work.

jimleroyer commented 1 year ago

Ben is optimizing the scale-down configuration. The Celery worker SIGTERM issue might be fixed as well; we'll confirm with later tests.
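
For context, graceful scale-down of Celery workers generally comes down to giving pods time to finish in-flight tasks after SIGTERM before the node is reclaimed. A minimal sketch of that idea, with hypothetical names and values (not Notify's actual deployment):

```yaml
# Sketch only: give Celery workers a drain window when Karpenter consolidates nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      terminationGracePeriodSeconds: 300   # let in-flight tasks finish before SIGKILL
      containers:
        - name: celery
          image: celery-worker:latest      # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]   # small buffer before SIGTERM is sent
```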

ben851 commented 1 year ago

Karpenter is working "OK" with the non-reconciliation timeout. It is not as efficient as it could be but it's good enough for V1. Will aim for a release this week.

ben851 commented 1 year ago

Just a thought - verify what's up with staging before prod release

ben851 commented 12 months ago

Put in a PR to address the 502s in staging. This will not affect the Karpenter-only prod release, since it is the k8s API that is emitting these warnings.

Karpenter can be released to prod today for testing and verification

ben851 commented 12 months ago

Ben will update the ADR on how to stabilize the deployments while using ephemeral nodes.

sastels commented 11 months ago

PR to fix celery socket errors https://github.com/cds-snc/notification-api/pull/1996

sastels commented 11 months ago

ready for a (hopefully!) final test today

jimleroyer commented 11 months ago

We're still getting 502s in staging, so we need to look into it. The pod disruption budget configuration for the API doesn't seem to hold under Karpenter.
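
For reference, a pod disruption budget is what limits how many API pods can be evicted at once during voluntary disruptions like Karpenter consolidation; a minimal sketch, with hypothetical names and thresholds:

```yaml
# Sketch only: illustrative PodDisruptionBudget for the API pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: notify-api-pdb           # hypothetical name
spec:
  minAvailable: 2                # keep at least 2 API pods up during voluntary evictions
  selector:
    matchLabels:
      app: notify-api            # assumes the API pods carry this label
```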

ben851 commented 11 months ago

Going to remove the API from Spot instances. Will attempt a test in staging where we disable CloudWatch and restart Celery to verify whether or not the Celery cwagent init script works.
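
Keeping a workload off Spot capacity is typically done with a node selector (or required node affinity) on the capacity-type label that Karpenter puts on its nodes; a hedged sketch with hypothetical names:

```yaml
# Sketch only: pin a workload to on-demand capacity via Karpenter's capacity-type label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notify-api               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: notify-api
  template:
    metadata:
      labels:
        app: notify-api
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand   # avoid Spot nodes for the API
      containers:
        - name: api
          image: notify-api:latest              # placeholder image
```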

ben851 commented 11 months ago

Staging test was successful. The celery pods did not spin up until CWAgent was ready. Created a PR to re-enable celery on karpenter in production.

ben851 commented 11 months ago

Found a scenario in prod where cwagent was not spinning up because the node had insufficient CPU. The celery pods were reporting that they were ready, but they weren't, because they were stuck waiting for cwagent. Need to add a couple patches to fix this.

ben851 commented 11 months ago

Implemented a fix in staging to prioritize cwagent, and also tuned the liveness and readiness probes. This is working but may be a bit too aggressive; I've opened a new PR to increase the delay times.
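
Prioritizing an agent like cwagent is usually done with a PriorityClass so it is scheduled ahead of application pods on a freshly provisioned node; this is a sketch under that assumption, not the actual patch. The probe tuning mentioned here would normally be initialDelaySeconds / periodSeconds adjustments on the Celery containers themselves.

```yaml
# Sketch only: a PriorityClass that could be assigned to the cwagent DaemonSet
# (via priorityClassName) so it schedules before application pods on new nodes.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-critical      # hypothetical name
value: 1000000                   # higher values are scheduled (and preempt) first
globalDefault: false
description: "Ensure the CloudWatch agent starts before application pods."
```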

ben851 commented 11 months ago

Reverted the probes because they were not working. Will move this to review to monitor the system for a week.

ben851 commented 11 months ago

Steve ran some tests with the visibility timeout in staging, reducing the timeout to 26 seconds. Will look into releasing this soon.

sastels commented 11 months ago

CWAgent seems happier now.

sastels commented 11 months ago

almost ready to go. waiting on Steve to merge his PR

sastels commented 11 months ago

CWAgent OOMing :/ Might have a fix? Karpenter spot instances restart every day, so if CWAgent lasts for 24 hours we should be good.
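
The usual mitigation for an agent that keeps OOMing is to raise its memory request and limit; a hedged sketch of what that might look like (DaemonSet name, image, and values are illustrative, not the actual fix):

```yaml
# Sketch only: illustrative memory request/limit bump for the CloudWatch agent.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cwagent                  # hypothetical name
spec:
  selector:
    matchLabels:
      app: cwagent
  template:
    metadata:
      labels:
        app: cwagent
    spec:
      containers:
        - name: cwagent
          image: amazon/cloudwatch-agent:latest   # placeholder tag
          resources:
            requests:
              cpu: 200m
              memory: 200Mi
            limits:
              memory: 400Mi      # raise the limit so the agent is not OOM-killed
```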

ben851 commented 11 months ago

PR for alarm created.

ben851 commented 10 months ago

I've been looking into alerts and alarms based on this, and there doesn't seem to be a great way to create an alarm based off of Karpenter logs. The error about being unable to provision nodes seems to occur daily, but resolves so quickly that it has no real effect.

Looking further, we already have alarms for when Celery has unavailable replicas, which will trigger if Karpenter is having issues. This should be sufficient.

ben851 commented 10 months ago

I'm gonna create an alarm for when karpenter itself is not running

ben851 commented 10 months ago

Alarm created, moving to review