jimleroyer opened 1 year ago
Have a preliminary PR to go into staging
Deployed karpenter in staging; had some issues with the initial deploy w/ kustomize. Need to go back to the scratch account to improve the install experience.
The following command must be run in prod before merging karpenter to production: aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
Above is no longer true, I added the correct TF resources into common.
Deployed in staging and working! Will continue to test with the scaling work.
Ben is optimizing the scale-down configuration. The Celery worker SIGTERM issue might be fixed as well; we'll confirm with later tests.
Karpenter is working "OK" with the non-reconciliation timeout. It is not as efficient as it could be but it's good enough for V1. Will aim for a release this week.
Just a thought - verify what's up with staging before prod release
Put in a PR to address the 502s in staging. This will not affect the karpenter-only prod release, since it is the k8s API that is emitting these warnings.
Karpenter can be released to prod today for testing and verification
Ben will update the ADR on how to stabilize the deployments while using ephemeral nodes.
PR to fix celery socket errors https://github.com/cds-snc/notification-api/pull/1996
Ready for a (hopefully!) final test today
We're still getting 502s in staging so we need to look into it. The pod disruption budget configuration for the API in Karpenter doesn't seem to hold.
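For context, a PodDisruptionBudget is the usual way to limit voluntary evictions while Karpenter consolidates nodes. A minimal sketch is below; the namespace, name, and label selector are assumptions, not Notify's actual values.

```shell
# Hedged sketch: a PodDisruptionBudget that keeps a minimum number of API pods
# available during Karpenter node consolidation. Namespace and labels are
# assumed placeholders; substitute the real values from the kustomize overlays.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: notification-canada-ca
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
EOF
```

Note that a PDB only limits *voluntary* disruptions; spot interruptions and node pressure can still evict pods, which may be why the budget "doesn't seem to hold".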
Going to remove API from spot instances. Going to attempt a test in staging where we disable cloudwatch and restart celery to verify whether or not the celery cwagent init script works.
Staging test was successful. The celery pods did not spin up until CWAgent was ready. Created a PR to re-enable celery on karpenter in production
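The "celery waits for CWAgent" pattern described above is commonly done with an initContainer that blocks until the agent answers. A minimal sketch, where the agent's service name and port are assumptions (Notify's actual init script may differ):

```shell
# Hedged sketch: an initContainer fragment that blocks celery startup until the
# CloudWatch agent is reachable. The service DNS name and port are assumptions
# and depend on how the agent is deployed in the cluster.
cat <<'EOF'
initContainers:
  - name: wait-for-cwagent
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        # Poll the CWAgent endpoint until it accepts TCP connections
        until nc -z cloudwatch-agent.amazon-cloudwatch.svc.cluster.local 25888; do
          echo "waiting for cwagent..."
          sleep 2
        done
EOF
```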
Found a scenario in prod where cwagent was not spinning up because the node had insufficient CPU. The celery pods were reporting that they were ready, but they weren't, because they were stuck waiting for cwagent. Need to add a couple patches to fix this.
Implemented a fix in staging to prioritize the cwagent, and also tuned the liveness and readiness probes in staging. This is working but may be a bit too aggressive. I've opened a new PR to increase the delay times.
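Probe tuning of this kind usually means longer initial delays so pods landing on freshly provisioned spot nodes aren't killed while warming up. A minimal sketch; all values and the health path are illustrative assumptions, not the ones in the PR:

```shell
# Hedged sketch: readiness/liveness probes with longer initial delays. The
# endpoint path, port, and timings are assumed values for illustration only.
cat <<'EOF'
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 20
  failureThreshold: 5
EOF
```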
Reverted the probes because they were not working. Will move this to review to monitor the system for a week.
Steve ran some tests with the visibility timeout in staging, reducing the timeout to 26 seconds. Will look into releasing this soon.
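For SQS-backed Celery queues, the visibility timeout is set per queue. A sketch of what that change looks like; the queue URL is a placeholder, and the timeout can equally be driven from Celery's broker transport options:

```shell
# Hedged sketch: set an SQS queue's visibility timeout to 26 seconds.
# QUEUE_URL is a placeholder, not one of Notify's real queues.
QUEUE_URL="https://sqs.ca-central-1.amazonaws.com/123456789012/example-queue"
aws sqs set-queue-attributes \
  --queue-url "$QUEUE_URL" \
  --attributes VisibilityTimeout=26
```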
CWAgent seems happier now.
almost ready to go. waiting on Steve to merge his PR
CWAgent OOMing :/ Might have a fix? Karpenter spot instances restart every day, so if CWAgent lasts for 24 hours we should be good.
PR for alarm created
I've been looking into alerts and alarms based on this, and there doesn't seem to be a great way to create an alarm based off of karpenter logs. The error about being unable to provision nodes seems to occur daily, but resolves in such a short amount of time that it doesn't have any effect.
Looking further, we already have alarms for celery having unavailable replicas, which will trigger if karpenter is having issues. This should be sufficient.
I'm gonna create an alarm for when karpenter itself is not running
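A sketch of the kind of alarm that could catch the karpenter controller itself being down, assuming pod-level metrics are already shipped to CloudWatch. The metric namespace, name, and dimension values below are assumptions; they depend entirely on how Container Insights / the CloudWatch agent is configured in the cluster.

```shell
# Hedged sketch: alarm when no karpenter pods are running. Metric namespace,
# name, and dimensions are assumed and must match what the cluster emits.
aws cloudwatch put-metric-alarm \
  --alarm-name karpenter-not-running \
  --namespace ContainerInsights \
  --metric-name service_number_of_running_pods \
  --dimensions Name=ClusterName,Value=example-cluster Name=Service,Value=karpenter \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching
```

Treating missing data as breaching is what makes this catch the "karpenter is gone entirely" case, not just a degraded one.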
Alarm created, moving to review
Description
As a developer/operator of GC Notify, I would like the system to be able to scale Kubernetes nodes based on load so that we are not constantly running the maximum number of nodes while they sit mostly idle.
WHY are we building?
We are pushing changes to Notify that will increase our sending rate to meet OKRs. To accommodate this, we must increase the number of nodes available in Kubernetes. Since we only use these nodes during peak periods, they are wasted for the most part. This increases costs with no additional benefit, and it would be good to be able to scale these nodes on demand to maximize cost efficiency.
WHAT are we building?
There are two methods of autoscaling EKS: the built-in Kubernetes functionality or Karpenter. Karpenter is more flexible and allows us to take advantage of spot pricing on Amazon to further maximize cost efficiency.
We will install and configure Karpenter in Notify's EKS cluster.
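To actually use spot capacity, Karpenter needs a provisioning config that allows it. A minimal sketch of such a resource; the API version varies across Karpenter releases (older versions use a `Provisioner` instead of a `NodePool`), and the values here are assumptions, not Notify's final config:

```shell
# Hedged sketch: a Karpenter NodePool allowing spot (with on-demand fallback).
# API version and requirement values are assumptions and differ between
# Karpenter releases; a real config also needs a node class reference.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
EOF
```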
VALUE created by our solution
Acceptance Criteria
QA Steps