ben851 opened this issue 7 months ago
Debugging staging - have an intermittent issue with document download pods not being scheduled during nightly perf test. Will continue on this today.
We reworked resource allocations on the primary nodes to solve an issue where DD-API was being restarted due to a lack of resources.
PR opened to move this to prod, but staging has been super unstable (pretty sure it's unrelated to these changes), so we're holding off until things calm down.
Merged to prod. Karpenter won't spin up nodes while we're at 8 primary nodes. We will let the system run as-is over the weekend and then look at a node reduction on Monday to fully implement this.
@ben851 can we move this to QA?
Not quite - we need one more release on Monday to lower the node count before we can QA it
This has been done, Ben to come up with QA steps
We had a scenario yesterday in staging where all email pods were on spot instances, even though they were configured not to be. I have created a new solution with double deployments. Merged to staging; Steve will update the dashboards, and Ben has a PR incoming to update the alarms (in dev testing right now).
This is in staging, and Steve ran rollercoaster tests. The environment was stable even though the scalable pods went down to 0 during node switches.
This should be released to prod today; it will require coordination with OL.
Steve will QA
Description
As a developer/operator of notify, I want the celery batch processing to autoscale without causing any outages.
WHY are we building?
When celery is deployed to spot instances only, it occasionally drops to 0 pods for a minute or two, causing momentary outages.
WHAT are we building?
Create primary and scalable deployments in the k8s manifests for celery. The primary deployment does not scale and always runs on ON_DEMAND nodes; the scalable deployment runs only on spot instances.
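A minimal sketch of what the two deployments could look like, assuming a Karpenter-managed cluster (the deployment names, labels, replica counts, and image are placeholders, not the actual Notify manifests; `karpenter.sh/capacity-type` is the standard label Karpenter applies to its nodes):

```yaml
# Primary deployment: fixed replica count, pinned to on-demand nodes,
# so celery never drops to 0 pods during spot interruptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-primary        # hypothetical name
spec:
  replicas: 2                 # fixed; no HPA targets this deployment
  selector:
    matchLabels:
      app: celery
      tier: primary
  template:
    metadata:
      labels:
        app: celery
        tier: primary
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      containers:
        - name: celery
          image: notify/celery:latest   # placeholder image
---
# Scalable deployment: autoscaled (e.g. by an HPA), runs only on
# cheaper spot nodes and is allowed to scale to 0 during node switches.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-scalable       # hypothetical name
spec:
  selector:
    matchLabels:
      app: celery
      tier: scalable
  template:
    metadata:
      labels:
        app: celery
        tier: scalable
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: celery
          image: notify/celery:latest   # placeholder image
```

Splitting the workload this way means spot reclamation can only take out the scalable tier, while the primary tier keeps a floor of capacity on on-demand nodes.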
VALUE created by our solution
Increase notify performance and reliability
Acceptance Criteria
QA Steps