cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Split celery between spot and primary instances #232

Open ben851 opened 7 months ago

ben851 commented 7 months ago

Description

As a developer/operator of Notify, I want celery batch processing to autoscale without causing any outages.

WHY are we building?

When celery is deployed to spot instances only, it occasionally drops to 0 pods for a minute or two, causing momentary outages.

WHAT are we building?

Create primary and scalable deployments in the k8s manifests for celery. Primary deployments do not scale and always run on the ON_DEMAND nodes. Scalable deployments run only on spot instances.
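A minimal sketch of the primary/scalable split, written as Python that builds the two Deployment manifests. Everything specific here is an assumption for illustration and not taken from the actual Notify manifests: the deployment names, the image, the replica counts, and the use of `nodeSelector` with the `eks.amazonaws.com/capacityType` and `karpenter.sh/capacity-type` node labels.

```python
import json


def celery_deployment(name, capacity_label, capacity_value, replicas):
    """Build a minimal Deployment manifest pinned to one node capacity type.

    All values passed in below are hypothetical; only the overall shape
    (two deployments, one per capacity type) comes from the issue text.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Restrict scheduling to nodes carrying this label.
                    "nodeSelector": {capacity_label: capacity_value},
                    "containers": [
                        # Placeholder image name, not the real one.
                        {"name": "celery", "image": "example/celery:latest"}
                    ],
                },
            },
        },
    }


# Primary: fixed replica count, on-demand nodes only (assumed label/value).
primary = celery_deployment(
    "celery-primary", "eks.amazonaws.com/capacityType", "ON_DEMAND", replicas=2
)

# Scalable: spot nodes only (assumed label/value); an autoscaler would
# target only this deployment and adjust its replica count.
scalable = celery_deployment(
    "celery-scalable", "karpenter.sh/capacity-type", "spot", replicas=1
)

if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps([primary, scalable], indent=2))
```

With this shape, losing every spot node removes only the scalable pods; the primary pods keep consuming from the queues.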

VALUE created by our solution

Increase Notify performance and reliability.

Acceptance Criteria

QA Steps

ben851 commented 7 months ago

Debugging staging - have an intermittent issue with document download pods not being scheduled during nightly perf test. Will continue on this today.

jimleroyer commented 7 months ago

We reworked resources on primary nodes to solve an issue with DD-API that was getting rebooted (due to a lack of resources).

ben851 commented 7 months ago

PR opened to move this to prod but staging has been super unstable (pretty sure unrelated to these changes) so we're holding off until things calm down.

ben851 commented 7 months ago

Merged to prod. Karpenter won't spin up nodes while we're at 8 primary nodes. We will let the system run as is over the weekend, then look at a node reduction on Monday to fully implement this.

sastels commented 7 months ago

@ben851 can we move this to QA?

ben851 commented 7 months ago

Not quite - we need one more release on Monday to lower the nodes before we can QA it

ben851 commented 7 months ago

This has been done. Ben to come up with QA steps.

ben851 commented 6 months ago

We had a scenario yesterday in staging where all email pods were on spot instances, even though we had configured them not to be. I have created a new solution with double deployments. It is merged to staging. Steve will update the dashboards, and Ben has a PR incoming to update the alarms (in dev testing right now).

ben851 commented 6 months ago

This is in staging, and Steve ran rollercoaster tests. The environment was stable even though scalable pods went down to 0 during node switches.

This should be released to prod today; it will require coordination with OL.

sastels commented 6 months ago

Steve will QA