jimleroyer opened 11 months ago
Opened PR for the TF part: https://github.com/cds-snc/notification-terraform/pull/976
Yesterday we merged the TF part into the staging environment. Today, I will look at the application code to lower the retry period.
I'm going to run a perf test against staging to test how this behaves.
I rate limited myself during testing last week. We should re-run this week.
Note: This is email related but the same results would likely hold for SMS
Tested with:
After bulk job done and pods scaled down, got 300+ second delays on a few priority emails
Ran a second time with similar results.
Then adjusted the retry to 26 sec to match the SMS change from last week.
Ran the test 2 times. Both times had no large delays during scale down (once a 10 sec delay, once a 1 sec delay)
Summary: did 4 tests of restarting the send-sms deployment while running the priority soak test:
- timeout 310: 265 sec delay
- timeout 26: no delay
- timeout 26: 15 sec delay
- timeout 310: 295 sec delay
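The pattern in these numbers matches SQS visibility-timeout semantics: when a pod is killed mid-send without deleting the message, SQS only re-delivers it once the visibility timeout expires. A minimal sketch of that relationship (an illustration with assumed timings, not the actual infra code):

```python
# Sketch: delay before SQS re-delivers a message whose consumer died.
# If the worker is killed `killed_after_s` seconds into processing and
# never deletes the message, it becomes visible again only when the
# visibility timeout expires.
def redelivery_delay(visibility_timeout_s: float, killed_after_s: float) -> float:
    return max(visibility_timeout_s - killed_after_s, 0.0)

# With the old 310 s timeout, a pod killed ~15 s into a send leaves a
# ~295 s gap, in the range of the 265-295 s delays observed above.
print(redelivery_delay(310, 15))  # 295.0
# With the 26 s timeout the worst-case redelivery gap stays small.
print(redelivery_delay(26, 15))   # 11.0
```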
PR to set to 26 sec on prod https://github.com/cds-snc/notification-terraform/pull/986
So on staging, with the 26 sec timeout, we set the SMS pods to min 3 / max 20, with a scaling threshold of 25% CPU
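For reference, the min 3 / max 20 / 25% CPU configuration feeds into the standard Kubernetes HPA scaling rule, which can be sketched as follows (assuming plain CPU-utilization autoscaling; the exact settings live in the manifests PR):

```python
import math

# Sketch of the standard Kubernetes HPA formula (assumption: plain
# CPU-utilization autoscaling as described above):
#   desiredReplicas = ceil(currentReplicas * currentUtil / targetUtil),
# clamped to the configured min/max replica counts.
def desired_replicas(current: int, cpu_pct: float,
                     target_pct: float = 25.0,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    want = math.ceil(current * cpu_pct / target_pct)
    return min(max(want, min_replicas), max_replicas)

print(desired_replicas(3, 25))    # 3: at target, no change
print(desired_replicas(3, 75))    # 9: 3x over target triples the pods
print(desired_replicas(10, 200))  # 20: capped at the configured max
```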
will release this configuration to prod tomorrow: https://github.com/cds-snc/notification-manifests/pull/2113
Jimmy made existing tests work for the application code change and today will expand on the tests around the lowered retry time.
Will run a bigger test (30min+ bulk send) on staging before release
Bigger test went as expected.
new SMS pod scaling deployed and tested in prod :tada:
Jimmy's PR ready for review. Steve to review.
Reviewed, LGTM
I reconfigured the devcontainer to run exclusively with VS Code, to fix local issues with admin and to help with the ongoing testing of the reduced retry period in the Celery task for SMS sending.
After some local testing showed my changes wouldn't work, I rewrote my PR to make it work. Some tests are not passing after these new changes, so I am trying to fix those at the moment.
We deployed the change to lower Celery task retries for SMS high priority yesterday in staging.
I also made a CW query to monitor how the retries are performing: https://gcdigital.slack.com/archives/C012W5K734Y/p1699365094164439
We need to turn the feature flag on in production to do the final QA. @jimleroyer to confirm/action.
I turned the feature flag on in production for the Celery task set to 25 seconds. It is merged and meant to be deployed today. We'll monitor the feature in production for a few days. I need to carry the CW LogInsights query from staging to production for monitoring.
It went to production last week with the feature flag enabled, for the Celery retry task part (not the SMS SQS visibility timeout part).
Next step is to monitor the timeout via a query we wrote in staging; I put it in TF to move it to production next. The PR also consolidates all Celery related queries from both the staging and production environments together: https://github.com/cds-snc/notification-terraform/pull/1035
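The actual query is in the Slack thread and Terraform PR linked above; a hypothetical Logs Insights query of this shape would scan the Celery worker logs for the 25 second retry messages (Celery logs retries as "Retry in Ns"):

```python
# Hypothetical CloudWatch Logs Insights query (an illustrative sketch;
# the real query lives in the Slack thread / Terraform PR above).
# It looks for Celery's "Retry in 25s" log lines from the SMS workers.
RETRY_QUERY = "\n".join([
    "fields @timestamp, @message",
    "| filter @message like /Retry in 25/",
    "| sort @timestamp desc",
    "| limit 100",
])

print(RETRY_QUERY)
```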
For the past week, no retries at 25 seconds were made, which is the retry time for failing high priority SMS tasks.
Will leave for another week to see what's in the logs.
Steve will look at queries in staging / prod.
@sastels is naughty and forgot, he will do it today!
I will definitely look at these today (hopefully!)
Reran the queries, no 25 second retries.
As we do not see instances of 25 second retries, but this was thoroughly tested locally with a triggered exception and we know it's working (and production is still working as expected), we'll move this to done.
Description
As a user of GCNotify, I want high priority SMS notifications to be sent within 1 minute, so that I can rely on the product and get my messages to users quickly enough.
As an ops lead of GCNotify, I want high priority SMS notifications to be sent within 1 minute, so that the alarms do not trigger an SLO violation.
WHY are we building?
WHAT are we building?
Reduce the retry period of the high priority SMS notifications, because the retry currently kicks off 5 minutes after the initial try, which is already past the SLO of 99% within 1 minute (90% within 20 seconds).
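The arithmetic behind that: when the first attempt fails, total delivery time is roughly first attempt + retry delay + second attempt, so the retry delay alone decides whether a retried send can still meet the SLO. A quick sanity check with illustrative numbers (assumed attempt durations, not measurements):

```python
SLO_SECONDS = 60  # target: 99% of high priority SMS within 1 minute

def retried_delivery_time(first_try_s: float, retry_delay_s: float,
                          second_try_s: float) -> float:
    """Elapsed time when the first send fails and the retry succeeds."""
    return first_try_s + retry_delay_s + second_try_s

# Old behaviour: retry kicks off ~300 s after the initial try.
print(retried_delivery_time(5, 300, 5) <= SLO_SECONDS)  # False: SLO blown
# Proposed: a ~25 s retry delay keeps a single failure inside the SLO.
print(retried_delivery_time(5, 25, 5) <= SLO_SECONDS)   # True
```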
VALUE created by our solution
Acceptance Criteria
QA Steps
Additional information
There are two areas where the potential changes could be made: