cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Increase the retry rate for high priority SMS notifications #211

Open jimleroyer opened 11 months ago

jimleroyer commented 11 months ago

Description

As a user of GCNotify, I want high priority SMS notifications to be sent within 1 minute, so that I can rely on the product and get my messages to users quickly enough.

As an ops lead of GCNotify, I want high priority SMS notifications to be sent within 1 minute, so that the alarms do not trigger an SLO violation.

WHY are we building?

WHAT are we building?

Reduce the retry period for high priority SMS notifications. The retry currently kicks off 5 minutes after the initial attempt, which is already past the SLO of 1 minute at 99% (20 seconds at 90%).

VALUE created by our solution

Acceptance Criteria

QA Steps

  1. Deploy to staging
  2. Activate HPA for the K8s pods, and Karpenter if possible, to add more scalability.
  3. Send intermittent batches of notifications that trigger scaling up and down. The scale-down part is especially important, as this is when messages in SQS get their visibility reset because Celery workers go down along with the corresponding Kubernetes pods (see the sketch after these steps).
  4. Confirm and track notifications that were retried within the minute during these periods, where previously they would have been retried after 5 minutes.
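
A minimal boto3 sketch of the mechanism these steps exercise; the queue URL, region, and timeout values are illustrative assumptions, not the real configuration:

```python
# Illustration of why the SQS visibility timeout caps the retry delay when a
# Celery worker dies mid-task. Queue URL and values are hypothetical.
import boto3

sqs = boto3.client("sqs", region_name="ca-central-1")
queue_url = "https://sqs.ca-central-1.amazonaws.com/123456789012/send-sms-high"  # hypothetical

# A worker receives a message: it becomes invisible to other consumers for
# VisibilityTimeout seconds.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    VisibilityTimeout=310,  # at ~5 minutes, a killed pod delays redelivery by up to 5 min
)

# If the pod is terminated during scale-down before deleting the message, no
# other worker can see it until the visibility timeout expires. Lowering the
# timeout (26 s here, an example value under the 1 minute SLO) caps that delay.
for msg in resp.get("Messages", []):
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=msg["ReceiptHandle"],
        VisibilityTimeout=26,
    )
```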

Additional information

There are two areas to make the potential changes:

  1. The code sites where we pick the delivery task as the last Celery task prior to sending the notification, such as this one. We need to override the task retry parameter only when the notification type is SMS and the queue is the high priority one. We do not want to override the retry period globally for all queues, but rather keep the 5 minute default elsewhere (a sketch follows this list).
  2. Tweak the visibility timeout of the SMS high priority queue by adding an explicit configuration for it in Terraform, as we've done for other existing queues such as this example.
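
A minimal sketch of the first change area, assuming a generic Celery task; the task name, queue name, and helper below are hypothetical stand-ins, not the actual notification-api code:

```python
# Sketch: lower the retry countdown only for high priority SMS, keeping the
# 5 minute default for every other queue. All names here are hypothetical.
from celery import Celery

app = Celery("notify", broker="sqs://")

HIGH_PRIORITY_SMS_QUEUE = "send-sms-high"  # hypothetical queue name
HIGH_PRIORITY_RETRY_SECONDS = 25           # lowered retry period, under the 1 min SLO
DEFAULT_RETRY_SECONDS = 300                # existing 5 minute default

def send_to_provider(notification_id):
    """Stub standing in for the real provider-sending call."""
    raise NotImplementedError

@app.task(bind=True, max_retries=48, default_retry_delay=DEFAULT_RETRY_SECONDS)
def deliver_sms(self, notification_id, queue=HIGH_PRIORITY_SMS_QUEUE):
    try:
        send_to_provider(notification_id)
    except Exception as exc:
        # Override the countdown only for the high priority SMS queue; all
        # other queues keep the default 5 minute retry.
        countdown = (
            HIGH_PRIORITY_RETRY_SECONDS
            if queue == HIGH_PRIORITY_SMS_QUEUE
            else DEFAULT_RETRY_SECONDS
        )
        raise self.retry(exc=exc, countdown=countdown, queue=queue)
```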
jimleroyer commented 11 months ago

Opened PR for the TF part: https://github.com/cds-snc/notification-terraform/pull/976

jimleroyer commented 11 months ago

Yesterday we merged the TF part into the staging environment. Today I will look at the application code to lower the retry period.

ben851 commented 11 months ago

I'm going to run a perf test against staging to test how this behaves.

ben851 commented 11 months ago

I rate-limited myself during testing last week. We should re-run this week.

sastels commented 11 months ago

Note: this is email related, but the same results would likely hold for SMS.

Tested with:

After the bulk job was done and the pods scaled down, we got 300+ second delays on a few priority emails.

Ran a second time with similar results.

Then adjusted the retry to 26 sec to match the SMS change from last week.

Ran the test 2 times. Both times had no large delays during scale-down (once a 10 sec delay, once a 1 sec delay).

sastels commented 11 months ago

Summary: did 4 tests of restarting the send-sms deployment while running the priority soak test:

  1. timeout 310: 265 sec delay
  2. timeout 26: no delay
  3. timeout 26: 15 sec delay
  4. timeout 310: 295 sec delay

PR to set it to 26 sec in prod: https://github.com/cds-snc/notification-terraform/pull/986

sastels commented 11 months ago

On staging, with the 26 sec timeout, we set the SMS pods to min 3 / max 20, with a scaling threshold of 25% CPU.

Will release this configuration to prod tomorrow: https://github.com/cds-snc/notification-manifests/pull/2113
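
For illustration, the settings above expressed with the Python kubernetes client; the deployment name, namespace, and the idea of creating the HPA from code are assumptions, as the real change lives in notification-manifests:

```python
# Sketch of the staging HPA: min 3 / max 20 SMS pods, scaling at 25% CPU.
# Names and namespace are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="send-sms", namespace="notification"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="send-sms"
        ),
        min_replicas=3,
        max_replicas=20,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=25),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="notification", body=hpa
)
```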

sastels commented 11 months ago

Jimmy made the existing tests work for the application code change and today will expand the tests around the lowered retry time.

sastels commented 11 months ago

Will run a bigger test (30min+ bulk send) on staging before release

sastels commented 11 months ago

Bigger test went as expected. [private Zenhub image]

sastels commented 11 months ago

new SMS pod scaling deployed and tested in prod :tada:

sastels commented 11 months ago

Jimmy's PR ready for review. Steve to review.

sastels commented 11 months ago

Reviewed, LGTM

jimleroyer commented 11 months ago

I reconfigured the devcontainer to run exclusively with VS Code, to fix local issues with admin and to help me with the ongoing testing of the reduced retry period in the Celery task for SMS sending.

jimleroyer commented 11 months ago

After some local testing showed my changes wouldn't work, I rewrote my PR to make them work. Some tests are not passing after these new changes, so I am trying to fix those at the moment.

jimleroyer commented 11 months ago

We deployed the change to lower the Celery task retry period for high priority SMS to staging yesterday.

I also made a CW query to monitor how the retries are performing: https://gcdigital.slack.com/archives/C012W5K734Y/p1699365094164439

[private Zenhub image]
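
For reference, a minimal sketch of running such a query with boto3; the log group name and query string are assumptions on my part, as the actual query lives in the Slack thread above:

```python
# Sketch: run a CloudWatch Logs Insights query to spot fast (25 s) retries.
# Log group and query string are hypothetical.
import time
import boto3

logs = boto3.client("logs", region_name="ca-central-1")

query = """
fields @timestamp, @message
| filter @message like /Retry in 25 seconds/
| sort @timestamp desc
| limit 100
"""

start = logs.start_query(
    logGroupName="/aws/containerinsights/notification/application",  # hypothetical
    startTime=int(time.time()) - 24 * 3600,  # last 24 hours
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then inspect the matched log lines.
while True:
    results = logs.get_query_results(queryId=start["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({f["field"]: f["value"] for f in row})
```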

ben851 commented 10 months ago

We need to turn the feature flag on in production to do the final QA. @jimleroyer to confirm/action.


jimleroyer commented 10 months ago

I turned the feature flag on in production, with the Celery task retry set to 25 seconds. It is merged and meant to be deployed today. We'll monitor the feature in production for a few days. I still need to carry the CW LogInsights query from staging to production for monitoring.

jimleroyer commented 10 months ago

It went to production last week with the feature flag enabled, for the Celery retry task part (not the SMS SQS visibility timeout part).

Next step is to monitor the retries via a query we wrote in staging; I put it in TF to move it to production next. The PR also consolidates all Celery related queries from both staging and production environments: https://github.com/cds-snc/notification-terraform/pull/1035

jimleroyer commented 10 months ago

For the past week, no retries at 25 seconds were made; 25 seconds is the retry time for failing high priority SMS tasks. [private Zenhub image]

sastels commented 10 months ago

Will leave for another week to see what's in the logs.

sastels commented 10 months ago

Steve will look at queries in staging / prod.

ben851 commented 10 months ago

@sastels is naughty and forgot, he will do it today!

sastels commented 10 months ago

I will definitely look at these today (hopefully!)


sastels commented 10 months ago

Reran the queries; no 25 second retries.

jimleroyer commented 10 months ago

We do not see instances of 25 second retries in production, but the change was thoroughly tested locally with triggered exceptions and we know it's working (and production is still working as expected), so we'll move this to done.