cds-snc / notification-planning-core

Project planning for GC Notify Core Team
0 stars 0 forks source link

Lower the retry period for celery email sending tasks #228

Open ben851 opened 7 months ago

ben851 commented 7 months ago

Description

As a user of GCNotify, I want high priority email notifications to be sent within 1 minute, So that I can rely on the product and send my messages quick enough to users.

As an ops lead of GCNotify, I want high priority email notifications to be sent within 1 minute, So the alarms does not trigger as a SLO violation.

WHY are we building?

WHAT are we building?

Reduce the retry period of the high priority email notifications, because the retry currently kicks off at 5 minutes after the initial try, which already got past the SLO 99% of 1 minute (20 seconds at 90%).

VALUE created by our solution

Acceptance Criteria

QA Steps

  1. Deploy to staging
  2. Activate HPA of K8s pods and with Karpenter if possible to add more scalability.
  3. Sends intermittent batches of notifications that trigger the scaling up and down. The down part is especially important as this is when the messages in SQS got their visibility reset when Celery workers go down along with the corresponding Kubernetes pods.
  4. Confirm and track notifications that were retried within the minute during these period, where it would retry after 5 minutes afterward.

Additional information

There are two areas to make the potential changes:

  1. The code sites where we pick the delivery task as the last Celery task prior to the sending of the notification, such as this one. We need to override the task retry parameter for only when the notification type is email and the queue is the high priority one. We do not want to override the retry period globally for all queues but rather keep the 5 minutes default.
  2. Tweak the visibility timeout of the email high priority queue by adding an explicit configuration for it in Terraform as we've done for other existing queues such as this one example.
jimleroyer commented 7 months ago

Started a draft PR, in a refactoring+polishing phase at the moment. https://github.com/cds-snc/notification-api/pull/2031/files

jimleroyer commented 7 months ago

I worked on the changes yesterday by bringing refactoring on the retry handling for emails and started fixing broken tests.

sastels commented 7 months ago

fixing some circular dependancies that arose.

ben851 commented 7 months ago

Jimmy fixed these, but is off today. Will follow up later this week

jimleroyer commented 7 months ago

Most errors fixed from existing tests but there is an edge case that still cause some issues.

sastels commented 6 months ago

test errors fixed. Almost ready for review, will probably add a few more tests.

jimleroyer commented 6 months ago

I moved this into review with this PR: https://github.com/cds-snc/notification-api/pull/2031

sastels commented 6 months ago

Steve will take a look!

sastels commented 6 months ago

looked and added comment / question

jimleroyer commented 6 months ago

This was merged today in staging, after PR approval.

ben851 commented 5 months ago

Ben to verify if this is in prod and will move to done if so.

ben851 commented 5 months ago

Merged several weeks ago and we've released to prod since. Looks good.