cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Reduce the SQS visibility timeout for High Priority Emails #227

Open jimleroyer opened 7 months ago

jimleroyer commented 7 months ago

Description

As a user of GCNotify, I want high priority email notifications to be sent within 1 minute, so that I can rely on the product and get my messages to users quickly enough.

As an ops lead of GCNotify, I want high priority email notifications to be sent within 1 minute, so that the alarms do not trigger an SLO violation.

WHY are we building?

WHAT are we building?

Reduce the retry period for high priority email notifications: the retry currently kicks off 5 minutes after the initial attempt, which already blows past the SLO of 1 minute at 99% (20 seconds at 90%).
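As a back-of-envelope sketch of why the current delay can never meet the SLO for a retried message (the 26-second queue-delay threshold below comes from the QA steps; treating it as the first-attempt budget is an assumption):

```python
# Sketch only: the arithmetic behind shortening the retry delay.
SLO_SECONDS = 60            # 99% of high priority emails within 1 minute
CURRENT_RETRY_DELAY = 300   # retry currently kicks off 5 minutes after the first try
FIRST_ATTEMPT_BUDGET = 26   # assumed: the queue-delay threshold from the QA steps

# Any message that needs even one retry misses the SLO today:
assert CURRENT_RETRY_DELAY > SLO_SECONDS

# To fit one retry inside the SLO, the visibility timeout must leave room
# for the first attempt:
max_visibility_timeout = SLO_SECONDS - FIRST_ATTEMPT_BUDGET
print(max_visibility_timeout)  # 34
```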

VALUE created by our solution

Acceptance Criteria

QA Steps

  1. Deploy to staging
  2. Activate HPA on the K8s pods, and Karpenter if possible, to add more scalability.
  3. Send intermittent batches of notifications that trigger scaling up and down. The scale-down is especially important, as this is when messages in SQS get their visibility reset, when Celery workers go down along with the corresponding Kubernetes pods.
  4. Confirm and track notifications that were retried within the minute during these periods, where previously the retry would have happened after 5 minutes.
  5. In production, monitor queue delays to ensure that we don't have emails invisible in the queue while the queue delay is above 26 seconds.

Additional information

There are two areas to make the potential changes:

  1. The code sites where we pick the delivery task as the last Celery task prior to sending the notification, such as this one. We need to override the task retry parameter only when the notification type is email and the queue is the high priority one. We do not want to override the retry period globally for all queues, but rather keep the 5-minute default everywhere else.
  2. Tweak the visibility timeout of the email high priority queue by adding an explicit configuration for it in Terraform, as we've done for other existing queues such as this example.
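The first change could look roughly like this; a minimal sketch, where the helper name, the `send-email-high` queue label, and the delay values are assumptions rather than the actual GCNotify identifiers:

```python
# Sketch only: shorten the Celery retry countdown solely for high priority
# email, keeping the 5-minute default for every other queue. All names here
# are assumptions, not the actual GCNotify code.
DEFAULT_RETRY_DELAY = 5 * 60           # seconds; the global default we keep
HIGH_PRIORITY_EMAIL_RETRY_DELAY = 25   # seconds; small enough to retry within the SLO

def retry_delay_for(notification_type: str, queue_name: str) -> int:
    """Pick the retry countdown for a delivery task."""
    if notification_type == "email" and queue_name == "send-email-high":
        return HIGH_PRIORITY_EMAIL_RETRY_DELAY
    return DEFAULT_RETRY_DELAY

# Inside the delivery task, the override would then be applied per call, e.g.:
#   raise self.retry(countdown=retry_delay_for(notification.type, queue))
```

Passing a per-call `countdown` to `retry()` keeps the queue-level `default_retry_delay` untouched, which is what keeps the 5-minute default for the other queues.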
sastels commented 7 months ago

For send-email-high queue visibility timeout: https://github.com/cds-snc/notification-terraform/pull/1029

ben851 commented 7 months ago

Need to run the import on this, which will be done today.

sastels commented 7 months ago

staging looks good. soak 1/s, bulk 2x50K

sastels commented 7 months ago

In production. Ready for QA

sastels commented 7 months ago

Jimmy will look at high priority email queue delays for the past week.

jimleroyer commented 7 months ago

@jimleroyer will check this after stand up.

jimleroyer commented 7 months ago

Today I will, I promise

jimleroyer commented 7 months ago

Using the Blazer query [SLO] / [Sending time] High priority send times, for the production environment.

BEFORE: From Nov 14 to Nov 21:

label   notification_count
<1s 37,205
<2s 5,980
<3s 1,310
<4s 342
<5s 129
<6s 58
<7s 22
<8s 20
<9s 21
<10s    14
<11s    15
<12s    19
<13s    18
<14s    17
<15s    10
<16s    3
<17s    69
<18s    66
<19s    22
<20s    15
<21s    8
<22s    8
<23s    7
<24s    9
<25s    6
<26s    3
<27s    68
<28s    9
<29s    1
<30s    2
<31s    2
<32s    2
<33s    3
<34s    1
<36s    1
<37s    1
<39s    1
<46s    1
<57s    1
<61s    1
<62s    1
<63s    2
<67s    2
<68s    1
<69s    1
<77s    1
<78s    1
<82s    1
<100s   1
<111s   1
<127s   1
<146s   1
<153s   1
<159s   1
<161s   1
<176s   1
<186s   1
<231s   1
<236s   1
<237s   2
<249s   1
<257s   1
<266s   1
<311s   1
<320s   1
<336s   1
<350s   1
<356s   1
<374s   1
<381s   1
<394s   1

AFTER: From Nov 21 to Nov 27:

label   notification_count
<1s 31,666
<2s 5,172
<3s 988
<4s 197
<5s 56
<6s 24
<7s 16
<8s 5
<9s 7
<10s    8
<11s    12
<12s    11
<13s    3
<14s    3
<15s    1
<16s    4
<17s    52
<18s    31
<19s    11
<20s    2
<21s    3
<27s    24
<28s    3
<29s    1
<30s    1
<33s    1
<34s    1
<35s    1

We did get 2 anomalies in the past 2 days, though. Given the issues we had, I discarded these from the AFTER report above. Even so, the numbers are far lower than before; however, we should check in a while whether we have completely eliminated high priority notifications taking more than 5 minutes.

label   notification_count
<302s   1
<16078s 1
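The bucketed Blazer output above can be condensed into a single SLO attainment figure with a short script; a sketch, assuming the `<Ns` labels are upper bounds in seconds, illustrated on a small subset of the BEFORE data:

```python
def slo_attainment(buckets: dict[str, int], slo_seconds: int = 60) -> float:
    """Fraction of notifications sent within the SLO.

    Bucket labels like "<5s" are read as "send time under 5 seconds".
    """
    total = sum(buckets.values())
    within = sum(count for label, count in buckets.items()
                 if int(label.strip("<s")) <= slo_seconds)
    return within / total

# Small subset of the BEFORE data above, for illustration only.
before_subset = {"<1s": 37205, "<2s": 5980, "<3s": 1310, "<61s": 1, "<394s": 1}
print(f"{slo_attainment(before_subset):.4%}")
```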