cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Test and fix the in-flight expired alarms #136

Open jimleroyer opened 1 year ago

jimleroyer commented 1 year ago

Describe the bug

The expired in-flight alarms did not trigger during a recent incident, i.e. 2023-06-01-notifications-getting-errors-repeatedly.

Bug Severity

SEV-2 Major

To Reproduce

Steps to reproduce the behavior:

  1. Block a Redis queue with a faulty notification (such as one with overly long content or a bad signature); a sketch follows these steps.
  2. Wait through several 5-minute iterations while the notification flips from the inbox to in-flight and back to the inbox, numerous times.
  3. Wait for the alarm to trigger based on its configuration, such as this one (see examples in the documentation); it will not trigger.
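
A minimal sketch of step 1, assuming redis-py and a hypothetical inbox key name (the real inbox/in-flight key names live in the GC Notify batch-saving code):

```python
# Minimal sketch of step 1. The queue key below is hypothetical;
# the real inbox/in-flight key names live in the GC Notify code base.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Push a payload the batch saver cannot verify (bad signature), so it
# keeps cycling between the inbox and the in-flight list.
r.rpush("notifications:inbox:bulk", "not-a-validly-signed-notification")
```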

Expected behavior

The alarm should trigger.

Impact

We do not know when our Redis queues fill up and stop being processed.

Impact on Notify users: Potential delay of notifications or no sending.

Impact on Recipients: Potential delay of notifications or no reception.

Impact on Notify team: Less awareness of issues and more difficulty troubleshooting.

Additional context

This behavior was observed during the incident 2023-06-01-notifications-getting-errors-repeatedly.

sastels commented 1 year ago

We don't seem to have an explicit alarm around inflights being expired.

sastels commented 1 year ago

Ok, I was wrong: we have 4 alarms for expired inflights! So why weren't these triggered? :thinking: https://github.com/cds-snc/notification-terraform/blob/main/aws/common/cloudwatch_alarms.tf#L1345-L1461

sastels commented 1 year ago

Well, first of all, the warning / critical thresholds are set to 100 and 200 (over a 5-minute period), which is crazy high for something that only happens when something is going wrong.

However I still don't see any values for that metric during the incident. Investigating more...

sastels commented 1 year ago

Ok, the issue seems to be that when we create the alarm metric we don't set the priority dimension. I think the idea was to have the metric contain all priorities, but when we set the priority to "bulk" we see the expired inflights in the metric. Below, the blue metric is the alarm metric (which has no data) and the orange one is where I've restricted to priority: bulk.

[screenshot: the alarm metric (blue, no data) vs. the same metric restricted to priority: bulk (orange)]

sastels commented 1 year ago

From https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Dimension, we can't just omit a dimension and get all the data where the other dimensions are set. So we have to either create separate alarms for the different priorities, or possibly use some sort of "search" expression to find all the dimensions.
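
For illustration, a hedged boto3 sketch of that behaviour: each unique dimension combination is a separate metric, so a query that omits the priority dimension does not aggregate across priorities (namespace and metric name are assumptions, not the real ones):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Data points are written with a priority dimension...
cloudwatch.put_metric_data(
    Namespace="NotificationCanadaCa",  # assumed namespace
    MetricData=[{
        "MetricName": "expired-inflights",  # assumed metric name
        "Dimensions": [{"Name": "priority", "Value": "bulk"}],
        "Value": 1.0,
    }],
)

# ...but a query that omits the dimension targets a *different* metric
# (the dimensionless one) and returns no datapoints; it does not sum
# across the per-priority metrics.
now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="NotificationCanadaCa",
    MetricName="expired-inflights",
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
print(resp["Datapoints"])  # [] -- no data without the matching dimension
```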

sastels commented 1 year ago

The fancy AWS queries over multiple dimensions only work on the last 3 hours of data, which is not optimal. However, another recommended approach is to add a new value for the priority dimension, "any", and write to that metric any time you write to any priority.

So we have 3 options, essentially:

  1. Create separate warning / critical alarms for each priority.
  2. Use a metric "search" expression across the priority dimensions (but that only works on the last 3 hours of data).
  3. Add an "any" value for the priority dimension, written alongside every real priority, and alarm on that.

We have so few expired inflights that I lean towards the last: just one warning and one critical alarm that directs you to the batch saving dashboard, where everything is broken out.
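
A rough sketch of that last option, using a hypothetical helper and the same assumed namespace/metric names as above; each datapoint is written twice, once under its real priority and once under "any":

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def put_expired_inflights(count: int, priority: str) -> None:
    """Hypothetical helper: record expired inflights under the real
    priority and under the catch-all "any", so a single alarm on
    priority: any covers every queue."""
    def datapoint(prio: str) -> dict:
        return {
            "MetricName": "expired-inflights",  # assumed metric name
            "Dimensions": [{"Name": "priority", "Value": prio}],
            "Value": float(count),
        }

    cloudwatch.put_metric_data(
        Namespace="NotificationCanadaCa",  # assumed namespace
        MetricData=[datapoint(priority), datapoint("any")],
    )
```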

sastels commented 1 year ago

How to create an expired inflight

sastels commented 1 year ago

PR to add metric values for any notification type or priority.

After this merges, we can fix the expired in-flight alarms in Terraform.
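
Once the "any" datapoints exist, the Terraform fix amounts to pointing the alarms at that dimension with saner thresholds; a hedged boto3 equivalent of what such an alarm would declare (alarm name and threshold are assumptions, the real alarms live in cloudwatch_alarms.tf):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical warning alarm on the catch-all dimension; the real alarms
# are declared in notification-terraform's cloudwatch_alarms.tf.
cloudwatch.put_metric_alarm(
    AlarmName="expired-inflights-warning",  # assumed name
    Namespace="NotificationCanadaCa",       # assumed namespace
    MetricName="expired-inflights",
    Dimensions=[{"Name": "priority", "Value": "any"}],
    Statistic="Sum",
    Period=300,           # 5-minute window, as discussed in the thread
    EvaluationPeriods=1,
    Threshold=1,          # far lower than the original 100 / 200
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```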

ben851 commented 1 year ago
jimleroyer commented 1 year ago

Tested the inflight warning alarm by poisoning the bulk Redis queue with a badly signed notification. The alarm triggered after 5 minutes as expected. I checked the dashboard afterward and it showed the following:

[screenshot: batch saving dashboard showing the expired inflights from the test]