Looking at metrics for the periods when the alarm goes off, the alarms consistently coincide with large bulk sends. I'm not sure there's a lot we can do to address this in the short term other than modifying the alarm to be less sensitive.
In the long term, we will need to investigate how to get better throughput.
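To illustrate what "less sensitive" could mean in practice, here is a minimal sketch of an "M out of N datapoints" rule: the alarm fires only when the queue backlog stays above a threshold for several consecutive checks, so a short spike from a bulk send does not trigger it. The function name, the threshold of 1000 messages, and the sample datapoints are all hypothetical and are not taken from our actual alarm configuration.

```python
from typing import Sequence


def should_alarm(
    datapoints: Sequence[float],
    threshold: float,
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> bool:
    """Return True when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above `threshold`."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    # Only the most recent window of datapoints is considered.
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical queue-depth samples: a short spike caused by a bulk send.
bulk_send_spike = [10, 20, 5000, 4000, 30]
# Hypothetical queue-depth samples: a backlog that never drains.
sustained_backlog = [2000, 2500, 3000, 3500, 4000]

# A sensitive rule (1 of 1) fires as soon as the spike appears.
print(should_alarm(bulk_send_spike[:3], 1000, 1, 1))   # True
# A less sensitive rule (5 of 5) ignores the spike...
print(should_alarm(bulk_send_spike, 1000, 5, 5))       # False
# ...but still fires when the backlog persists.
print(should_alarm(sustained_backlog, 1000, 5, 5))     # True
```

The trade-off is detection delay: requiring more breaching datapoints means a real backlog is reported later, so the window should stay shorter than the longest delay we are willing to tolerate.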
Spoke with Jimmy, and we agreed that the only real solution to this is to begin focusing on work to improve the throughput of the system. I've created a new card for that work.
Description
As a developer/operator of GC Notify, I would like to be alerted only when there are actual issues with our system, and not during false alarms, so that I do not get alert fatigue and can quickly identify real errors.
This card covers the following alerts in the alarm review spreadsheet
WHY are we building?
We are receiving a lot of noise in our operations Slack channel that is not indicative of actual issues.
WHAT are we building?
Investigate the bulk SQS queue alarms and determine whether their underlying cause can be fixed or whether the alarms need adjustment.
VALUE created by our solution
Fewer false alarms will increase developer agility and shorten response times to actual issues.
Acceptance Criteria
QA Steps