Overview
In a recent incident, we were not aware the Check Service was down because the main canary was still pinging successfully. To prevent this from happening again, we need to implement a new alarm. This alarm should measure the ratio of 2XX to 5XX responses every 5 minutes on both the Platform and Check Service's CDN logs. If the percentage of 5XX errors exceeds 20%, it should raise an alarm.
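The threshold check described above can be sketched as a small function. This is an illustrative sketch only; the function name and the default 20% threshold are taken from this ticket, and the counts would come from the CDN logs over each 5-minute window:

```python
def error_rate_exceeds_threshold(count_2xx: int, count_5xx: int,
                                 threshold_pct: float = 20.0) -> bool:
    """Return True if the share of 5XX responses in a window exceeds the threshold.

    The threshold defaults to 20% but is a parameter so it can be tuned later,
    as the acceptance criteria require.
    """
    total = count_2xx + count_5xx
    if total == 0:
        return False  # no traffic in the window; nothing to alarm on
    return (count_5xx / total) * 100 > threshold_pct
```

Note that a zero-traffic window is treated as healthy here; whether "no traffic" should itself alarm is a separate decision for refinement.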
Pull Request (PR):
Tech Approach
A bullet-pointed list with details on how this could be approached technically.
Include links to relevant webpages, GitHub files, user guides, etc.
The person raising the ticket should have a first pass at this (if they know the approach); the Tech Lead will then review and bring it to a refinement session.
Acceptance Criteria/Tests
Monitoring is set up to check the % of 5XX responses every 5 minutes.
An alarm is set up so that if the % exceeds 20%, a notification is sent to the Platform Slack channel.
The threshold % should be easily configurable so we can change it if we find we are missing incidents or generating false positives.
@Ben-Hodgkiss, @eveleighoj, @DilwoarH and @CharliePatterson are notified once this has been implemented.
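If AWS CloudWatch is the monitoring tool (an assumption; the ticket does not name one), the criteria above could be met with a metric-math alarm. The sketch below builds a `PutMetricAlarm`-shaped request; the namespace `CDN/Logs` and metric names `Count2XX`/`Count5XX` are placeholders that must be confirmed against the actual log-derived metrics:

```python
def build_5xx_alarm_request(threshold_pct: float = 20.0,
                            period_seconds: int = 300) -> dict:
    """Build a CloudWatch-style alarm definition on the 5XX percentage.

    threshold_pct and period_seconds are parameters so the 20% threshold and
    5-minute window can be tuned without code changes.
    """
    def stat(metric_id: str, metric_name: str) -> dict:
        # Sum of response counts over each period; names are placeholders.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": "CDN/Logs", "MetricName": metric_name},
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "check-service-5xx-error-rate",
        "AlarmDescription": "5XX responses exceed the configured share of 2XX+5XX traffic",
        "Metrics": [
            stat("c2xx", "Count2XX"),
            stat("c5xx", "Count5XX"),
            {
                # Metric math: percentage of 5XX out of all 2XX+5XX responses,
                # guarded against division by zero in quiet windows.
                "Id": "pct5xx",
                "Expression": "IF(c2xx + c5xx > 0, 100 * c5xx / (c2xx + c5xx), 0)",
                "Label": "5XX percentage",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```

The alarm's action would then point at an SNS topic wired to the Platform Slack channel; that wiring is outside this sketch.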
Resourcing & Dependencies
May require AWS access.
Providers need to be made aware of any changes and notified when they are implemented.