Overview
In a recent incident, we were not aware the Check Service was down because the main canary was still pinging successfully. To prevent this from happening again, we need to implement a new alarm. This alarm should measure the ratio of 2XX to 5XX responses every 5 minutes on both the Platform and Check Service's CDN logs. If the percentage of 5XX errors exceeds 20%, it should raise an alarm.
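The threshold check described above can be sketched as a small function. This is an illustrative sketch only; the function name and the default 20% threshold are taken from this ticket, and the counts would come from the CDN logs over each 5-minute window:

```python
def error_rate_exceeds_threshold(count_2xx: int, count_5xx: int,
                                 threshold_pct: float = 20.0) -> bool:
    """Return True if the share of 5XX responses in a window exceeds the threshold.

    The threshold defaults to 20% but is a parameter so it can be tuned later,
    as the acceptance criteria require.
    """
    total = count_2xx + count_5xx
    if total == 0:
        return False  # no traffic in the window; nothing to alarm on
    return (count_5xx / total) * 100 > threshold_pct
```

Note that a zero-traffic window is treated as healthy here; whether "no traffic" should itself alarm is a separate decision for refinement.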
Pull Request (PR):
Tech Approach
A bullet-pointed list with details on how this could be approached technically.
Include links to relevant webpages, GitHub files, user guides, etc.
The person raising the ticket should have a first pass at this (if they know the approach); the Tech Lead will then review and bring it to a refinement session.
Acceptance Criteria/Tests
Monitoring is set up to check the % of 5XX responses every 5 minutes.
An alarm is set up so that if the % exceeds 20%, a notification is sent to the Platform Slack channel.
The threshold % should be easily configurable so we can change it if we find we are missing incidents or generating false positives.
@Ben-Hodgkiss, @eveleighoj, @DilwoarH and @CharliePatterson are notified once this has been implemented.
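If AWS CloudWatch is the monitoring tool (an assumption; the ticket does not name one), the criteria above could be met with a metric-math alarm. The sketch below builds a `PutMetricAlarm`-shaped request; the namespace `CDN/Logs` and metric names `Count2XX`/`Count5XX` are placeholders that must be confirmed against the actual log-derived metrics:

```python
def build_5xx_alarm_request(threshold_pct: float = 20.0,
                            period_seconds: int = 300) -> dict:
    """Build a CloudWatch-style alarm definition on the 5XX percentage.

    threshold_pct and period_seconds are parameters so the 20% threshold and
    5-minute window can be tuned without code changes.
    """
    def stat(metric_id: str, metric_name: str) -> dict:
        # Sum of response counts over each period; names are placeholders.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": "CDN/Logs", "MetricName": metric_name},
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "check-service-5xx-error-rate",
        "AlarmDescription": "5XX responses exceed the configured share of 2XX+5XX traffic",
        "Metrics": [
            stat("c2xx", "Count2XX"),
            stat("c5xx", "Count5XX"),
            {
                # Metric math: percentage of 5XX out of all 2XX+5XX responses,
                # guarded against division by zero in quiet windows.
                "Id": "pct5xx",
                "Expression": "IF(c2xx + c5xx > 0, 100 * c5xx / (c2xx + c5xx), 0)",
                "Label": "5XX percentage",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```

The alarm's action would then point at an SNS topic wired to the Platform Slack channel; that wiring is outside this sketch.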
Resourcing & Dependencies
May require AWS access.
Providers need to be made aware of any changes and notified when they are implemented.