digital-land / technical-documentation

Technical Documentation for the planning data service.
https://digital-land.github.io/technical-documentation/index.html

Implement alarm monitoring 2XX:5XX proportion #72

Closed: Ben-Hodgkiss closed this issue 3 days ago

Ben-Hodgkiss commented 1 month ago

Overview

In a recent incident, we did not know the Check Service was down because the main canary was still pinging successfully. To reduce the chance of this happening again, we need to implement a new alarm. This alarm should measure the proportion of 2XX to 5XX responses every 5 minutes in the CDN logs of both the Platform and the Check Service. If 5XX errors exceed 20%, it should raise an alarm.
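To make the threshold concrete, here is a minimal sketch of the check in Python. It is illustrative only and is not the implementation in the linked PR. It assumes the 5XX percentage is taken over the sum of 2XX and 5XX responses in one 5-minute window, that "over 20%" means strictly greater than 20%, and that a window with no traffic does not alarm. The names `WindowCounts`, `error_proportion` and `should_alarm` are hypothetical.

```python
from dataclasses import dataclass

# Threshold from the issue: alarm when 5XX errors are over 20%.
ERROR_THRESHOLD = 0.20


@dataclass(frozen=True)
class WindowCounts:
    """Response counts for one 5-minute window of CDN logs (hypothetical shape)."""
    count_2xx: int
    count_5xx: int


def error_proportion(window: WindowCounts) -> float:
    """Return the share of 5XX responses among 2XX and 5XX responses."""
    total = window.count_2xx + window.count_5xx
    if total == 0:
        # Assumption: an empty window is treated as healthy rather than alarming.
        return 0.0
    return window.count_5xx / total


def should_alarm(window: WindowCounts, threshold: float = ERROR_THRESHOLD) -> bool:
    """Return True when the 5XX share is strictly above the threshold."""
    return error_proportion(window) > threshold


if __name__ == "__main__":
    healthy = WindowCounts(count_2xx=950, count_5xx=50)
    failing = WindowCounts(count_2xx=600, count_5xx=400)
    print(should_alarm(healthy), should_alarm(failing))
```

Treating an empty window as healthy is a design choice worth revisiting: a service that is fully down may log no responses at all, so a separate check for missing data may also be needed.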

Pull Request (PR):

Tech Approach

A bullet-pointed list with details on how this could be technically implemented.

Acceptance Criteria/Tests

Resourcing & Dependencies

cpcundill commented 3 days ago

PR: https://github.com/digital-land/digital-land-infrastructure/pull/160

cpcundill commented 3 days ago

@Ben-Hodgkiss, @eveleighoj, @DilwoarH and @CharliePatterson: the alarms have now been rolled out to all environments (dev, staging and prod).