Overview
In a recent incident, we were not aware the Check Service was down because the error had first been logged a few weeks earlier (when it did not bring the whole Service down). Since Sentry already knew about the issue, no new alert was posted to the Slack channel to notify us of the problem.
To remedy this, we should alter the Sentry notification settings so that errors with level=fatal alert more persistently. If a fatal error recurs after its initial report, we should be alerted again - for example, whenever the issue occurs again more than 24 hours after the previous notification.
If possible, it would also be good for fatal alerts to be more prominent in Slack, so that they are noticeably visually distinct from other Sentry issues (which are generally saved for in-depth investigation at the fortnightly review call).
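As a starting point, both the 24-hour re-notification and a more prominent Slack destination for fatal issues could potentially be expressed as a single Sentry issue alert rule. The sketch below is a minimal, unverified example using Sentry's issue alert rule API from Python; the organisation/project slugs, Slack channel name, and environment variable names are placeholders, and the exact conditions and action interval available to us depend on our Sentry plan and integration setup.

```python
# Sketch: create a Sentry issue alert rule that re-notifies Slack for fatal-level
# issues at most once every 24 hours while events keep arriving.
# ORG, PROJECT, the channel name and the env var names below are placeholders.
import os

import requests

ORG = "our-org"            # hypothetical organisation slug
PROJECT = "check-service"  # hypothetical project slug
TOKEN = os.environ["SENTRY_AUTH_TOKEN"]

rule = {
    "name": "Fatal errors - daily re-notification",
    "actionMatch": "all",
    "filterMatch": "all",
    # Trigger on any activity: "the issue is seen more than 0 times in 1 hour".
    "conditions": [
        {
            "id": "sentry.rules.conditions.event_frequency.EventFrequencyCondition",
            "value": 0,
            "interval": "1h",
        }
    ],
    # ...but only for fatal-level events (Sentry encodes fatal as level "50").
    "filters": [
        {
            "id": "sentry.rules.filters.level.LevelFilter",
            "match": "eq",
            "level": "50",
        }
    ],
    # Post to a dedicated Slack channel so fatal alerts stand apart from the
    # issues we park for the fortnightly review.
    "actions": [
        {
            "id": "sentry.integrations.slack.notify_action.SlackNotifyServiceAction",
            "workspace": os.environ["SLACK_WORKSPACE_ID"],
            "channel": "#alerts-fatal",
        }
    ],
    # Action interval in minutes: perform the actions at most once every 24 hours
    # per issue, i.e. re-notify if the issue is still occurring a day later.
    "frequency": 1440,
}

resp = requests.post(
    f"https://sentry.io/api/0/projects/{ORG}/{PROJECT}/rules/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=rule,
    timeout=30,
)
resp.raise_for_status()
print("Created alert rule:", resp.json().get("id"))
```

The key setting is the rule's action interval (frequency, in minutes): with the rule matching any fatal-level event, a value of 1440 means the Slack action fires again at most once every 24 hours while the issue keeps occurring, which matches the behaviour described above. Pointing the action at a dedicated channel is one low-effort way of making fatal alerts visually distinct.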
Pull Request (PR):
Tech Approach
A bullet-pointed list with details on how this could be technically approached.
Include links to relevant webpages, GitHub files, user guides, etc.
Acceptance Criteria/Tests
Sentry alerts modified for fatal errors so that Slack is re-notified if they recur more than 24 hours after the previous notification (see the verification sketch after this list)
Investigation into whether fatal-level errors can be made more visually distinct in Slack
Infrastructure and Providers informed of any changes.
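As a rough check for the first criterion above, the project's issue alert rules could be listed via the API and the fatal rule's action interval asserted to be at least 24 hours. The snippet below is a sketch under the same placeholder assumptions as the earlier example.

```python
# Sketch of an acceptance check: list the project's issue alert rules and assert
# that a fatal-level rule exists with an action interval of at least 24 hours.
# ORG, PROJECT and the env var name are the same placeholders as in the sketch above.
import os

import requests

ORG = "our-org"
PROJECT = "check-service"
TOKEN = os.environ["SENTRY_AUTH_TOKEN"]

resp = requests.get(
    f"https://sentry.io/api/0/projects/{ORG}/{PROJECT}/rules/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()


def is_fatal_rule(rule: dict) -> bool:
    """True if any filter restricts the rule to fatal-level ("50") events."""
    return any(
        f.get("id", "").endswith("LevelFilter") and f.get("level") == "50"
        for f in rule.get("filters", [])
    )


fatal_rules = [r for r in resp.json() if is_fatal_rule(r)]
assert fatal_rules, "No fatal-level alert rule found"
assert all(int(r.get("frequency", 0)) >= 1440 for r in fatal_rules), (
    "Fatal rule(s) do not use the agreed 24-hour (1440 minute) action interval"
)
print("Fatal alert rule(s) look correct:", [r["name"] for r in fatal_rules])
```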
Resourcing & Dependencies
Need to ensure Providers are aware of any changes made and are told when they have been implemented