Closed (LindseySaari closed 7 months ago)
New monitors are documented on this page (add to it as needed): https://vfs.atlassian.net/wiki/spaces/PPT/pages/2966159668/Vets+API+Alert+and+Monitoring+Trends
As of today, @rjohnson2011 has the charts built out. Next, we want an alert that notifies PagerDuty when used space reaches 80%. It might also make sense to include a storage space alert so we aren't caught unawares when free space gets low.
Sentry data retention looks to be the last three months. Ryan will do some further digging and look at the config to get more information.
Created a Datadog dashboard with RDS Sentry metrics for Free/Total/Used Space - https://vagov.ddog-gov.com/dashboard/w2n-vq9-v75?fromUser=true&refresh_mode=paused&view=spans&from_ts=1704085200000&to_ts=1711046880000&live=false
Created an RDS Sentry monitor / alert that triggers when RDS Sentry Used Space is above 80% - https://vagov.ddog-gov.com/monitors/209510?view=spans
The alert is firing correctly in #platform-cop-be-notifications in Slack - https://dsva.slack.com/archives/C039HRTHXDH/p1711047114147909
Documentation is the only remaining item for this work.
We know the alert works!!! Well done @rjohnson2011 !
Platform Alerts Documentation updated with the new monitor / alert / dashboard links.
@rjohnson2011 is finishing up the Terraform work, and this ticket will close today.
Description
In our utility environment, the Sentry instance is at risk of encountering a full hard drive (HD), which could lead to data loss, service disruption, and an inability to process incoming error reports. To proactively manage this risk, we need to implement an alert mechanism that notifies us well before the HD reaches its full capacity. This early warning system will allow us to take necessary actions, such as increasing storage capacity or optimizing current storage usage, thereby ensuring continuous operation of our Sentry instance without interruption.
Acceptance Criteria (AC)
Alert Implementation: An alert mechanism is implemented in Datadog.
Threshold Configuration: The alert is configured to trigger when the hard drive usage reaches 80% of its total capacity, providing ample time to respond to the storage needs.
Notification System: Upon triggering, the alert sends notifications to the designated team via Slack or PagerDuty.
Documentation: Documentation is updated or created to include:
By fulfilling these acceptance criteria, we can ensure our Sentry instance remains operational and effective in capturing and processing error reports, thereby maintaining the reliability of our applications.
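For reference, the 80% threshold check from the acceptance criteria can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Datadog monitor itself: the `StorageSample` type, its field names, and the function names are hypothetical, and the real monitor is defined in Datadog / Terraform.

```python
from dataclasses import dataclass

# Threshold taken from the acceptance criteria (80% of total capacity).
ALERT_THRESHOLD_PERCENT = 80.0


@dataclass
class StorageSample:
    """One storage reading. Field names are hypothetical, not Datadog metric names."""
    total_bytes: int
    free_bytes: int


def used_percent(sample: StorageSample) -> float:
    """Return used space as a percentage of total capacity."""
    if sample.total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_bytes = sample.total_bytes - sample.free_bytes
    return 100.0 * used_bytes / sample.total_bytes


def should_alert(sample: StorageSample,
                 threshold: float = ALERT_THRESHOLD_PERCENT) -> bool:
    """True once usage reaches the threshold (>=, matching 'reaches 80%')."""
    return used_percent(sample) >= threshold
```

The sketch uses `>=` because the acceptance criteria say "reaches 80%". The monitor described in the comments triggers when Used Space is "above 80%", so the exact comparison in the deployed monitor may differ at the boundary.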