department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Sentry Server - Create RDS alert #76625

Closed by LindseySaari 7 months ago

LindseySaari commented 9 months ago

Description

In our utility environment, the Sentry instance is at risk of filling its hard drive (HD), which could cause data loss, service disruption, and an inability to process incoming error reports. To manage this risk, we need an alert that notifies us well before the HD reaches full capacity. This early warning will give us time to act, for example by increasing storage capacity or optimizing current storage usage, so the Sentry instance keeps running without interruption.

Acceptance Criteria (AC)

  1. Alert Implementation: An alert mechanism is implemented in Datadog.

  2. Threshold Configuration: The alert is configured to trigger when the hard drive usage reaches 80% of its total capacity, providing ample time to respond to the storage needs.

  3. Notification System: Upon triggering, the alert sends notifications to the designated team via Slack or PagerDuty.

  4. Documentation: Documentation is updated or created to include:

    • Details on the monitoring and alert system setup for the Sentry instance.
    • Steps to take when an alert is received, including how to increase storage capacity and how to check and optimize current storage usage.

By fulfilling these acceptance criteria, we can ensure our Sentry instance remains operational and effective in capturing and processing error reports, thereby maintaining the reliability of our applications.
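As an illustration of the threshold in criterion 2, here is a minimal Python sketch of the check the alert is meant to perform. It is not the Datadog monitor itself; `StorageSample`, `used_percent`, and `should_alert` are hypothetical names, and the sample values are made up for the example.

```python
from dataclasses import dataclass

# Threshold taken from the acceptance criteria (80% of total capacity).
ALERT_THRESHOLD_PCT = 80.0


@dataclass
class StorageSample:
    """One storage reading; field names are hypothetical."""
    total_bytes: int
    free_bytes: int


def used_percent(sample: StorageSample) -> float:
    """Return used space as a percentage of total capacity."""
    if sample.total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used = sample.total_bytes - sample.free_bytes
    return 100.0 * used / sample.total_bytes


def should_alert(sample: StorageSample,
                 threshold_pct: float = ALERT_THRESHOLD_PCT) -> bool:
    """True once usage reaches the threshold (criterion 2 says 'reaches 80%')."""
    return used_percent(sample) >= threshold_pct
```

For example, a reading with 100 units total and 20 free is exactly 80% used, so `should_alert` returns `True`; with 21 free it returns `False`.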

rmtolmach commented 8 months ago

New monitors are documented on this page; add to it as needed: https://vfs.atlassian.net/wiki/spaces/PPT/pages/2966159668/Vets+API+Alert+and+Monitoring+Trends

jennb33 commented 8 months ago

As of today, @rjohnson2011 has charts built out. We want an alert that fires in PagerDuty when we reach 80%. It might also make sense to include a storage space alert so we aren't caught unawares when it gets low.

jennb33 commented 8 months ago

Sentry data retention looks to be the last three months. Ryan will dig further and look at the config to get more information.

rjohnson2011 commented 8 months ago

Created a Datadog dashboard with RDS Sentry metrics for Free/Total/Used Space - https://vagov.ddog-gov.com/dashboard/w2n-vq9-v75?fromUser=true&refresh_mode=paused&view=spans&from_ts=1704085200000&to_ts=1711046880000&live=false

Created an RDS Sentry monitor / alert for when RDS Sentry Used Space is above 80% - https://vagov.ddog-gov.com/monitors/209510?view=spans

This alert is correctly alerting in #platform-cop-be-notificatoins in Slack - https://dsva.slack.com/archives/C039HRTHXDH/p1711047114147909
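For readers without access to the monitor link, a monitor like the one described in this comment might be defined roughly as follows. This is a sketch under assumptions, not the actual configuration of monitor 209510: the metric names `aws.rds.free_storage_space` and `aws.rds.total_storage_space`, the `dbinstanceidentifier` tag, the 15-minute window, the instance name, and the notification handle are all assumed for illustration. The function only builds the request body as a dictionary; it does not call the Datadog API.

```python
def build_monitor_payload(db_instance: str,
                          threshold_pct: int = 80,
                          notify: str = "@slack-example-channel") -> dict:
    """Build a Datadog-style metric monitor definition (hypothetical values)."""
    # Assumed tag used to scope the query to one RDS instance.
    scope = f"dbinstanceidentifier:{db_instance}"
    # Used % = (1 - free / total) * 100; metric names are assumptions.
    query = (
        "avg(last_15m):"
        f"( 1 - avg:aws.rds.free_storage_space{{{scope}}} "
        f"/ avg:aws.rds.total_storage_space{{{scope}}} ) * 100 "
        f"> {threshold_pct}"
    )
    return {
        "name": f"RDS used space above {threshold_pct}% on {db_instance}",
        "type": "metric alert",
        "query": query,
        # The handle in the message decides where the notification is sent.
        "message": f"RDS used space is above {threshold_pct}%. {notify}",
        "options": {"thresholds": {"critical": threshold_pct}},
    }
```

Calling `build_monitor_payload("sentry-example")` returns a dictionary that could be serialized with `json.dumps` and reviewed before anything is created in Datadog or translated to Terraform.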

jennb33 commented 7 months ago

Documentation is the only element remaining for this work.

jennb33 commented 7 months ago

We know the alert works!!! Well done @rjohnson2011 !

rjohnson2011 commented 7 months ago

Platform Alerts Documentation updated with new monitor / alert / dashboard links.

jennb33 commented 7 months ago

@rjohnson2011 is finishing up the Terraform work and this ticket will close today.