Update Datadog Vets API Error Panel Monitors, and Alerts

ajmagdub commented 10 months ago

On January 3, 2024, a change to the Vets Website caused the VAOS::V2::Appointments controller to fail to parse the start date, resulting in HTTP 400 errors. You can find more information about this issue in the DSVA Slack thread: https://dsva.slack.com/archives/CBU0KDSB1/p1704315634375599

However, these errors were not detected by the current Datadog VAOS Alert Dashboard or by its monitors and alerts. This ticket aims to investigate why the current 'Sum of Vets API Errors' panel, as well as its associated monitors and alerts, did not capture these errors. The goal is to modify them so that errors like these are identified and captured.

olivereri commented 7 months ago

After using Datadog's Metrics Explorer and reading up on StatsD, I have a better understanding of what might have gone wrong. In short, the monitor query behind the "Sum of Vets API Errors" panel includes a tag/key pair that filters out the data we're interested in.

The source_app:vaos tag is probably not applied correctly in the StatsD instrumentation: removing only that tag from the metric monitor query produces graphs that match the mhv-appointments service page.
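A minimal sketch of the suspected failure mode, not the actual vets-api instrumentation or the real Datadog query. The metric name, tag values, and counts below are hypothetical and chosen only to illustrate how a tag filter silently drops data points that were emitted without that tag:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One StatsD-style data point: a metric name, a value, and a set of key:value tags."""
    metric: str
    value: int
    tags: frozenset


def query_sum(points, metric, required_tags=()):
    """Sum the points for `metric` that carry every tag in `required_tags`."""
    required = frozenset(required_tags)
    return sum(
        p.value for p in points
        if p.metric == metric and required <= p.tags
    )


# Hypothetical metric name; the real monitor query is not shown in this ticket.
METRIC = "example.vaos.errors"

points = [
    # Tagged the way the monitor query expects.
    Point(METRIC, 3, frozenset({"status:400", "source_app:vaos"})),
    # Emitted without the source_app tag, so a source_app filter excludes it.
    Point(METRIC, 5, frozenset({"status:400"})),
    # Emitted with a different source_app value, also excluded.
    Point(METRIC, 2, frozenset({"status:500", "source_app:other"})),
]

with_filter = query_sum(points, METRIC, {"source_app:vaos"})
without_filter = query_sum(points, METRIC)

print(f"with source_app:vaos filter: {with_filter}")
print(f"without the filter:          {without_filter}")
```

In this toy data set the filtered query sees only part of the errors, which mirrors the gap between the two graphs: a monitor built on the filtered query can stay under its threshold while the unfiltered total climbs.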

Before removing the offending tag/key pair:

After removing the offending tag/key pair:

The second graph closely matches the mhv-appointments Datadog service page, where the issue was observed but no alert was triggered:

My intent is to retire the StatsD-based metric monitors "Sum of VAOS Vets API Errors" and "Count of VAOS Vets API Errors" and instead create monitors based on the Datadog APM metrics displayed on the 'mhv-appointments' service page.

olivereri commented 7 months ago

New monitors: https://vagov.ddog-gov.com/monitors/215209?view=spans

https://vagov.ddog-gov.com/monitors/215210?view=spans

va-albers commented 7 months ago

Thank you, @olivereri. Since these do alert the Watch Officer, would you mind adding a sentence to the monitor alerts saying what the Watch Officer should do when they trigger? Good traditional items would be "check the Slack message related to this alert for updates" or "questions about alerts should be left in #appointments-alerts" or "#appointments-team", or whatever is appropriate.

department-of-veterans-affairs / va.gov-team

Update Datadog Vets API Error Panel Monitors, and Alerts #72931