ajmagdub closed this 7 months ago
After using Datadog's metrics explorer and reading up on StatsD, I have a better understanding of what might have gone wrong. The short of it is that the query behind the "Sum of Vets API Errors" monitor filters on a tag/key pair, source_app:vaos, that excludes data we're interested in. That tag is probably not applied correctly in the StatsD instrumentation: removing only source_app:vaos from the metric monitor query, and leaving the rest of the query unchanged, produces graphs that match the mhv-appointments service page.
Before removing the offending tag/key pair:
After removing the offending tag/key pair:
The second graph closely matches the mhv-appointments Datadog service page, where the issue was observed but no alert was triggered:
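To make the failure mode concrete, here is a minimal Python sketch of how a tag filter can hide errors when the instrumentation does not attach that tag to every data point. It is only an illustration: the metric name api.errors, the status tags, and the Point and query_sum helpers are hypothetical and not taken from the actual StatsD instrumentation or Datadog query.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Point:
    """One emitted metric data point (hypothetical shape, for illustration only)."""
    metric: str
    value: int
    tags: frozenset = field(default_factory=frozenset)


def query_sum(points, metric, required_tags=()):
    """Sum values for `metric`, keeping only points that carry every required tag."""
    required = set(required_tags)
    return sum(
        p.value
        for p in points
        if p.metric == metric and required <= p.tags
    )


# Hypothetical data: three errors were emitted, but only one carries source_app:vaos.
points = [
    Point("api.errors", 1, frozenset({"source_app:vaos", "status:500"})),
    Point("api.errors", 1, frozenset({"status:400"})),
    Point("api.errors", 1, frozenset({"status:400"})),
]

# Filtering on the tag silently drops the untagged errors.
with_filter = query_sum(points, "api.errors", ["source_app:vaos"])
# Dropping the tag filter counts every emitted error.
without_filter = query_sum(points, "api.errors")

print(with_filter, without_filter)
```

Under these assumptions, the filtered sum undercounts the errors, so a threshold evaluated against it may never trip even though the unfiltered series clearly shows the spike.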
My intent is to retire the StatsD-based metric monitors Sum of VAOS Vets API Errors and Count of VAOS Vets API Errors, and instead create monitors based on the DD APM metrics that are displayed on the 'mhv-appointments' service page.
Thank you, @olivereri. Since these monitors do alert the Watch Officer, would you mind adding a sentence to the monitor alerts saying what the Watch Officer should do when they trigger? Good traditional items would be "check the Slack message related to this message for updates" or "questions about alerts should be left in #appointments-alerts" or "#appointments-team" or whatever is appropriate.
On January 3, 2024, a change to the Vets Website caused the VAOS::V2::Appointments controller to fail in parsing the start date, resulting in an HTTP 400 error. You can find more information about this issue in the DSVA Slack thread: https://dsva.slack.com/archives/CBU0KDSB1/p1704315634375599
However, these errors were not detected in the current Datadog VAOS Alert Dashboard, monitor, or alerts. This ticket aims to investigate why the current 'Sum of Vets API Errors' panel, as well as its associated monitors and alerts, did not capture this error. The goal is to modify them so that these errors can also be identified and captured.