department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit the complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[Data] Create an EZR monitor for warning of multiple failures for a submission #92735

Closed luis-simauchi closed 1 month ago

luis-simauchi commented 1 month ago

We now have separate errors for normal failures and "total/exhausted" failures. We need to update our configuration and notification messaging in Datadog (DD) and the APM channel to reflect the change.

The update separates the two as follows:
- api.1010ezr.failed_wont_retry (total/exhausted failures; no further retries)
- api.1010ezr.failed (normal failures)

Build a new monitor based on the api.1010ezr.failed StatsD call.
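A minimal sketch of how the StatsD key emitted by vets-api is assumed to appear as a metric name in Datadog. The mapping (prefix `vets_api.statsd.`, dots in the key replaced by underscores) is inferred only from the example query later in this thread, where `api.1010ezr.failed_wont_retry` shows up as `vets_api.statsd.api_1010ezr_failed_wont_retry`; the helper `datadog_metric_name` is hypothetical.

```python
def datadog_metric_name(statsd_key: str, prefix: str = "vets_api.statsd") -> str:
    """Map a vets-api StatsD key to the metric name assumed to appear in Datadog.

    Assumption: inferred from this thread's example query, not from the
    vets-api or Datadog agent configuration.
    """
    # Dots inside the StatsD key become underscores; the namespace is prepended.
    return f"{prefix}.{statsd_key.replace('.', '_')}"


if __name__ == "__main__":
    for key in ("api.1010ezr.failed", "api.1010ezr.failed_wont_retry"):
        print(key, "->", datadog_metric_name(key))
```

Under this assumption, the new monitor would query `vets_api.statsd.api_1010ezr_failed`.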

luis-simauchi commented 1 month ago

Ten retries will be our warning threshold for the new alert.

luis-simauchi commented 1 month ago

Example query to model the new one after:
sum(last_10m):sum:vets_api.statsd.api_1010ezr_failed_wont_retry{env:eks-prod}.as_count() >= 1
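A hedged sketch of a Datadog metric-monitor definition modeled on the example query, with the warning threshold of 10 from the earlier comment. Assumptions: the new monitor keeps the same `last_10m` window and `env:eks-prod` scope, and the metric name `vets_api.statsd.api_1010ezr_failed` is inferred by analogy with the example. The thread gives no critical threshold, so `HYPOTHETICAL_CRITICAL` is a placeholder, and `build_monitor` and the monitor name are hypothetical. The payload uses Datadog's `query alert` type and `options.thresholds`; for a `>=` comparator the warning threshold has to sit below the critical one, which the helper checks.

```python
import json
from typing import Optional


def build_monitor(name: str, metric: str, env: str, window: str,
                  critical: float, warning: Optional[float] = None) -> dict:
    """Build a Datadog 'query alert' monitor definition (hypothetical helper).

    The query sums the metric's count over `window` and alerts when the
    total is >= `critical`; `warning` (if given) must be lower than `critical`.
    """
    if warning is not None and warning >= critical:
        raise ValueError("warning threshold must be below critical for a >= alert")
    # {{ and }} are literal braces around the tag filter in the f-string.
    query = f"sum({window}):sum:{metric}{{env:{env}}}.as_count() >= {critical:g}"
    thresholds = {"critical": critical}
    if warning is not None:
        thresholds["warning"] = warning
    return {
        "name": name,
        "type": "query alert",
        "query": query,
        "options": {"thresholds": thresholds},
    }


HYPOTHETICAL_CRITICAL = 20  # placeholder: the thread only states the warning level

if __name__ == "__main__":
    monitor = build_monitor(
        name="1010EZR submission failures",  # hypothetical monitor name
        metric="vets_api.statsd.api_1010ezr_failed",  # inferred metric name
        env="eks-prod",
        window="last_10m",
        critical=HYPOTHETICAL_CRITICAL,
        warning=10,
    )
    print(json.dumps(monitor, indent=2))
```

Passing the `failed_wont_retry` metric with `critical=1` and no warning reproduces the example query exactly, which is a quick way to check that the template matches the existing monitor.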

luis-simauchi commented 1 month ago

monitor created: https://vagov.ddog-gov.com/monitors/274737