department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Refine Datadog monitor for Forms GET alarms #17964

Open jilladams opened 2 weeks ago

jilladams commented 2 weeks ago

Status

[2024-04-25] Asked via Slack if Chris K. can estimate async so that we can pull it into Sprint 3 for Josh to work on.

User Story or Problem Statement

As a product team, I want to get Datadog alarms only if a problem is critical or ongoing, not for every blip.

Description or Additional Context

We own a Datadog synthetic monitor that sends a GET request to the vets-api /v0/forms endpoint every minute, and expects a response within 1000ms: [Synthetics] GET vets-api /v0/forms (prod)

Any time a response takes >1000ms, the monitor alarms. That's not useful. More notes here: https://dsva.slack.com/archives/C05THHJHH2R/p1714069662900159?thread_ts=1714068389.937709&cid=C05THHJHH2R

We want to update the monitor to only alarm if ... (figure out better criteria here, when we refine this ticket)

Acceptance Criteria

chriskim2311 commented 2 weeks ago

Some notes: Response latency ranged from 2071ms to 54273ms.

I would recommend either:

- Increasing the response time threshold from the current 1000ms.
- Increasing the timeframe over which the test must be in a failed state before it alerts. Right now this timeframe is 5 minutes.

Either of these would be fairly easy to implement.
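
To make the trade-off between the two options concrete, here is a rough Python sketch, not Datadog's actual evaluation logic. It assumes one test run per minute (as described in the ticket), treats "failed state for N minutes" as N consecutive failed runs, and uses made-up latency series and a made-up 3000ms alternative threshold. The function names are hypothetical.

```python
from typing import List, Optional


def failing_runs(latencies_ms: List[int], limit_ms: int) -> List[bool]:
    """Mark each run as failed when its latency exceeds the limit."""
    return [ms > limit_ms for ms in latencies_ms]


def first_alert_index(
    latencies_ms: List[int], limit_ms: int, sustain_runs: int
) -> Optional[int]:
    """Return the index of the run at which an alert would fire, or None.

    Assumption: an alert fires once `sustain_runs` consecutive runs
    have failed. With one run per minute, 5 runs stands in for 5 minutes.
    """
    streak = 0
    for i, failed in enumerate(failing_runs(latencies_ms, limit_ms)):
        streak = streak + 1 if failed else 0
        if streak >= sustain_runs:
            return i
    return None


# Hypothetical per-minute latencies (ms), for illustration only.
blip = [300, 450, 2071, 400, 380, 2500, 410, 390, 350, 420]
outage = [300, 2071, 5000, 54273, 9000, 7000, 6500, 8000, 400, 380]

# 1000ms limit, 5 consecutive failed runs required.
print(first_alert_index(blip, 1000, 5))
print(first_alert_index(outage, 1000, 5))

# Option 1: raise the response time limit (3000ms is a made-up value).
print(first_alert_index(outage, 3000, 5))

# Option 2: require a longer failed period (10 runs is a made-up value).
print(first_alert_index(outage, 1000, 10))
```

In this toy data, isolated slow responses never produce a 5-run streak, while a sustained slowdown does. Raising the limit delays the alert slightly, and lengthening the window too far can miss a 7-minute slowdown entirely, so the chosen values would need to be checked against real latency history.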