Tyler: need to do Datadog discovery to understand how the metric gets reset. Tim can provide support and fill in information for understanding Datadog.
The event that this monitor & alert hinges on is sent from the Content Release workflow into Datadog.
Success is sent from here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L522
Explicit failure is sent from here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L627
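For context on what those two workflow steps amount to: they post an event into Datadog's events API, which the monitor then reads. A rough sketch of the equivalent call using the datadogpy client is below; the actual workflow uses its own mechanism, and the event title and tags here are assumptions rather than what the workflow really sends.

```python
# Minimal sketch (not the actual workflow step): posting a "content release
# success" event to Datadog with the datadogpy client. The event title and
# tags are assumptions for illustration only.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Event.create(
    title="content release success",   # assumed title
    text="Content release workflow completed successfully.",
    tags=["workflow:content-release", "status:success"],  # assumed tags
    alert_type="success",
)
```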
At a basic level, the content release monitor measures the time since the last success event. That is why the graph appears to rise slowly and then suddenly drop: each point is 'time since last success', so the value climbs over time and drops sharply when a new success event comes in.
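A quick illustration of that rise-and-drop behavior (made-up timestamps, not real monitor data):

```python
# Each evaluation point is "time since the most recent success event",
# so the series climbs steadily and drops back down when a new success arrives.
from datetime import datetime, timedelta

success_events = [
    datetime(2024, 4, 15, 8, 5),
    datetime(2024, 4, 15, 9, 10),   # next success ~65 minutes later
]

def time_since_last_success(now, events):
    past = [e for e in events if e <= now]
    return (now - max(past)) if past else None

for minutes in range(0, 181, 30):
    now = datetime(2024, 4, 15, 8, 5) + timedelta(minutes=minutes)
    print(now.time(), time_since_last_success(now, success_events))
```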
Content Release is meant to run continuously between 8am and 8pm. Right now, the CMS editor expectation is that their content will be published on VA.gov within 2 hours of being saved in Drupal. This monitor is meant to track that we are meeting that expectation and to alert when we do not.
The reasons we do this in addition to alerting on failure are:
In terms of how this is supposed to behave:
When I search under Monitors for "Time Since Last Content Release Monitor" I get three results.
https://vagov.ddog-gov.com/monitors/manage?q=Time+Since+Last+Content+Release+Monitor&order=desc
They all claim to be "Managed by Terraform", but I only see one result in the devops repo for this monitor.
I've been having discussions in Slack with @flooose about Datadog monitors. We are going to meet on Monday to discuss how we're trying to use Terraform to make monitors great again.
We rescheduled to meet tomorrow.
Today we also got an alert for a similar monitor:
It could be that when a couple of releases in a row take a long time, they produce this message. For example, when release A takes 1 hr 2 min and release B takes 1 hr 3 min, it has been over two hours since the content release event was last reset.
There are three monitors that match the string "[CMS] Time Since Last Content Release Monitor".
The third monitor is the one referenced in this ticket and alerted in Slack.
At this time we've muted it, because we've seen repeated instances of the alert going off when in fact the telemetry it's tracking (an event log) is not being reset properly.
The event it's measuring is whether or not a content-release workflow job has completed successfully.
The alert will go off, and we'll check the Content Release workflow, and it will look all green.
We occasionally also see alerts from the second monitor. For example, today:
At this point we see three successful events at time points 1, 2, and 3.
But in the metric, the duration isn't reset at time points 2 and 3.
The goal is to have this finished by end of sprint 10 (5/22).
We are going to replace the monitor named:
With the new monitor named:
Additionally, I have selected vagov-cms as the value of the service tag at the end of the monitor.
Tested the alerts and PagerDuty.
Yet, for some reason it continues to alert, even after we've seen one notification or have acknowledged it.
I'll be looking next at how to get it to alert only once.
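As a first step, I'll pull the monitor definition and look at its notification-related options (renotify settings, escalation message, etc.). Read-only sketch, with a placeholder monitor ID:

```python
# Read-only sketch: fetch the monitor definition to inspect its notification
# options as part of figuring out why it keeps alerting. The monitor ID below
# is a placeholder, not the real one.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

monitor = api.Monitor.get(123456)      # placeholder monitor ID
print(monitor["name"])
print(monitor.get("options", {}))      # e.g. renotify_interval, escalation_message
```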
Will work on rescoping for sprint 13. Sitting in parking lot for now.
Just a note on this ticket: we should revisit this.
Pulling Tim into this work to help support.
I am testing an alternate formulation for the monitor: https://vagov.ddog-gov.com/monitors/289659
This simply looks for a content release success event over the previous 2 hours. If that number drops below 1, it alerts.
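Roughly what that formulation amounts to if expressed as an event monitor through the API (the query string, names, tags, and notification handle below are assumptions for illustration; the real monitor is the one linked above and would normally be defined in Terraform with the rest):

```python
# Hedged sketch of the alternate formulation: count content-release success
# events over the previous 2 hours and alert when the count drops below 1.
# The event search string, monitor name, tags, and Slack handle are assumptions.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="event-v2 alert",
    query='events("content release success").rollup("count").last("2h") < 1',
    name="[CMS] Content Release Success Events (last 2h)",
    message="No content release success event in the last 2 hours. @slack-cms-notifications",
    tags=["service:vagov-cms"],
)
```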
User Story or Problem Statement
We need to understand why the alert was incorrectly sent to the cms-notifications channel and implement a solution within Datadog to prevent false positive alerts in the future.
Reference Links
Description or Additional Context
On 4/15 and 4/16, a false positive alert (Triggered: [CMS] Time Since Last Content Release Monitor (SLA Limit)) was sent to the cms-notifications channel within VA Slack. Fixing this alert needs to be prioritized because it will be monitored by the new Watchtower program that will be implemented later this year.
Steps for Implementation
Acceptance Criteria