Tyler: need to do Datadog discovery to understand how the metric gets reset. Tim can provide support and fill in information for understanding Datadog.
The event that this monitor & alert hinges on is sent from the Content Release workflow into Datadog.
Success is sent from here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L522
Explicit failure is sent from here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L627
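For context on what those two workflow steps amount to: they post an event into Datadog's events API, which the monitor then reads. A rough sketch of the equivalent call using the datadogpy client is below; the actual workflow uses its own mechanism, and the event title and tags here are assumptions rather than what the workflow really sends.

```python
# Minimal sketch (not the actual workflow step): posting a "content release
# success" event to Datadog with the datadogpy client. The event title and
# tags are assumptions for illustration only.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Event.create(
    title="content release success",   # assumed title
    text="Content release workflow completed successfully.",
    tags=["workflow:content-release", "status:success"],  # assumed tags
    alert_type="success",
)
```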
At a basic level, the content release monitor measures the time since the last success event. That is why the graph appears to rise slowly and then suddenly drop: each point is 'time since last success', so the value climbs over time and drops sharply when a new success event comes in.
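A quick illustration of that rise-and-drop behavior (made-up timestamps, not real monitor data):

```python
# Each evaluation point is "time since the most recent success event",
# so the series climbs steadily and drops back down when a new success arrives.
from datetime import datetime, timedelta

success_events = [
    datetime(2024, 4, 15, 8, 5),
    datetime(2024, 4, 15, 9, 10),   # next success ~65 minutes later
]

def time_since_last_success(now, events):
    past = [e for e in events if e <= now]
    return (now - max(past)) if past else None

for minutes in range(0, 181, 30):
    now = datetime(2024, 4, 15, 8, 5) + timedelta(minutes=minutes)
    print(now.time(), time_since_last_success(now, success_events))
```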
Content Release is meant to run continuously between 8am and 8pm. Right now, the CMS editor expectation is that their content will be published on VA.gov within 2 hours of being saved in Drupal. This monitor is meant to track that we are meeting that expectation and to alert when we do not.
The reasons we do this in addition to alerting on failure are:
In terms of how this is supposed to behave:
When I search under Monitors for "Time Since Last Content Release Monitor" I get three results.
https://vagov.ddog-gov.com/monitors/manage?q=Time+Since+Last+Content+Release+Monitor&order=desc
They all claim to be "Managed by Terraform", but I only see one result in the devops repo for this monitor.
I've been having discussions in Slack with @flooose about Datadog monitors. We are going to meet on Monday to discuss how we're trying to use Terraform to make monitors great again.
We rescheduled to meet tomorrow.
Today we also got an alert for a similar monitor:
It could be that when a couple of releases in a row take a long time, they produce this message. For example, when release A takes 1 hr 2 min and release B takes 1 hr 3 min, it has been over two hours since the content release event was last reset.
There are three monitors that match the string "[CMS] Time Since Last Content Release Monitor".
The third monitor is the one referenced in this ticket and alerted in Slack.
At this time we've muted it, because we've seen repeated instances of the alert going off when in fact the telemetry it's tracking (an event log) is not being reset properly.
The event it's measuring is whether or not a content-release workflow job has completed successfully.
The alert will go off, and we'll check the Content Release workflow, and it will look all green.
We occasionally also see alerts from the second monitor. For example, today:
At this point we see three successful events at time points 1, 2, and 3.
But in the metric, the duration isn't reset at time points 2 and 3.
The goal is to have this finished by end of sprint 10 (5/22).
We are going to replace the monitor named:
With the new monitor named:
Additionally, I have selected vagov-cms as the value of the service tag at the end of the monitor.
Tested the alerts and PagerDuty.
Yet, for some reason it continues to alert, even after we've seen one notification or have acknowledged it.
I'll be looking next at how to get it to alert only once.
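As a first step, I'll pull the monitor definition and look at its notification-related options (renotify settings, escalation message, etc.). Read-only sketch, with a placeholder monitor ID:

```python
# Read-only sketch: fetch the monitor definition to inspect its notification
# options as part of figuring out why it keeps alerting. The monitor ID below
# is a placeholder, not the real one.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

monitor = api.Monitor.get(123456)      # placeholder monitor ID
print(monitor["name"])
print(monitor.get("options", {}))      # e.g. renotify_interval, escalation_message
```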
Will work on rescoping for sprint 13. Sitting in parking lot for now.
Just a note on this ticket: we should revisit this.
Pulling Tim into this work to help support.
I am testing an alternate formulation for the monitor: https://vagov.ddog-gov.com/monitors/289659
This simply looks for a content release success event over the previous 2 hours. If that number drops below 1, it alerts.
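Roughly what that formulation amounts to if expressed as an event monitor through the API (the query string, names, tags, and notification handle below are assumptions for illustration; the real monitor is the one linked above and would normally be defined in Terraform with the rest):

```python
# Hedged sketch of the alternate formulation: count content-release success
# events over the previous 2 hours and alert when the count drops below 1.
# The event search string, monitor name, tags, and Slack handle are assumptions.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="event-v2 alert",
    query='events("content release success").rollup("count").last("2h") < 1',
    name="[CMS] Content Release Success Events (last 2h)",
    message="No content release success event in the last 2 hours. @slack-cms-notifications",
    tags=["service:vagov-cms"],
)
```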
User Story or Problem Statement
We need to understand why the alert was incorrectly sent to the cms-notifications channel and implement a solution within Datadog to prevent false positive alerts in the future.
Reference Links
Description or Additional Context
On 4/15 and 4/16, a false positive alert (Triggered: [CMS] Time Since Last Content Release Monitor (SLA Limit)) was sent to the cms-notifications channel within VA Slack. Fixing this alert needs to be prioritized because it will be monitored by the new Watchtower program that will be implemented later this year.
Steps for Implementation
Acceptance Criteria