department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Investigate Triggered: [CMS] Time Since Last Content Release Monitor (SLA Limit) False Positives Alerts #17881

Open gracekretschmer-metrostar opened 7 months ago

gracekretschmer-metrostar commented 7 months ago

User Story or Problem Statement

We need to understand why the alert was incorrectly sent to the cms-notifications channel, and implement a solution within Datadog to prevent false positive alerts in the future.

Reference Links

Description or Additional Context

On 4/15 and 4/16, a false positive alert (Triggered: [CMS] Time Since Last Content Release Monitor (SLA Limit)) was sent to the cms-notifications channel in VA Slack. Fixing this alert needs to be prioritized because it will be monitored by the new Watchtower program that will be implemented later this year.

Steps for Implementation

Acceptance Criteria

gracekretschmer-metrostar commented 6 months ago

Tyler: needs to do Datadog discovery to understand how the metric gets restarted. Tim can provide support and fill in information for understanding Datadog.

timcosgrove commented 6 months ago

The event that this monitor & alert hinges on is sent from the Content Release workflow into Datadog.

Success is sent from here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L522

Explicit failure is sent from here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L627

At a basic level, the content release monitor is expected to measure the time since the last success event. Each point in the monitor is 'time since last success', which is why the monitor appears to rise slowly and then drop suddenly when a new success comes in.
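To make the rise-then-drop shape concrete, here is a minimal Python sketch of a 'time since last success' value sampled over time. It is only an illustration of the idea described above, not how Datadog computes the monitor; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def time_since_last_success(success_times, now):
    """Return the time elapsed at `now` since the most recent success event.

    Returns None if no success event has happened at or before `now`.
    """
    earlier = [t for t in success_times if t <= now]
    if not earlier:
        return None
    return now - max(earlier)


# Hypothetical success events (illustrative timestamps only).
successes = [datetime(2024, 4, 15, 8, 0), datetime(2024, 4, 15, 9, 30)]

# Sample the value every 30 minutes from 08:00 to 10:00.
start = datetime(2024, 4, 15, 8, 0)
samples = [start + timedelta(minutes=30 * i) for i in range(5)]
series = [time_since_last_success(successes, t) for t in samples]

# Convert to whole minutes: the value climbs, then drops at the 09:30 success.
minutes = [int(s.total_seconds() // 60) for s in series]
print(minutes)
```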

Content Release is meant to run continuously between 8am and 8pm. Right now, CMS editors expect that their content will be published on VA.gov within 2 hours of being saved in Drupal. This monitor is meant to track whether we are meeting that expectation, and to alert when we are not.

The reasons we do this in addition to alerting on failure are:

In terms of how this is supposed to behave:

7hunderbird commented 6 months ago

When I search under Monitors for "Time Since Last Content Release Monitor" I get three results.

https://vagov.ddog-gov.com/monitors/manage?q=Time+Since+Last+Content+Release+Monitor&order=desc

CleanShot 2024-04-25 at 13 54 45

They all claim to be "Managed by Terraform", but I only see one result in the devops repo for this monitor.

CleanShot 2024-04-25 at 13 55 19

Question

7hunderbird commented 6 months ago

I've been having discussions in Slack with @flooose about Datadog monitors. We are going to meet on Monday to discuss how we are trying to use Terraform to make monitors great again.

7hunderbird commented 6 months ago

We rescheduled to meet tomorrow.

7hunderbird commented 6 months ago

Today we also got an alert for a similar monitor:

If a couple of releases in a row take a long time, they can produce this message.

For example, if release A takes 1 hr 2 min and release B takes 1 hr 3 min, then it has been over two hours since the content release event was last reset.
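The arithmetic behind this example can be sketched as follows. This is a rough illustration under stated assumptions, not a description of the real monitor: it assumes a 2-hour threshold, that release B starts as soon as release A ends, and it compares two possible readings of the timer (all variable names are hypothetical).

```python
SLA_MINUTES = 120  # assumed 2-hour threshold

release_a_minutes = 62  # release A takes 1 hr 2 min
release_b_minutes = 63  # release B takes 1 hr 3 min

# Reading 1 (assumption): release A's success event resets the timer,
# so the longest gap between resets is the longer of the two releases.
peak_if_a_resets = max(release_a_minutes, release_b_minutes)

# Reading 2 (assumption): no reset is registered until release B finishes,
# so the gap spans both releases back to back.
peak_if_a_does_not_reset = release_a_minutes + release_b_minutes

print(peak_if_a_resets, peak_if_a_resets > SLA_MINUTES)
print(peak_if_a_does_not_reset, peak_if_a_does_not_reset > SLA_MINUTES)
```

Only the second reading crosses the two-hour mark (125 minutes), which matches the "over two hours" figure in the comment above.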

7hunderbird commented 6 months ago

There are three monitors that match the string "[CMS] Time Since Last Content Release Monitor".

CleanShot 2024-05-06 at 14 05 21

  1. [CMS] Time Since Last Content Release Monitor
  2. [CMS] Time Since Last Content Release Monitor (2 hours)
  3. [CMS] Time Since Last Content Release Monitor (SLA Limit)

[CMS] Time Since Last Content Release Monitor (SLA Limit)

The third monitor is the one referenced in this ticket and alerted in Slack.

At this time we've muted it because we've seen repeated instances of the alert going off when in fact the kind of telemetry it's tracking (an event log) is not being reset properly.

The event it's measuring is whether or not a content-release workflow job has completed successfully.

The alert will go off, and we'll check the Content Release workflow, and it will look all green.

CleanShot 2024-05-06 at 14 09 26

[CMS] Time Since Last Content Release Monitor (2 hours)

We occasionally also see the second monitor. For example today:

Here we see three successful events, at time points 1, 2, and 3.

CleanShot 2024-05-06 at 16 22 40

But in the metric, the duration isn't reset at time points 2 and 3.

CleanShot 2024-05-06 at 16 23 50

gracekretschmer-metrostar commented 6 months ago

The goal is to have this finished by end of sprint 10 (5/22).

7hunderbird commented 6 months ago

We are going to replace the monitor named:

With the new monitor named:

Additionally, I have selected vagov-cms as the service tag at the end of the monitor.

CleanShot 2024-05-21 at 12 59 20

7hunderbird commented 6 months ago

Tested the alerts and PagerDuty.

Yet, for some reason, it continues to alert even after we've seen a notification or acknowledged it.

Next, I'll figure out how to get it to alert only once.

gracekretschmer-metrostar commented 5 months ago

Will work on rescoping for sprint 13. Sitting in parking lot for now.

timcosgrove commented 2 months ago

Just a note on this ticket: we should revisit this.

gracekretschmer-metrostar commented 1 month ago

Pulling Tim into this work to help support.

timcosgrove commented 3 weeks ago

I am testing an alternate formulation for the monitor: https://vagov.ddog-gov.com/monitors/289659

This simply counts content release success events over the previous 2 hours. If that count drops below 1, it alerts.