department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Migrate va.gov search #39505

Closed -- pjhill closed this issue 2 years ago

pjhill commented 2 years ago

Issue Description

The vagov-search canary appears to be in an error state. Still, let's move this monitor out of AWS CloudWatch and into Datadog.


Tasks

Acceptance Criteria

rbeckwith-oddball commented 2 years ago

The monitor has been created and is now running. Details can be found here.

The settings that we may want to adjust (and this will be for all migrated monitors):

  1. Occurrence (currently set to every 15 minutes)
  2. Priority level (currently info)
  3. Slack channel to alert

    a. How many failures in a row do we want to count as a notify-worthy alert?

I have not yet configured it to alert any channels; I'll hold off until after review.
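
For context, here is a rough sketch of where those settings could live for a Datadog synthetic test and its attached monitor. This is not the actual vagov-search configuration -- the field names follow Datadog's public Synthetics/monitor schema as I understand it, and every value is a placeholder to verify against the real monitor.

```python
# Rough sketch only -- not the real vagov-search config. Field names follow
# Datadog's public Synthetics test schema as I understand it; all values are
# placeholders.
synthetic_test_options = {
    "tick_every": 900,                       # 1. occurrence: run every 15 minutes (seconds)
    "retry": {"count": 0, "interval": 300},  # optional retries before a run counts as failed
}

# 2. priority (currently "info") and 3./3a. the Slack channel plus how many
#    consecutive failures should notify live on the attached monitor rather
#    than on the test itself in this sketch -- left unset until after review.
monitor_sketch = {
    "priority": 5,                               # lowest urgency, roughly "info"
    "message": "<no notification handles yet>",  # 3. add @slack-... once reviewed
}
```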

pjhill commented 2 years ago

Answers to questions around failures and alerting:

  1. CloudWatch was set up with the following rule --

To prevent overly noisy alerts to PagerDuty, change the Period to '15 minutes', then define the alarm condition as Greater/Equal to 2. This means the failure must occur at least twice within a 15-minute period to trigger an alert to team members.

Is similar behavior possible in Datadog? In other words, can we execute the test every 5 minutes, say, and alert only if we get two failures within a 15-minute period? (A sketch of one possible approach follows this list.)

  2. What are the choices? This affects what exactly?
  3. oncall is where CloudWatch was pointed, I believe.

    a. The current rules are above -- if we can re-create that routing, that would be best (see the sketch below).
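
Treating this strictly as a sketch and not the actual configuration: one way to get both the "two failures in 15 minutes" threshold and the CloudWatch-style routing to on-call via PagerDuty is a metric monitor over the synthetic test's results. The metric name, tag keys, test public ID, and notification handles below are assumptions for illustration; the real test's metric tags and the configured PagerDuty/Slack integration names would need to be confirmed in Datadog.

```python
# Sketch only (datadogpy): a metric monitor that reproduces the CloudWatch rule
# -- alert only when the check fails at least twice within 15 minutes -- and
# routes to the on-call rotation via PagerDuty, as the old alarm did.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Count failed runs of the vagov-search synthetic test over a rolling 15 minutes.
# The metric name and the check_id / check_status tags are my best guess at what
# the test emits; verify against the actual metrics in Datadog.
query = (
    "sum(last_15m):"
    "sum:synthetics.test_runs{check_id:<test-public-id>,check_status:failure}"
    ".as_count() >= 2"
)

# Notification handles resolve only if the PagerDuty / Slack integrations are
# already configured; both names here are placeholders.
message = (
    "vagov-search failed at least twice in the last 15 minutes.\n"
    "@pagerduty-<oncall-service> @slack-<alert-channel>"
)

api.Monitor.create(
    type="metric alert",
    query=query,
    name="vagov-search: 2+ synthetic failures in 15 minutes",
    message=message,
    options={"thresholds": {"critical": 2}, "notify_no_data": False},
)
```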

ddzz commented 2 years ago

@rbeckwith-oddball the Datadog test does not include any assertions around the search results, but the canary does. Can we add assertions?
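
For reference, assertions "around the search results" on an API-type synthetic test might look roughly like the following. The shapes follow Datadog's public Synthetics assertion schema as I understand it, and the targets are placeholders rather than whatever the real canary checks.

```python
# Illustrative assertion payloads for an API-type synthetic test; the real
# canary may assert on different fields of the search response entirely.
assertions = [
    {"type": "statusCode", "operator": "is", "target": 200},
    # Require that the body actually contains results, not just a 200 response.
    {"type": "body", "operator": "contains", "target": '"results"'},
    {"type": "responseTime", "operator": "lessThan", "target": 5000},  # milliseconds
]
```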

rbeckwith-oddball commented 2 years ago

Added an assertion for the search results @ddzz

pjhill commented 2 years ago

@rbeckwith-oddball -- Couple Q's --

  1. How often is this alerting?
  2. If it appears to be alerting on error conditions but nothing requires intervention, then we should tune the alerting.
  3. Is this connected to PagerDuty?
  4. The previous monitors alerted to Slack only via PagerDuty, I believe -- is that right? These monitors appear to be alerting Slack directly on failures.
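
One quick way to check 3 and 4 is to pull the monitor and look at the notification handles in its message: a @pagerduty-* handle means alerts still route through PagerDuty (the old path), while a @slack-* handle means Datadog is posting to Slack directly. A minimal sketch with datadogpy, assuming the monitor ID is known (the ID below is a placeholder):

```python
# Minimal sketch: inspect which notification handles the migrated monitor uses.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

MONITOR_ID = 123456  # placeholder -- the real vagov-search monitor ID
monitor = api.Monitor.get(MONITOR_ID)

print(monitor["name"])
print(monitor["message"])  # look for @pagerduty-* vs @slack-* handles
```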