department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

[devops] Fix PagerDuty maintenance mode alert for DEV/STAGING #2888

Closed ElijahLynn closed 3 years ago

ElijahLynn commented 3 years ago

Currently, our newly deployed deployment code activates a maintenance window on the "CMS Engineers Critical" service for all environments. Only PROD should do that. DEV and STAGING should activate a maintenance window on "CMS Engineers Non-Critical".

ElijahLynn commented 3 years ago

OK, @indytechcook and I discussed this and agreed that we can use "CMS Engineers Non-Critical". I started making this change in Ansible. We will need a new "Escalation Policy" in PagerDuty, but I can't configure it because no teams show up in the dropdown (I can create one, though).

DEV/STAGING already have a separate config block, and we can change it in ansible/deployment/config/prometheus/rules/cms.rules. I don't see how to do this yet, though.

ALERT SiteReachableNonCritical
  IF script_success{script=~"cms-login-page-(dev|staging)"} == 0
  FOR 5m
  LABELS { project="cms", severity="page", scope="application", check="{{ $labels.script }}" }
  ANNOTATIONS {
    summary = "CMS login page not reachable from vets.gov utility",
    description = "The monitor probe to check {{ $labels.script }} failed from the vets.gov utility network. There may be an issue loading content from Drupal for website builds.  See https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/blob/master/OnCall/alerts.md#sitereachablecritical"
  }

ElijahLynn commented 3 years ago

To better state the actual challenge here:

  1. We just started using the PagerDuty maintenance window support in BRD, so deploys now set a maintenance window in PagerDuty.
  2. But when our DEV deploy job sets a maintenance window, it sets it on the "CMS Engineers Critical" PagerDuty service, which is the same service PROD uses.
  3. PROD is then effectively in a maintenance window, so if there is a PROD outage, we won't get notified.
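
A rough sketch of the behavior we want, in case it helps the discussion: pick the PagerDuty service per environment, so a DEV or STAGING deploy never opens a window on the service PROD uses. This is not the actual BRD or Ansible code. The service IDs, `SERVICE_BY_ENV`, and `maintenance_window_payload` are hypothetical names for illustration; the payload shape is assumed to follow the PagerDuty REST API's "create a maintenance window" request body.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical service IDs; the real ones come from the PagerDuty account.
SERVICE_BY_ENV = {
    "prod": {"name": "CMS Engineers Critical", "id": "PCRIT01"},
    "staging": {"name": "CMS Engineers Non-Critical", "id": "PNONC01"},
    "dev": {"name": "CMS Engineers Non-Critical", "id": "PNONC01"},
}


def maintenance_window_payload(env, start, minutes=30):
    """Build a maintenance window request body for the given environment.

    Only PROD maps to the critical service; DEV and STAGING map to the
    non-critical one, so their deploys cannot silence PROD alerts.
    """
    service = SERVICE_BY_ENV[env.lower()]  # KeyError on an unknown environment
    end = start + timedelta(minutes=minutes)
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": f"CMS {env.upper()} deploy",
            "services": [{"id": service["id"], "type": "service_reference"}],
        }
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(maintenance_window_payload("dev", now))
```

Failing hard on an unknown environment name is deliberate in this sketch: silently falling back to the critical service would reproduce the problem described above.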