wyattwalter closed this issue 2 years ago
Some early discovery so far here on this particular event, and trying to map out a potential solution.
ExternalServiceAvailabilityCritical
Labels:
"AlarmDescription": "VA DEV API is unreachable" <-- this was the only Route53 health check that fired for some reason.
The event consolidation into a single incident worked flawlessly on the PagerDuty side. Unfortunately, we have each of these backends we utilize as separate Services (PagerDuty construct) in PagerDuty. I'm not aware of a way to consolidate events across several Services into a single incident, but Prometheus can help us filter these before sending along to PagerDuty.
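As a sketch of the filtering side, Alertmanager's routing tree can group events before they ever reach PagerDuty. This is a minimal, hypothetical config (the receiver name and integration key are placeholders, not our actual setup):

```yaml
route:
  # Hypothetical consolidated route: batch related connectivity alerts
  # into one notification rather than one incident per backend service.
  receiver: pagerduty-platform
  group_by: ['scope', 'datacenter']
  group_wait: 30s
  group_interval: 5m

receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'  # placeholder
```

Note this only consolidates notifications sent to a single PagerDuty Service; it doesn't merge incidents across separate Services.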
A possible solution to this is a structure of inhibit rules with preference:
Network reachability alarms:
Things we would need to build:
Then, we'd need a set of alerts with standard tags (examples):

- `scope: gateway-connectivity, gateway: east`
- `scope: datacenter-connectivity, datacenter: crrc`
- `scope: service-connectivity, service: ewis`
- `scope: service-health, service: ewis`
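For example, a connectivity alert carrying the standard tags might look like this (a sketch assuming a blackbox-exporter probe; the job name and expression are illustrative):

```yaml
groups:
  - name: external-connectivity
    rules:
      - alert: ExternalServiceReachabilityCritical
        # Assumes a blackbox exporter probe against the EWIS endpoint;
        # the job label here is hypothetical.
        expr: probe_success{job="ewis-connectivity"} == 0
        for: 5m
        labels:
          severity: critical
          scope: service-connectivity
          service: ewis
```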
We could possibly hardcode labels for which datacenter a service lives in. Services don't move terribly often, but it would be relatively easy for those labels to fall out of date. The resolved IP address is the most reliable source of truth here, but that's tricky to massage into Prometheus config or relabeling (maybe). We'd also have to account for any multi-datacenter apps, but today I'm not aware of any, aside from ones that rely on components in multiple datacenters.
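Hardcoding the datacenter label could be done at scrape time with static target labels, roughly like this (job name, hostnames, and exporter address are placeholders):

```yaml
scrape_configs:
  - job_name: datacenter-connectivity
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      # Datacenter assignment is hardcoded per target group and would
      # need a manual update if a service moved.
      - targets: ['ewis.internal.example']  # placeholder hostname
        labels:
          datacenter: crrc
          service: ewis
    relabel_configs:
      # Standard blackbox-exporter indirection: probe the target via the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # exporter address (placeholder)
```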
Then, we'd have a set of inhibit rules, something along the lines of:

```yaml
- target_match:
    alertname: ExternalServiceHealthCheckCritical
  source_match:
    alertname: ExternalServiceReachabilityCritical
  equal:
    - service
- target_match:
    alertname: ExternalServiceAvailabilityCritical
  source_match:
    alertname: DatacenterReachabilityCritical
  equal:
    - datacenter
- target_match:
    alertname: DatacenterReachabilityCritical
  source_match:
    alertname: TICGatewayReachability
```
This may depend on what we do with Datadog.
Good stuff here. We'll need to do something similar in Datadog.
We're starting to build out infra dashboards in Datadog. See https://github.com/department-of-veterans-affairs/va.gov-team/issues/45202. Closing.
Background
We have seen some blips in connectivity between VAEC in AWS and a bunch of endpoints on the VA WAN where alerts went a bit crazy. This creates a good deal of noise and makes it difficult for us to decipher what the real issue might be. Prometheus Alertmanager has a concept of alert "inhibitions" that we already use for a couple of things, and it would likely work really well here.
Need to investigate what happened starting here: https://dsva.slack.com/archives/C30LCU8S3/p1572412197196100 to see how we might narrow this down.
I think that we could probably find an endpoint across the VPN link to check our connectivity to each of the gateways, and inhibit any `ExternalServiceAvailabilityCritical` alerts based upon at least that. We also kind of know where different backends exist (at least within the datacenter), so it might be worth making a pass at attempting to monitor our ability to reach something within those datacenters as another way to bubble up only a single alert.

Relatedly, the EWIS proxy alerts and `website` error rates are pretty tightly coupled due to the way the integration works currently. We should be able to write an inhibit rule for `ApplicationErrorRateCritical` on the `website` component and `ExternalServiceAvailabilityCritical` on the `ewis-proxy` so that we only get the latter alert.

Goal
Create rules so that we get notifications when entire networks or services are down
AC