department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

prometheus inhibit rules for VA network blips #2924

Closed wyattwalter closed 2 years ago

wyattwalter commented 4 years ago

Background

We have seen some blips in connectivity between VAEC in AWS and a number of endpoints on the VA WAN, during which a large number of alerts fired at once. This creates a good deal of noise and makes it difficult for us to work out what the real issue is. Prometheus Alertmanager has a concept of alert "inhibitions," which we already use for a couple of things and which would likely work really well here.

Need to investigate what happened starting here: https://dsva.slack.com/archives/C30LCU8S3/p1572412197196100 to see how we might narrow this down.

I think that we could probably find an endpoint across the VPN link to check our connectivity to each of the gateways, and inhibit any ExternalServiceAvailabilityCritical alerts based upon at least that. We also kind of know where different backends exist (at least within the datacenter), so it might be worth making a pass at attempting to monitor our ability to reach something within those datacenters as another way to bubble up only a single alert.

Relatedly, the EWIS proxy alerts and website error rates are pretty tightly coupled due to the way the integration works currently. We should be able to write an inhibit rule for ApplicationErrorRateCritical on the website component and ExternalServiceAvailabilityCritical on the ewis-proxy so that we only get the latter alert.
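
As a rough sketch of the EWIS rule described above (not a tested config): the alert names come from this issue, but the `component` and `service` label names and their values are assumptions about how these alerts are labeled.

```yaml
# Alertmanager config fragment (sketch).
# Assumption: the alerts carry `service` and `component` labels with these values.
inhibit_rules:
  # While the ewis-proxy availability alert is firing, mute the
  # website error-rate alert so that only the former notifies.
  - source_match:
      alertname: ExternalServiceAvailabilityCritical
      service: ewis-proxy
    target_match:
      alertname: ApplicationErrorRateCritical
      component: website
```

No `equal` list is given here because the two alerts are assumed not to share a label that needs to match; if they do (for example an environment label), listing it under `equal` would keep the inhibition from crossing environments.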

Goal

Create rules so that we get notifications when entire networks or services are down.

AC

wyattwalter commented 4 years ago

Some early discovery so far here on this particular event, and trying to map out a potential solution.

Events fired when Large Network Event happened

Consolidated into single incident per service

ExternalServiceAvailabilityCritical

Labels:

Consolidated into single incident in "External: Appeals"

Labels:

Labels:

Consolidated into single incident in "DevOps: Critical"

Labels:

Labels:

Labels:

Route53 Health Check to DevOps NonCritical

"AlarmDescription": "VA DEV API is unreachable" <-- this was the only Route53 health check that fired for some reason.

Possible solution

The event consolidation into a single incident worked flawlessly on the PagerDuty side. Unfortunately, each of the backends we use is set up as a separate Service (a PagerDuty construct) in PagerDuty. I'm not aware of a way to consolidate events across several Services into a single incident, but Prometheus can help us filter these before sending them along to PagerDuty.

A possible solution to this is a structure of inhibit rules with preference:

Network reachability alarms:

Things we would need to build:

Then, we'd need a set of alerts with standard tags (examples):

We could possibly hardcode labels for which datacenter a service exists in. Services don't move terribly often, but it would be relatively easy for the labels to fall out of date. The resolved IP address is the most reliable source of truth here, but that's kind of tricky to massage into Prom config or relabeling (maybe). We'd also have to account for any multi-datacenter apps, but today I'm not aware of any except ones that rely on components in multiple DCs.
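
One way the hardcoded-label option might look, as a sketch only: this assumes the probes run through the Prometheus blackbox exporter, which this issue does not confirm. The job name, target hostname, label values, and exporter address are all hypothetical.

```yaml
# Prometheus scrape config fragment (sketch).
# Assumption: probes go through a blackbox exporter at blackbox-exporter:9115.
scrape_configs:
  - job_name: external-service-reachability   # hypothetical job name
    metrics_path: /probe
    params:
      module: [tcp_connect]                    # module from the exporter's example config
    static_configs:
      - targets: ['backend-a.example.internal:443']   # hypothetical backend
        labels:
          service: backend-a                   # hardcoded; must be kept up to date by hand
          datacenter: dc-east                  # hardcoded; hypothetical value
    relabel_configs:
      # Standard blackbox pattern: pass the target as a URL parameter,
      # keep it as the `instance` label, and scrape the exporter instead.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The trade-off is the one noted above: static labels are simple to write but drift silently if a backend moves.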

Then, we'd have a set of inhibit rules something along the lines of:

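inhibit_rules:
  # A service that is unreachable mutes its own health-check alert.
  - target_match:
      alertname: ExternalServiceHealthCheckCritical
    source_match:
      alertname: ExternalServiceReachabilityCritical
    equal:
      - service

  # A datacenter that is unreachable mutes availability alerts for services in it.
  - target_match:
      alertname: ExternalServiceAvailabilityCritical
    source_match:
      alertname: DatacenterReachabilityCritical
    equal:
      - datacenter

  # A TIC gateway outage mutes the datacenter reachability alerts.
  - target_match:
      alertname: DatacenterReachabilityCritical
    source_match:
      alertname: TICGatewayReachability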
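
For the `datacenter` match in the rules above to work, both the source and target alerts need to carry a `datacenter` label. A minimal sketch of a source alert, assuming blackbox-exporter probes (the `probe_success` metric) and a hypothetical `datacenter-reachability` job whose targets are labeled by datacenter:

```yaml
# Prometheus alerting rule fragment (sketch).
# Assumptions: the job name, the `for` duration, and the severity label are illustrative.
groups:
  - name: network-reachability
    rules:
      - alert: DatacenterReachabilityCritical
        # Fires only when every probe into a datacenter is failing;
        # `by (datacenter)` keeps the label the inhibit rule matches on.
        expr: max by (datacenter) (probe_success{job="datacenter-reachability"}) == 0
        for: 2m
        labels:
          severity: critical
```

Using `max` rather than `min` means a single failing probe does not mute every service alert in that datacenter.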

ricetj commented 4 years ago

This may depend on what we do with Datadog.

rmtolmach commented 2 years ago

Good stuff here. We'll need to do something similar in Datadog.

We're starting to build out infra dashboards in Datadog. See https://github.com/department-of-veterans-affairs/va.gov-team/issues/45202. Closing.