wyattwalter closed this issue 2 years ago
Some early discovery so far here on this particular event, and trying to map out a potential solution.
ExternalServiceAvailabilityCritical
Labels:
"AlarmDescription": "VA DEV API is unreachable" <-- this was the only Route53 health check that fired for some reason.
The event consolidation into a single incident worked flawlessly on the PagerDuty side. Unfortunately, we have each of these backends we utilize as separate Services (PagerDuty construct) in PagerDuty. I'm not aware of a way to consolidate events across several Services into a single incident, but Prometheus can help us filter these before sending along to PagerDuty.
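As a sketch of the filtering side, Alertmanager's routing tree can group events before they ever reach PagerDuty. This is a minimal, hypothetical config (the receiver name and integration key are placeholders, not our actual setup):

```yaml
route:
  # Hypothetical consolidated route: batch related connectivity alerts
  # into one notification rather than one incident per backend service.
  receiver: pagerduty-platform
  group_by: ['scope', 'datacenter']
  group_wait: 30s
  group_interval: 5m

receivers:
  - name: pagerduty-platform
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'  # placeholder
```

Note this only consolidates notifications sent to a single PagerDuty Service; it doesn't merge incidents across separate Services.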
A possible solution to this is a structure of inhibit rules with preference:
Network reachability alarms:
Things we would need to build:
Then, we'd need a set of alerts with standard tags (examples):

- `scope: gateway-connectivity, gateway: east`
- `scope: datacenter-connectivity, datacenter: crrc`
- `scope: service-connectivity, service: ewis`
- `scope: service-health, service: ewis`
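For example, a connectivity alert carrying the standard tags might look like this (a sketch assuming a blackbox-exporter probe; the job name and expression are illustrative):

```yaml
groups:
  - name: external-connectivity
    rules:
      - alert: ExternalServiceReachabilityCritical
        # Assumes a blackbox exporter probe against the EWIS endpoint;
        # the job label here is hypothetical.
        expr: probe_success{job="ewis-connectivity"} == 0
        for: 5m
        labels:
          severity: critical
          scope: service-connectivity
          service: ewis
```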
We could possibly hardcode labels for which datacenter a service lives in. Services don't move terribly often, but it would be relatively easy for those labels to fall out of date. The resolved IP address is the most reliable source of truth here, but that's tricky to massage into Prometheus config or relabeling (maybe). We'd also have to account for any multi-datacenter apps, but today I'm not aware of any, aside from ones that rely on components in multiple datacenters.
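Hardcoding the datacenter label could be done at scrape time with static target labels, roughly like this (job name, hostnames, and exporter address are placeholders):

```yaml
scrape_configs:
  - job_name: datacenter-connectivity
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      # Datacenter assignment is hardcoded per target group and would
      # need a manual update if a service moved.
      - targets: ['ewis.internal.example']  # placeholder hostname
        labels:
          datacenter: crrc
          service: ewis
    relabel_configs:
      # Standard blackbox-exporter indirection: probe the target via the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # exporter address (placeholder)
```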
Then, we'd have a set of inhibit rules, something along the lines of:

```yaml
- target_match:
    alertname: ExternalServiceHealthCheckCritical
  source_match:
    alertname: ExternalServiceReachabilityCritical
  equal:
    - service
- target_match:
    alertname: ExternalServiceAvailabilityCritical
  source_match:
    alertname: DatacenterReachabilityCritical
  equal:
    - datacenter
- target_match:
    alertname: DatacenterReachabilityCritical
  source_match:
    alertname: TICGatewayReachability
```
This may depend on what we do with Datadog.
Good stuff here. We'll need to do something similar in Datadog.
We're starting to build out infra dashboards in Datadog. See https://github.com/department-of-veterans-affairs/va.gov-team/issues/45202. Closing.
Background
We have seen some blips in connectivity between VAEC in AWS and a bunch of endpoints on the VA WAN where alerts went a bit crazy. This creates a good deal of noise and makes it difficult for us to decipher what the real issue might be. Prometheus Alertmanager has a concept of alert "inhibitions" that we already use for a couple of things, and it would likely work really well here.
Need to investigate what happened starting here: https://dsva.slack.com/archives/C30LCU8S3/p1572412197196100 to see how we might narrow this down.
I think that we could probably find an endpoint across the VPN link to check our connectivity to each of the gateways, and inhibit any `ExternalServiceAvailabilityCritical` alerts based upon at least that. We also kind of know where different backends exist (at least within the datacenter), so it might be worth making a pass at attempting to monitor our ability to reach something within those datacenters as another way to bubble up only a single alert.

Relatedly, the EWIS proxy alerts and `website` error rates are pretty tightly coupled due to the way the integration works currently. We should be able to write an inhibit rule for `ApplicationErrorRateCritical` on the `website` component and `ExternalServiceAvailabilityCritical` on the `ewis-proxy` so that we only get the latter alert.

Goal
Create rules so that we get notifications when entire networks or services are down
AC