department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:

https://depo-platform-documentation.scrollhelp.site/index.html


Investigate BlackboxDNSReachabilityCritical and BlackboxMonitorCritical alerts #56469

Open pjhill opened 1 year ago

pjhill commented 1 year ago

Description

In the past 7 days, the BlackboxDNSReachabilityCritical alert has triggered 34 times. How might we reduce the number of times this alert is being triggered without increasing the risk of a critical failure going unnoticed? Might we change the threshold? Might we change the measure by which we are triggering?
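To make the threshold question concrete, here is a minimal sketch of how requiring a failure to persist before firing changes the alert count. It is similar in spirit to a Prometheus rule's `for:` clause. The sample data and the function name `count_firings` are hypothetical; this is not the actual rule in `blackbox.rules`, and real probe data would need to be checked before picking a value.

```python
from typing import Sequence


def count_firings(probe_ok: Sequence[bool], hold: int) -> int:
    """Count alert firings when `hold` consecutive failed evaluations
    are required before the alert fires."""
    if hold < 1:
        raise ValueError("hold must be at least 1 evaluation")
    firings = 0
    consecutive_failures = 0
    for ok in probe_ok:
        if ok:
            # A successful probe resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        # Fire once per streak, at the moment it reaches the hold length.
        if consecutive_failures == hold:
            firings += 1
    return firings


# Hypothetical one-minute samples: three brief flickers, then a
# sustained 15-minute outage.
samples = (
    [True] * 5 + [False] * 2
    + [True] * 5 + [False] * 1
    + [True] * 5 + [False] * 3
    + [True] * 5 + [False] * 15
)

for hold in (1, 5, 10):
    print(f"hold={hold} min -> {count_firings(samples, hold)} firing(s)")
```

With this made-up data, a longer hold drops the flickers but still catches the sustained outage. The trade-off is that a real outage is reported `hold` minutes later.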

Update alerts.md with working links and action items, if any.

Per Kyle, it might be a good idea to deprecate this alert entirely and switch to DataDog for it. Is there already a replacement for this alert in DataDog? DataDog provides their own server to run synthetics from, which is outside of our infra.
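For reference, a DNS reachability check run from a machine outside our infra could look roughly like the sketch below. This is an illustration only: it is not DataDog's synthetics API and not how the blackbox exporter is implemented. The function name `dns_reachable`, the retry count, and the injectable `resolve` parameter are all assumptions made for the example.

```python
import socket
from typing import Callable


def dns_reachable(
    hostname: str,
    attempts: int = 3,
    resolve: Callable[..., list] = socket.getaddrinfo,
) -> bool:
    """Return True if `hostname` resolves within `attempts` tries.

    `resolve` defaults to the system resolver; it is injectable so the
    check can be exercised without network access.
    """
    for _ in range(attempts):
        try:
            if resolve(hostname, None):
                return True
        except OSError:
            # socket.gaierror is a subclass of OSError; count it as a
            # failed attempt and retry.
            continue
    return False
```

Retrying before reporting a failure is one way to change the measure we trigger on, since a single failed lookup no longer counts as an outage.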

Acceptance Criteria

[ ] A strategy for reducing the number of times this alert is triggered has been identified

rmtolmach commented 1 year ago

Is this an actionable alert from a Platform perspective?
When would it BE actionable?
- If it were permanently down (and not flickering like it seems to be now). We might need to file a SNOW ticket.
This blackbox lives in our OTHER AWS account.

This is where the alert lives in code: https://github.com/department-of-veterans-affairs/devops/blob/master/ansible/deployment/config/prometheus/rules/blackbox.rules#L46-L53

Docs on muting an alert, if this is too noisy in the middle of the night or something: https://vfs.atlassian.net/wiki/spaces/OT/pages/2440822789/Mute+Prometheus+Alert

rmtolmach commented 1 year ago

Other Blackbox... alerts have been firing frequently. BlackboxMonitorCritical has been active.

Past incidents for BlackboxMonitorCritical

Investigation for this alert done here: https://dsva.slack.com/archives/C30LCU8S3/p1683470195525129?thread_ts=1683458053.329979&cid=C30LCU8S3

Past incidents for BlackboxDNSReachabilityCritical

rmtolmach commented 1 year ago

This alert is still firing annoyingly often.

BlackboxMonitorCritical

ryan-mcneil commented 1 year ago

I'll also add that the time it's taking to resolve is creeping up. I've had several today that have taken an hour to resolve, including this one

rmtolmach commented 1 year ago

@ph-One increased the blackbox exporter instance size in https://github.com/department-of-veterans-affairs/devops/pull/13118. The instance was running out of memory. Let's see if this solves the problem.

rmtolmach commented 1 year ago

Kyle's fix on 6/8 pretty much solved this until 6/29 when BlackboxDNSReachabilityCritical started back up again. The current pattern of alerts is a few per day, all resolving in ~5 min.