pjhill opened this issue 1 year ago (status: Open)
This is where the alert lives in code: https://github.com/department-of-veterans-affairs/devops/blob/master/ansible/deployment/config/prometheus/rules/blackbox.rules#L46-L53
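For context, blackbox-exporter alerts of this kind are usually defined roughly like the sketch below. This is a minimal illustration in Prometheus 2.x rule syntax; the expression, duration, and labels are assumptions on my part, not the actual contents of blackbox.rules.

```yaml
# Illustrative sketch only: the expression, "for" duration, and labels
# are assumed, not copied from the devops repo.
groups:
  - name: blackbox
    rules:
      - alert: BlackboxDNSReachabilityCritical
        # probe_success is exported by the blackbox exporter;
        # 0 means the DNS probe failed for that target.
        expr: probe_success{job="blackbox_dns"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DNS probe failing for {{ $labels.instance }}"
```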
Docs on muting an alert if this gets too noisy in the middle of the night: https://vfs.atlassian.net/wiki/spaces/OT/pages/2440822789/Mute+Prometheus+Alert
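If a scheduled overnight mute ever makes more sense than a one-off silence, it can also be expressed in Alertmanager config. This is only a hypothetical fragment and an assumption about approach; the Confluence page above may describe a different procedure (e.g. a manual silence in the Alertmanager UI), and older Alertmanager versions use a top-level `mute_time_intervals:` key instead of `time_intervals:`.

```yaml
# Hypothetical Alertmanager fragment: route this alert through a named
# mute window so it does not page overnight. Names and times are
# illustrative only.
route:
  routes:
    - matchers:
        - alertname = "BlackboxDNSReachabilityCritical"
      mute_time_intervals:
        - overnight
time_intervals:
  - name: overnight
    time_intervals:
      - times:
          - start_time: "00:00"
            end_time: "06:00"
```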
Other Blackbox... alerts have been firing frequently. BlackboxMonitorCritical has been active.
Past incidents for BlackboxMonitorCritical
Investigation for this alert done here: https://dsva.slack.com/archives/C30LCU8S3/p1683470195525129?thread_ts=1683458053.329979&cid=C30LCU8S3
Past incidents for BlackboxDNSReachabilityCritical
This alert (BlackboxMonitorCritical) is still firing with annoying frequency.
I'll also add that the time it's taking to resolve is creeping up. I've had several today that took an hour to resolve, including this one.
@ph-One increased the blackbox exporter instance size in https://github.com/department-of-veterans-affairs/devops/pull/13118. The instance was running out of memory. Let's see if this solves the problem.
Kyle's fix on 6/8 pretty much solved this until 6/29, when BlackboxDNSReachabilityCritical started back up again. The current pattern of alerts is a few per day, all resolving in ~5 min.
Description
In the past 7 days, the BlackboxDNSReachabilityCritical alert has triggered 34 times. How might we reduce how often this alert triggers without increasing the risk of a critical failure going unnoticed? Might we change the threshold? Might we change the measure we trigger on?
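One concrete way to frame the threshold/measure question: instead of paging on a single failed probe scrape, the rule could require failures to be sustained over a window. The sketch below is hypothetical and for discussion only; the metric selector, window sizes, and 0.5 ratio are assumptions, not the current rule or a recommendation.

```yaml
# Hypothetical tuning for discussion: alert only when DNS probes have
# been failing for most of a 15-minute window, so a single slow or
# flaky probe cycle no longer pages. All values are illustrative.
- alert: BlackboxDNSReachabilityCritical
  expr: avg_over_time(probe_success{job="blackbox_dns"}[15m]) < 0.5
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Sustained DNS probe failures for {{ $labels.instance }}"
```

The trade-off is slower notification of a genuine full outage, which is exactly the "critical failure going unnoticed" risk called out above, so any change here should be weighed against paging expectations.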
Per Kyle, it might be a good idea to deprecate this alert entirely and switch to DataDog for it. Is there already a replacement for this alert in DataDog? DataDog runs synthetics from its own servers, which sit outside our infra.
Acceptance Criteria