department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[Alerting] Improve instance cycling alerts (in response to recent incident) #33325

Open jbritt1 opened 2 years ago

jbritt1 commented 2 years ago

Description

We should move and improve our current instance cycling alerts, like the ones observed here. In the incident that led to this postmortem, we did not see this alert until almost a full hour into troubleshooting.

Background/context

This work should fit naturally into our lift and shift from Prometheus to Datadog, alongside the rest of our efforts in this space. (No need to improve Prometheus.)

Technical notes

What is our threshold for the amount of time an ASG cycles before we're alerted?

Look at the existing rule in Prometheus and translate it to Datadog (improving as we go). See cluster.rules.
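To make the threshold question concrete, here is a minimal Python sketch of one way an "instance cycling" condition could be evaluated: count terminations per ASG inside a sliding window and flag any ASG that exceeds a limit. This is not the existing rule from cluster.rules and not a Datadog API call. The `InstanceEvent` shape, the `cycling_asgs` helper, the 15-minute window, and the limit of 3 terminations are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class InstanceEvent:
    """One ASG lifecycle event (hypothetical shape, not an AWS or Datadog type)."""
    asg_name: str
    kind: str       # "launch" or "terminate"
    at: datetime


def cycling_asgs(
    events: Iterable[InstanceEvent],
    now: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed window, not from cluster.rules
    max_terminations: int = 3,                  # assumed limit, not from cluster.rules
) -> list[str]:
    """Return the ASGs whose termination count inside the window exceeds the limit."""
    cutoff = now - window
    counts: dict[str, int] = {}
    for event in events:
        # Only terminations that fall inside [cutoff, now] count toward cycling.
        if event.kind == "terminate" and cutoff <= event.at <= now:
            counts[event.asg_name] = counts.get(event.asg_name, 0) + 1
    return sorted(name for name, n in counts.items() if n > max_terminations)
```

A shorter window or a lower limit would surface cycling sooner at the cost of more noise during normal deploys. The real values should come from the existing Prometheus rule and from how long the ASG cycled before anyone noticed in the incident.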


Tasks

Acceptance Criteria


Reminders

jhouse-solvd commented 1 year ago

@ph-One This may be interesting to tackle as part of #47025

Or, @npeterson54 as part of #46778

Let me know your thoughts.