We should move and improve our current instance cycling alert like the ones observed here. In the event that caused this postmortem, we did not see this alert until almost a full hour into troubleshooting the issue.
Background/context
As part of our lift and shift to Datadog from Prometheus, this one should flow naturally with the rest of our efforts in this space. (No need to improve Prometheus.)
Technical notes
Notes around work that is happening, if applicable (optional, please delete if unused)
What is our threshold for the amount of time ASG cycles before we're alerted? Look at existing rule in Prometheus and translate to Datadog (improving as we go). See cluster.rules.
Tasks
[ ] Instance cycle warning alert moved to Datadog
[ ] Instance cycle warning alert improved upon so it does not take an hour to reach the Ops team
[ ] Send to Slack and verify that the alert is tuned and actionable.
[ ] PagerDuty integration for production related instance cycles would also likely be a good to have here
Acceptance Criteria
[ ] Alert moved from Prometheus to Datadog
[ ] More rapid feedback from said alert
[ ] Alerts are sent to on-call channel and PagerDuty for Prod and Sandbox
[ ] Prod on-call
[ ] Prod pd
[ ] Sandbox on-call
[ ] Sandbox pd
Reminders
[ ] Please attach your team label and any other appropriate label(s) (operations, devops, and needs-grooming will automatically be applied as part of the template)
[ ] Please connect to an epic (this will typically be done by the Platform Operations PM or TL)
Description
We should move and improve our current instance cycling alert like the ones observed here. In the event that caused this postmortem, we did not see this alert until almost a full hour into troubleshooting the issue.
Background/context
As part of our lift and shift to Datadog from Prometheus, this one should flow naturally with the rest of our efforts in this space. (No need to improve Prometheus.)
Technical notes
Notes around work that is happening, if applicable (optional, please delete if unused) What is our threshold for the amount of time ASG cycles before we're alerted? Look at existing rule in Prometheus and translate to Datadog (improving as we go). See cluster.rules.
Tasks
Acceptance Criteria
Reminders