This PR is to get feedback on the initial suggested changes to our escalation procedure.
There are longer-term suggestions for how we can better utilize Opsgenie coming out of this work, but here is what I considered in scope for the initial updates:
Improve the ability for an on-call engineer to involve Subject Matter Experts in incidents.
Automate team alerts through routing rules and escalation policies.
If this is adopted, the following steps are needed to configure Opsgenie:
Create Teams in Opsgenie that map to our product teams.
Have each Team set up an escalation policy (a recommended default will be provided). As a simple example, an escalation policy could notify all members of the Platform team immediately once the team is added as a responder to the incident.
Add all PMs as Stakeholders in Opsgenie (Stakeholder accounts don't require additional user licenses).
Configure the Stakeholder email template.
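To make the escalation-policy step above concrete, here is a minimal Python sketch of the request body such a default policy might use for the Platform team. It only builds and prints the payload; it does not call Opsgenie. The field names (`ownerTeam`, `rules`, `notifyType`, `delay`, `recipient`) reflect my understanding of the Opsgenie Escalation API and should be checked against the current API reference. The helper name and the `<team>_escalation` naming convention are hypothetical.

```python
import json


def build_escalation_payload(team_name: str, delay_minutes: int = 0) -> dict:
    """Build a request body for a team-level escalation policy.

    Assumption: field names follow the Opsgenie Escalation API as I
    understand it; verify them before sending this to a real account.
    """
    return {
        # Hypothetical naming convention, not an Opsgenie requirement.
        "name": f"{team_name}_escalation",
        "ownerTeam": {"name": team_name},
        "rules": [
            {
                # Fire while the alert is still unacknowledged.
                "condition": "if-not-acked",
                # Notify every member of the recipient team.
                "notifyType": "all",
                # A delay of 0 means "immediately".
                "delay": {"timeAmount": delay_minutes, "timeUnit": "minutes"},
                "recipient": {"type": "team", "name": team_name},
            }
        ],
    }


payload = build_escalation_payload("Platform")
print(json.dumps(payload, indent=2))
```

Keeping the policy as a plain dictionary means each team can start from the same default and change only the delay or recipient before applying it.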
Potential Next Steps / considerations:
Bring more alerts directly into Opsgenie.
This will likely only be feasible if we clean up some of the current alert noise, but it could allow us to automate the routing of issues directly to teams for assessment.
Update the status page directly from the incident?
Consider adding Services, which could further automate the escalation policy at a team level.
Ultimately, the workflow below is the one Opsgenie recommends. It relies largely on two things that we aren't currently doing: 1) feeding all alerts into Opsgenie, and 2) using team-based on-call schedules rather than two assigned engineers.
https://artsyproduct.atlassian.net/browse/PLATFORM-3498