artsy / README

:wave: - The documentation for being an Artsy Engineer
Creative Commons Attribution 4.0 International

[WIP] feat: add escalation instructions #413

Closed mc-jones closed 3 years ago

mc-jones commented 3 years ago

This PR:

This PR requests feedback on the initial suggested changes to our escalation procedure.

This work also produced longer-term suggestions for how we can better use Opsgenie, but here is what I considered in scope for the initial updates:

  1. Improve the ability for an on-call engineer to involve Subject Matter Experts in incidents.
  2. Automate team alerts through routing rules and escalation policies.

If this is adopted, the following steps will be needed to configure Opsgenie:

  1. Create Teams in Opsgenie that map to our product teams.
  2. Have each Team set up an escalation policy (we will provide a recommended default). As a simple example, the escalation policy in the screenshot notifies all members of the Platform team immediately once the team is added as a responder to the incident. [Screenshot: example escalation policy, 2021-10-08]
  3. Add all PMs as Stakeholders in Opsgenie (Stakeholder accounts don't require additional user licenses).
  4. Configure the stakeholder email template. [Screenshot: stakeholder email template, 2021-10-08]
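
To make the example policy in step 2 concrete, here is a minimal sketch in plain Python of how "notify all members immediately once the team is added as a responder" could be modeled. It does not call the Opsgenie API; the names `Team`, `EscalationRule`, `recipients_at`, and the member names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationRule:
    # Minutes to wait after the team is added as a responder (hypothetical field).
    delay_minutes: int
    # Who to notify; this sketch only models "all_members".
    notify: str = "all_members"


@dataclass
class Team:
    name: str
    members: List[str]
    rules: List[EscalationRule] = field(default_factory=list)


def recipients_at(team: Team, minutes_since_added: int) -> List[str]:
    """Return the members notified by the given time, in member order."""
    notified: List[str] = []
    for rule in team.rules:
        due = rule.delay_minutes <= minutes_since_added
        if due and rule.notify == "all_members":
            for member in team.members:
                if member not in notified:
                    notified.append(member)
    return notified


# Example policy from step 2: notify everyone on Platform with no delay.
platform = Team(
    name="Platform",
    members=["alice", "bob"],  # placeholder names, not real team members
    rules=[EscalationRule(delay_minutes=0)],
)

# A hypothetical team that waits 5 minutes before notifying everyone.
delayed = Team(
    name="Example",
    members=["carol"],
    rules=[EscalationRule(delay_minutes=5)],
)

print(recipients_at(platform, 0))
print(recipients_at(delayed, 0))
```

With a zero-minute delay the whole team is notified as soon as it is added as a responder; a team that prefers a staged rollout could use a nonzero delay instead.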

Potential Next Steps / considerations:

Ultimately, the workflow below is the one Opsgenie recommends. It largely depends on two things that we aren't currently doing: 1) feeding all alerts into Opsgenie, and 2) using team-based on-call schedules rather than two assigned engineers.

[Screenshot: workflow recommended by Opsgenie, 2021-10-05]

https://artsyproduct.atlassian.net/browse/PLATFORM-3498