Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.
MIT License

Create and document incident management process #3474

Open benbp opened 2 years ago

benbp commented 2 years ago

It would be useful to have a checklist that people can follow to determine roles during the early stages of an incident, along with the first steps and priorities that each role should consider. For example:

Additionally, we need better centralized documentation on how to handle escalations to various dependencies, e.g. internal common engineering services (pipeline agents, container registries, Azure services), GitHub, package management hosts, etc.

kurtzeborn commented 2 years ago

We already have this for ICM incidents: https://dev.azure.com/azure-sdk/internal/_wiki/wikis/internal.wiki/395/On-Call-for-Azure-SDKs

However, we should create an "EngSys Outages" page based on the lessons learned in the post-mortem.