Open benbp opened 2 years ago
We already have this for ICM incidents (https://dev.azure.com/azure-sdk/internal/_wiki/wikis/internal.wiki/395/On-Call-for-Azure-SDKs). However, we should also create an "EngSys Outages" page based on the lessons learned in the post-mortem.
It would be useful to have a checklist that people can follow to determine roles during the early stages of an incident, along with the first steps and priorities each role should consider. For example:
Additionally, we need better centralized documentation on how to handle escalations to our various dependencies, e.g. internal common engineering services (pipeline agents, container registries, Azure services), GitHub, package management hosts, etc.