Our documentation of incident-handling procedures and tools has proliferated, and it's been difficult to keep it up to date with evolving tools and conventions. This [work-in-progress] update attempts to consolidate it into a single doc. It doesn't yet remove the pre-existing docs.
For instructions that are only occasionally relevant and shouldn't get in the way of an on-call engineer who may be under pressure to respond to an incident, I use the `<details><summary>...` syntax to keep that content collapsed by default.
(Some of the unrelated edits and questionable whitespace choices are required by the pre-commit hook, not me.)
Thanks for all the feedback. I'll merge, noting these areas of possible improvement:
- Clarify how to identify additional responders, especially once projects' point people are updated and spread across geographies.
- Use incident priorities more strategically, perhaps by encouraging tracking of "low"/"informational" incidents that might not trigger on-call notifications.