PagerDuty / incident-response-docs

PagerDuty's Incident Response Documentation.
https://response.pagerduty.com
Apache License 2.0

Remove references to "root cause(s)", as incidents should focus on all "contributing factors", including the trigger, that led to the incident #95

Open theckman opened 5 years ago

theckman commented 5 years ago

One change we are seeing in our industry is wider adoption of the belief that distilling an incident down to a single root cause is a myth[1][2]. As the complexity of our systems grows, so does the complexity of our incidents, and trying to isolate an incident to one item doesn't produce the kind of learnings we need to come out of those incidents.

The truth is that each incident is unique because of the multiple factors that contributed to it; if any one of those factors had been different, it would have been a completely different incident. Without giving each of those factors the same care, we miss the opportunity to address each of them.

While pluralizing "root cause" to "root causes" can get you a good part of the way there, in my experience the change in verbiage from "root causes" to "contributing factors" does much more to shift how people think about incidents, and it drives the learnings in the way we want. I was initially skeptical that such a minimal language change would make a difference, but I can happily admit I was wrong.

At Netflix we've started to change our internal language around this, and have found a much richer set of learnings from teams after an incident. Because I was a responder at PagerDuty when we started to form these practices, and at the inception of this documentation, I feel it would be a miss if we didn't iterate on these documents to keep up with learnings from our industry.

We, and others, have started to talk about Contributing Factors instead. We still identify what was traditionally called the "root cause", but we list it as one of the factors (often called out as the trigger).

What are your thoughts on updating the verbiage of this documentation to align with our industry shifting its way of thinking?

[1] https://medium.com/@jpaulreed/dev-ops-and-determinism-966a57e3a5cc
[2] https://en.wikipedia.org/wiki/Fallacy_of_the_single_cause