department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[Platform SRE] Update the incident response process #36163

Open jhouse-solvd opened 2 years ago

jhouse-solvd commented 2 years ago

Problem Statement

The incident response process is unclear. It does not say when to treat an issue as an incident, or how to classify, prioritize, and respond to incidents when they occur.

Background / Context

How might we

...update the incident response process so that it's clear for end-users?

...update the incident response process so that it's clear for SRE personnel?

...update the incident response process with guidance for classification and prioritization?

...update the incident response process with a communication protocol and responsibilities?

Hypothesis or Bet

This will make it easier for VFS teams to declare an incident. This will make it easier for SRE (and platform teams) to respond to incidents.

We will know we're done when... ("Definition of Done")

When there is an updated incident response process in place.

Known Blockers/Dependencies

List any blockers or dependencies for this work to be completed

Projected Launch Date

TBD

Launch Checklist

Is this service / tool / feature...

... tested?

... documented?

... measurable?

When you're ready to launch...

Required Artifacts

Documentation

Testing

Measurement

jhouse-solvd commented 2 years ago

Here is the existing on-call rotation doc, which contains links to incident categorization, incident command info, and the incident response playbook:

https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/tree/master/OnCall

jhouse-solvd commented 2 years ago

One of the first things that we need to do is to understand what exists already.

We can refer to the documentation posted above and build our understanding of the existing processes and information.

jhouse-solvd commented 2 years ago

It could be worth it to workshop incident categorization and prioritization based on actual incidents that have occurred over the past couple of years.

zachclarity commented 2 years ago

@jhouse-solvd Which of the following should be covered by the Incident Plan? https://docs.google.com/spreadsheets/d/1Fn2lD419WE3sTZJtN2Ensrjqaz0jH3WvLaBtn812Wjo

oseasmoran73 commented 2 years ago

A couple of questions: