department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Define mean time to resolve an incident (MTTR) #3191

Closed meganhicks closed 1 month ago

meganhicks commented 1 month ago

As the VRO Team, we aim to define the mean time to resolve an incident. This will help us understand our average resolution time and identify areas for process improvement.

AC:

  1. This should only be for incidents where VRO is the root cause of the problem.
  2. Determine how VRO will measure the mean time to resolve an incident (MTTR). Ensure that MTTR is calculated by severity level.
  3. Determine how the team will form a baseline for this metric.
  4. Identify the tool(s) the team will use to measure the metric.
  5. Establish the cadence and process the team will follow to measure MTTR. The Enablement Team has requested that this metric be reported at each Sprint Review. This is negotiable.
  6. Create documentation titled "Metrics" to ensure points 1-3 are agreed upon and communicated across the team.
  7. Ensure responsibilities align with "on call" documentation.

Notes: Resolution should be considered "fixed" from the partner team's perspective.
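
To make AC 1 and 2 concrete, here is a minimal sketch of how MTTR by severity level could be computed from a list of incident records. The field names (`severity`, `vro_root_cause`, `created_at`, `resolved_at`) and the sample data are hypothetical and do not come from PagerDuty or any other tool. `resolved_at` is assumed to be the time the partner team considers the incident fixed.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_severity(incidents):
    """Return {severity: mean time to resolve} as timedelta values.

    Only resolved incidents with VRO as the root cause are counted.
    """
    durations = defaultdict(list)
    for inc in incidents:
        # AC 1: count only incidents where VRO is the root cause.
        if not inc["vro_root_cause"]:
            continue
        # Open incidents have no resolution time yet, so skip them.
        if inc["resolved_at"] is None:
            continue
        durations[inc["severity"]].append(inc["resolved_at"] - inc["created_at"])
    # AC 2: one mean per severity level.
    return {sev: sum(d, timedelta()) / len(d) for sev, d in durations.items()}


# Hypothetical sample data, for illustration only.
incidents = [
    {"severity": "SEV1", "vro_root_cause": True,
     "created_at": datetime(2024, 1, 1, 9, 0),
     "resolved_at": datetime(2024, 1, 1, 10, 0)},
    {"severity": "SEV1", "vro_root_cause": True,
     "created_at": datetime(2024, 1, 2, 9, 0),
     "resolved_at": datetime(2024, 1, 2, 12, 0)},
    {"severity": "SEV2", "vro_root_cause": True,
     "created_at": datetime(2024, 1, 3, 9, 0),
     "resolved_at": datetime(2024, 1, 3, 9, 30)},
    # Excluded: root cause was outside VRO.
    {"severity": "SEV2", "vro_root_cause": False,
     "created_at": datetime(2024, 1, 4, 9, 0),
     "resolved_at": datetime(2024, 1, 4, 14, 0)},
    # Excluded: still open.
    {"severity": "SEV3", "vro_root_cause": True,
     "created_at": datetime(2024, 1, 5, 9, 0),
     "resolved_at": None},
]

result = mttr_by_severity(incidents)
for severity, mttr in sorted(result.items()):
    print(severity, mttr)
```

Whatever tool ends up producing the number, the same two filters (VRO root cause, resolved only) would need to be applied for the metric to match the AC.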

brostk commented 1 month ago

Some initial research for whoever picks up this ticket: DataDog may work great for this (see https://docs.datadoghq.com/dora_metrics/failures/). We can hook up DD with PagerDuty using a webhook, and MTTR will be automatically calculated based on incident start and end times. As that link states, though, DORA metrics in DD are currently in public beta, which could be a red flag.

More information: https://docs.datadoghq.com/dora_metrics/setup/.

msnwatson commented 1 month ago

I feel like this is exactly what I would want, but yeah, I don't feel comfortable with using a public beta 😅 we might have to do something a bit more manual for now unfortunately

lisac commented 1 month ago

For consideration: the Incident Report Slack workflow is integrated with PagerDuty. As part of its opening actions, the workflow creates an incident in PagerDuty (similar to how it creates a GitHub issue for the incident). New as of this week, the Slack workflow is also set up to mark the incident in PagerDuty as resolved, if the responder follows through with the workflow steps. It's also an option to go into the PagerDuty web UI directly to mark the incident as resolved.

brostk commented 1 month ago

It looks like PagerDuty has a dashboard that shows MTTR over some time range, under Analytics -> Dashboard. Definitely worth discussing whether this works well enough for now for this ticket. Mason suggested DataDog would be a better long-term solution, since our metrics and incident visualization would be consolidated, sometime after DD's DORA metrics feature exits public beta.

meganhicks commented 1 month ago

Maybe we can make this a 16th min? We need to have something even if the second iteration is DD.

msnwatson commented 1 month ago

The nice thing about the PagerDuty dashboard is that it already has a baseline established from our history of using the tool for incident management. It also seems to support breakdowns by severity level out of the box. The goal would be that current features essentially trivialize this ticket: we just have to make sure the team knows how to find this dashboard, and we can easily pull a graphic to include in any reports to the Enablement Team.

BerniXiongA6 commented 1 month ago

Hi @brostk would you be able to comment on your progress on this ticket? Do you need more time or will this carry over to the next sprint?

brostk commented 1 month ago

Sure - I'm about to finish up the documentation and post it on a wiki page. Currently working with the team to make sure we're on the same page about acceptance criteria 3 and 5. I'd expect to finish today.

BerniXiongA6 commented 1 month ago

thanks @brostk !