department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Align and finalize all Incident Response documentation #3199

Closed bianca-rivera closed 1 month ago

bianca-rivera commented 2 months ago

Description

As the VRO Team we want to ensure we have alignment around all things incident response so the process rolls out as smoothly as possible.

AC

Resources

2570

Tasks

Incident Response Wiki

Create new content:

Review & edit existing content:

bianca-rivera commented 1 month ago

Chat from VRO team discussion:

13:14:43 From Megan Hicks to Everyone: Imo we should add a definition to this page: https://github.com/department-of-veterans-affairs/abd-vro/wiki/Incident-Response
13:14:43 From Bianca Rivera Alvelo to Everyone: It should be on this page - Incident Response. I think we can use that page as the central source for all things regarding incidents (definition, response process, comms standards, etc.)
13:15:02 From Kyle Brost to Everyone: Reacted to "It should be one thi..." with 👍
13:15:05 From Kyle Brost to Everyone: Removed a 👍 reaction from "It should be one thi..."
13:15:07 From Kyle Brost to Everyone: Reacted to "It should be one thi..." with ➕
13:15:09 From Lisa Chung to Everyone: Reacted to "It should be one thi..." with ➕
13:15:12 From Lisa Chung to Everyone: Reacted to "Imo we should a defi..." with ➕
13:15:12 From Megan Hicks to Everyone: The partner teams should not have to concern themselves with how we categorize incidents.
13:15:22 From Kyle Brost to Everyone: Reacted to "The partner teams sh..." with ➕
13:15:24 From Bianca Rivera Alvelo to Everyone: Reacted to "The partner teams sh..." with ➕
13:16:27 From Lisa Chung to Everyone: Reacted to "The partner teams sh..." with ➕
13:18:45 From Megan Hicks to Everyone: I think we should have definitions on the wiki, but we could also have them in the epics
13:19:16 From Megan Hicks to Everyone: This way whoever is on call can refer to the appropriate epic for step-by-step instructions
13:19:45 From Ponnia Muyen to Everyone: Reacted to "The partner teams sh..." with ➕
13:24:45 From Bianca Rivera Alvelo to Everyone: Incident Definition
13:25:25 From Bianca Rivera Alvelo to Everyone: Root Cause Label and Definition
13:25:49 From Bianca Rivera Alvelo to Everyone: Incident Response Process & SLAs
13:26:45 From Megan Hicks to Everyone: Replying to "Incident Definition"

An incident is an event or occurrence, often unexpected or unusual, that can disrupt normal operations, processes, or activities on the VRO platform.

13:27:42 From Kyle Brost to Everyone: Replying to "Incident Definition"

An incident is an issue with our services which requires immediate attention to address, resolve, or investigate

13:28:19 From Ponnia Muyen to Everyone: Replying to "Incident Definition"

An incident is an unplanned event or occurrence that disrupts normal operations, services, or functions within an organization. Incidents can vary widely in scope and severity, from minor issues affecting a single user to major events impacting critical business operations. The key characteristics of an incident typically include:
Unplanned Event: An incident is not scheduled or expected and occurs unexpectedly, causing disruption.
Service Disruption: It disrupts the normal functioning of services, systems, or processes.
Impact on Partner teams: Potentially causing inconvenience or loss of productivity
Requires Response: An incident necessitates a response to mitigate its effects, restore normal operations, and prevent recurrence.

13:28:23 From Mason to Everyone: Replying to "Incident Definition"

I’ll include a lot of Cheng’s language but incident defined as an event which results in significant performance degradation, service disruption, app outage, data loss, or privacy issues relating to any deployment or infrastructure on the VRO platform. Not necessarily caused by VRO team actions

13:28:24 From Cheng to Everyone: Replying to "Incident Definition"

An incident is an unplanned disruption or degradation of service that impacts normal operations or user experience.

13:28:58 From Berni Xiong (she/her) to Everyone: Reacted to "An incident is an un..." with ➕
13:30:26 From Megan Hicks to Everyone: Replying to "Root Cause Label and..."

Root Cause for VRO: The root cause of the incident is determined after investigation, revealing that the failure is due to an issue on the VRO platform. These issues should directly tie to our team's quality metrics and are what we measure our MTTR (Mean Time to Resolve) around.

13:30:40 From Kyle Brost to Everyone: Reacted to "Root Cause for VRO: ..." with ➕
13:30:52 From Ponnia Muyen to Everyone: Have to drop due to meeting conflict
13:31:58 From Megan Hicks to Everyone: Replying to "Root Cause Label and..."

Root Cause for LHDI: The root cause of the incident is determined after investigation, revealing that the failure is due to an issue on the LHDI platform. These issues are not within VRO's control, but we must report them to LHDI and work to resolve them in partnership with LHDI.

13:32:10 From Berni Xiong (she/her) to Everyone: Reacted to "Root Cause for VRO: ..." with ➕
13:32:16 From Berni Xiong (she/her) to Everyone: Reacted to "Root Cause for LHDI:..." with ➕
13:32:24 From Bianca Rivera Alvelo to Everyone: Replying to "Incident Response Pr..."

(Might be good to screen share the wiki for this for reference)

13:33:06 From Kyle Brost to Everyone: Replying to "Root Cause Label and..."

Broadly, a root cause/label of an incident should be attributed to the team responsible for causing the issue, not necessarily the team responsible for addressing it.

13:35:02 From Bianca Rivera Alvelo to Everyone: Replying to "Root Cause Label and..."

Root Cause should be documented in Incident Reports and labeled on incident tickets. It should not be included in updates or other communication had with partner teams while the incident is being resolved to avoid bias of addressing incidents differently based on root cause.

13:35:19 From Megan Hicks to Everyone: Replying to "Root Cause Label and..."

Root Cause for Partner Team Application or External VA: The root cause of the incident is determined after investigation, revealing that the failure is due to an issue with the partner team application controlled by the partner team, or due to a VA external system not functioning appropriately. These issues are not within VRO's control, but we must work in partnership with our partner teams to investigate and assist in their resolution.

13:36:43 From Lisa Chung to Everyone: Replying to "Root Cause Label and..."

Root cause for LHDI: traces to the administration of ArgoCD, Aqua, k8s, SecRel, or Vault.

13:36:46 From Megan Hicks to Everyone: Reacted to "An incident is an un..." with ➕
13:36:57 From Megan Hicks to Everyone: Reacted to "An incident is an un..." with 👍
13:37:10 From Lisa Chung to Everyone: Reacted to "Root Cause for VRO: ..." with ➕
13:37:56 From Berni Xiong (she/her) to Everyone: Reacted to "An incident is an un..." with 👍
13:38:22 From Kyle Brost to Everyone: Reacted to "I’ll include a lot o..." with ➕
13:38:27 From Lisa Chung to Everyone: Reacted to "I’ll include a lot o..." with ➕
13:38:28 From Lisa Chung to Everyone: Reacted to "An incident is an un..." with ➕
13:38:53 From Cheng to Everyone: Reacted to "I’ll include a lot o..." with ➕
13:39:13 From Lisa Chung to Everyone: Replying to "Incident Definition"

To Cheng’s, one proposed change: “impacts normal operations” -> “impacts or threatens to impact normal operations”

13:41:08 From Cheng to Everyone: Reacted to "Root Cause for VRO: ..." with ➕
13:43:45 From Megan Hicks to Everyone: Replying to "Root Cause Label and..."

Also, under LHDI, I think it's good to note that when we report issues to LHDI we should say whether they are blocking us.
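The root cause labels proposed in this part of the chat could be modeled roughly as follows. This is a minimal sketch, not anything from the VRO codebase: the `RootCause` enum, the `label_root_cause` function, and the `owned_by_vro` flag are hypothetical names introduced for illustration. The LHDI component list is taken from the message above that names ArgoCD, Aqua, k8s, SecRel, and Vault.

```python
from enum import Enum


class RootCause(Enum):
    """The three root cause labels proposed in the discussion."""
    VRO = "VRO"
    LHDI = "LHDI"
    PARTNER_OR_EXTERNAL_VA = "Partner Team Application or External VA"


# Components named in the chat as LHDI-administered (lowercased for matching).
LHDI_COMPONENTS = {"argocd", "aqua", "k8s", "secrel", "vault"}


def label_root_cause(component: str, owned_by_vro: bool) -> RootCause:
    """Attribute an incident to the team that caused it, not the team fixing it.

    `component` is the failing system found during investigation;
    `owned_by_vro` is a hypothetical flag saying whether VRO controls it.
    """
    if component.strip().lower() in LHDI_COMPONENTS:
        return RootCause.LHDI
    if owned_by_vro:
        return RootCause.VRO
    return RootCause.PARTNER_OR_EXTERNAL_VA
```

As suggested in the chat, a label like this would be recorded on the incident ticket and in the Incident Report only, and left out of partner-facing updates while the incident is still being resolved.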

13:44:23 From Erik Nelsestuen to Everyone: Replying to "Root Cause Label and..."

Any unplanned disruption or degradation of the common infrastructure elements VRO provides. These events negatively impact the availability, performance, security, or functionality of the VRO service. Examples include (but are not limited to):

13:44:32 From Erik Nelsestuen to Everyone: Replying to "Root Cause Label and..."

1. **Service Outages**: Complete or partial unavailability of provided infrastructure services.
2. **Performance Degradation**: Noticeable slowdown or inefficiency in the performance of infrastructure services.
3. **Security Breaches**: Any unauthorized access, data breaches, or vulnerabilities that compromise the integrity, confidentiality, or availability of the infrastructure.
4. **Operational Failures**: Failures in deployment pipelines, configuration management, or automated processes that hinder the normal operations of partner teams.
5. **Resource Exhaustion**: Over-utilization or exhaustion of compute, storage, or network resources leading to degraded service.
6. **Unexpected Behavior**: Any anomaly or unexpected behavior in infrastructure services that impacts the development, testing, or deployment activities of partner teams.

13:44:35 From Bianca Rivera Alvelo to Everyone: Replying to "Root Cause Label and..."

*Include instructions for escalating LHDI incidents (?)

13:44:55 From Kyle Brost to Everyone: Reacted to "Include instruction..." with ➕
13:44:57 From Lisa Chung to Everyone: Reacted to "Include instruction..." with ➕
13:47:31 From Megan Hicks to Everyone: Maybe under each incident type we have a defined escalation path
13:47:53 From Megan Hicks to Everyone: or as defined as it can be for iteration version 1
13:52:24 From Bianca Rivera Alvelo to Everyone: Replying to "Incident Response Pr..."

Step 3: SEV 3&4 - as soon as we can prioritize

13:55:10 From Bianca Rivera Alvelo to Everyone: Replying to "Incident Response Pr..."

*communicate SLA of 30min for Step 0: Acknowledge for partner team comms
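A 30-minute acknowledgment SLA like the one mentioned above could be checked with something like the sketch below. The function names and the idea of comparing two timestamps are assumptions made for illustration; the chat only states the 30-minute figure for "Step 0: Acknowledge".

```python
from datetime import datetime, timedelta

# 30-minute acknowledgment SLA, as stated in the chat for Step 0.
ACK_SLA = timedelta(minutes=30)


def ack_deadline(reported_at: datetime, sla: timedelta = ACK_SLA) -> datetime:
    """Latest time by which the report should be acknowledged."""
    return reported_at + sla


def ack_within_sla(reported_at: datetime, acknowledged_at: datetime,
                   sla: timedelta = ACK_SLA) -> bool:
    """True if the acknowledgment arrived at or before the deadline."""
    return acknowledged_at <= ack_deadline(reported_at, sla)
```

Both timestamps should use the same time zone convention (for example, both timezone-aware UTC), since Python refuses to compare naive and aware datetimes.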

13:56:23 From Berni Xiong (she/her) to Everyone: Gotta hop to an EE Team ceremony here shortly, made Bianca the host for now. Great discussion, thanks all!

lisac commented 1 month ago

Initial thoughts on demo video... I'm not committed to this, but open to doing a dry run with it on my own so we can get a baseline.

questions

options

zoom recording, Slack recording, quicktime

script

  1. screen shows channel #benefits-vro-support
  2. cursor moves to the top section with the Incident Report link
  3. form is populated with these values:
    • name: [whoever is running the demo] (drawback: this won't be "evergreen")
    • team: VRO
    • short title: merge jobs are not being processed
    • describe the problem: Typically the app processes 5-10 merge jobs per hour. Since this morning, we're unable to find indication that the app has processed any merge jobs.
    • explain how you discovered this monitor: We monitor the number of merge jobs that are submitted and the number that are subsequently processed, and we have set up an alert if the volume is below a certain threshold. Here is a link to the monitor: https://monitor-url
    • Is your team's application non-functional? yes
    • Is this problem blocking your team? yes
    • Add any files or images: [upload a file named monitor-snapshot.png]
  4. Submit button is clicked
  5. resulting post is shown
  6. GitHub issue is shown
  7. while keeping focus on the post, in the background simulate a VRO engineer responding with the emoji and follow-up posts, continuing through completion with the Incident Report link; these updates will be visible on the post

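
The monitor described in the sample form values (an alert when the volume of processed merge jobs falls below a threshold) could look roughly like the sketch below. It is purely illustrative: the function name is hypothetical, and the default threshold of 5 is an assumption taken from the low end of the "5-10 merge jobs per hour" figure in the script, not a real configured value.

```python
def should_alert(processed_last_hour: int, min_expected: int = 5) -> bool:
    """Return True when fewer merge jobs were processed than expected.

    `min_expected` is an assumed threshold; the script only says the alert
    fires when the volume is "below a certain threshold".
    """
    return processed_last_hour < min_expected
```

In the demo scenario, where no merge jobs have been processed since the morning, `should_alert(0)` would return `True`.
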
bianca-rivera commented 1 month ago

@lisac +1 I think we should have two, one for partner teams and another for the VRO team. Documenting the responsibilities on both sides will ensure we follow consistent processes, especially as the integrations of Slack with PagerDuty and GitHub are new features that are not as familiar to everyone. I can simulate the partner team's side. Your script is perfect; my thoughts were exactly the same, so thank you for writing it down.

(Ask: let's move these comments to #3218, since that is the ticket to track the work of developing the demo materials)

lisac commented 1 month ago

applied wiki page revisions via 7c677ef.

snapshot of changes to the draft from what was previewed to team on 8/5: https://github.com/department-of-veterans-affairs/abd-vro/wiki/_draft_-Incident-Response/_compare/294bce881494da2fde26e3dc4294f26c4a8fc08b...df928422c8c78fb00dfefc39230ed99b68667597

BerniXiongA6 commented 1 month ago

Hi @lisac, is this considered done, and can we close this yet?