department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Report Major VA.gov Incidents Following VA OIT's Major Incident Management (MIM) Process #48270

Open raywangoctova opened 1 year ago

raywangoctova commented 1 year ago

LOE

Large -- there are a lot of unknowns, and SNOW seems like a very complex system that relies on information being entered in very specific ways (based on our call and demo with a SNOW power user). We will have to start with discovery that gives us more information about how to integrate. We also need some information from OCTODE about when escalations are necessary.

Problem Statement

User Impact

This project will impact the platform team's overall response to any major incidents on the VA.gov platform.

Where was this problem reported?

How well do we understand the problem?

From Rod Kearns

Additional discovery is needed with the IPM and Platform teams about process integration, dependencies, and handoffs between the MIM process and VA.gov Incident Response Playbook.

Current documentation:

What is the acceptance criteria?

How should we measure success?

TODOs

raywangoctova commented 1 year ago

Per discussion at the onsite workshop for 2023 platform objectives, Cory indicated that he would follow up on this project request for the next steps.

little-oddball commented 1 year ago

Per request:

There is some decent information around the request, and the general notion of Platform being compliant should add some clear, concrete boundaries to work within. As this ticket is currently written, there are a lot of unknowns from a Platform team perspective, so a fair amount of discovery will be required to flesh out the specifics, get educated, and get detailed on what items Platform needs and what actions it should take to get there. This feels like a pretty good-size effort and needs very clear MVPs identified and defined. I believe the AC is a bit high level and needs more added to it. For example, I would assume Platform documentation would need to be updated and communicated to Platform and VFS teams. I would also say conducting a tabletop exercise shouldn’t be an AC, but making it part of the process, and possibly even scheduling the tabletop, could be (the item's success shouldn’t depend on a future event that there is no control over).

So to me, step #1 is to put a sprint of Discovery into play and work to iron out and harden the specifics, deliverables, etc. The output should be a clear path forward with very clear and specific AC and small, bite-size, testable deliveries.

jwoodman5 commented 1 year ago

I don't really understand this whole process, but I do think I understand what the A/C should be. I feel like a fair bit of discovery would happen along the way as the playbook is updated.

AparnaNittalaUSDS commented 7 months ago

This process is being formulated as part of the Code Yellow, in collaboration with the ECC. A Datadog monitor and ServiceNow integration is in place so that Datadog alerts with priority P1 (SEV-1 classification in SNOW) are available in ECC. There are still a few questions and process details pending discussion and refinement, including how many of these P1 alerts are to be handled through OIT's CPI/HPI MIM process.

Here is the doc that @acrollet has put together - https://dvagov-my.sharepoint.com/:w:/g/personal/adrian_rollett_va_gov/EY-JhFIzFOFOkBSvo5QE81ABtgVMxunyeS5f03ebynT49w?e=DX1Con

annekerr49 commented 7 months ago

Current status: We have an initial POC to integrate Datadog monitors into SNOW for usage by ECC Event Management. Next follow-up steps needed:

- Update the alert message with the required response action (contact Platform on-call? declare MIM?)
- Configure the alerting integration for all mission-critical monitors (all P1/P2?)
- Run a tabletop exercise with ECO/EM/OTG to confirm successful incident response (make sure everyone has the necessary access and understands the procedures)
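To make the Datadog-to-SNOW routing discussed in this thread concrete, here is a minimal Python sketch. It is not the actual integration. Only the P1 to SEV-1 mapping comes from the thread. The `MonitorAlert` type, the `build_snow_event` helper, and the event field names are hypothetical and do not reflect the real Datadog or ServiceNow schemas. Whether P2 should also be forwarded is still an open question, so it is left unmapped here.

```python
from dataclasses import dataclass
from typing import Optional

# Only P1 -> SEV-1 is stated in the thread. Adding P2 here is an open question.
PRIORITY_TO_SNOW_SEVERITY = {"P1": "SEV-1"}


@dataclass
class MonitorAlert:
    """Hypothetical stand-in for a Datadog monitor alert."""
    monitor_name: str
    priority: str  # e.g. "P1"
    message: str


def build_snow_event(alert: MonitorAlert, response_action: str) -> Optional[dict]:
    """Build a hypothetical SNOW event payload, or None if the priority is not forwarded."""
    severity = PRIORITY_TO_SNOW_SEVERITY.get(alert.priority)
    if severity is None:
        return None  # priority is not mapped, so nothing is sent to ECC
    return {
        "source": "Datadog",
        "monitor": alert.monitor_name,
        "severity": severity,
        # Append the required response action so responders know what to do.
        "description": f"{alert.message}\nRequired response: {response_action}",
    }


if __name__ == "__main__":
    alert = MonitorAlert("example-monitor", "P1", "Error rate above threshold")
    print(build_snow_event(alert, "Contact Platform on-call"))
```

Keeping the mapping in a single table means the decision about which priorities to forward (the "all P1/P2?" question) would be a one-line change once it is settled.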

AparnaNittalaUSDS commented 5 months ago

@AparnaNittalaUSDS to check what has been done and what remains with the OIT integration, and to outline items like documentation, tagging standards, etc.

chrisj-usds commented 2 months ago

Can anyone comment on where we are with this effort? @JeffKeeneVAGov @BillChapmanUSDS