Open raywangoctova opened 2 years ago
Per discussion at the onsite workshop for 2023 platform objectives, Cory indicated that he would follow up on this project request for the next steps.
Per request:
There is some decent information around the request, and the general notion of Platform being compliant should add some clear, concrete boundaries to work within. As this ticket is currently written, there are a lot of unknowns from a Platform team perspective, so a fair amount of discovery will be required to flesh out the specifics, get educated, and get detailed on what items Platform needs and what actions they should take to get there. This feels like a good-sized effort and needs very clear MVPs identified and defined. I believe the AC is a bit high level and needs more added to it. For example, I would assume Platform documentation would need to be updated and communicated to Platform and VFS teams. I would also say that conducting a tabletop exercise shouldn't be an AC, but making it part of the process, and possibly even scheduling the tabletop, could be (we shouldn't hold the item's success to a future event that there is no control over).
So to me, step #1 is to put a sprint of Discovery into play and work to iron out and harden the specifics, deliverables, etc. The output should be a clear path forward with very clear and specific AC and small, bite-size, testable deliveries.
I don't really understand this whole process, but I do think I understand what the AC should be. I feel like a fair bit of discovery would happen along the way as the playbook is updated.
This process is being formulated as part of the Code Yellow, in collaboration with the ECC. The Datadog monitor and ServiceNow integration is in place, so Datadog alerts with priority P1 (SEV-1 classification in SNOW) are available in ECC. There are still a few questions and process items pending discussion and refinement, in particular how many of these P1 alerts are to be handled through the CPI/HPI MIM process of OIT.
Here is the doc that @acrollet has put together - https://dvagov-my.sharepoint.com/:w:/g/personal/adrian_rollett_va_gov/EY-JhFIzFOFOkBSvo5QE81ABtgVMxunyeS5f03ebynT49w?e=DX1Con
Current status: We have an initial POC to integrate Datadog monitors into SNOW for usage by ECC Event Management. The next follow-up steps needed are:
- Update alert message with required response action (contact Platform on-call? declare MIM?)
- Configure alerting integration for all mission critical monitors (all P1/P2?)
- Run tabletop exercise with ECO/EM/OTG to confirm successful incident response (make sure everyone has necessary access & understands procedures)
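To make the first two follow-up steps concrete, here is a minimal Python sketch of the selection and message-update logic. It does not call the Datadog or ServiceNow APIs. The `Monitor` class, the function names, and the sample monitor names are all hypothetical. The only mapping taken from this thread is P1 to SEV-1; whether P2 should also be forwarded is still an open question, so the cutoff is left as a parameter.

```python
from dataclasses import dataclass
from typing import List, Optional

# Only P1 -> SEV-1 is stated in this thread; other mappings are undecided.
PRIORITY_TO_SNOW_SEVERITY = {1: "SEV-1"}


@dataclass
class Monitor:
    """Hypothetical stand-in for a Datadog monitor definition."""
    name: str
    priority: int  # 1 is the highest priority (P1)
    message: str


def select_for_ecc(monitors: List[Monitor], max_priority: int = 1) -> List[Monitor]:
    """Keep monitors at or above the cutoff (lower number = higher priority)."""
    return [m for m in monitors if m.priority <= max_priority]


def snow_severity(monitor: Monitor) -> Optional[str]:
    """Return the SNOW classification, or None if no mapping has been agreed."""
    return PRIORITY_TO_SNOW_SEVERITY.get(monitor.priority)


def with_response_action(monitor: Monitor, action: str) -> Monitor:
    """Return a copy of the monitor with the response action appended to its message."""
    new_message = f"{monitor.message}\nResponse action: {action}"
    return Monitor(monitor.name, monitor.priority, new_message)


if __name__ == "__main__":
    monitors = [
        Monitor("login-errors", 1, "Login error rate is high."),  # sample data
        Monitor("slow-page", 3, "Page latency is elevated."),     # sample data
    ]
    for m in select_for_ecc(monitors):
        updated = with_response_action(m, "contact Platform on-call")
        print(updated.name, snow_severity(updated))
        print(updated.message)
```

Raising `max_priority` to 2 would include P2 monitors once that decision is made. The response action text itself depends on the outcome of the on-call versus MIM discussion.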
@AparnaNittalaUSDS to check what has been done and what remains with the OIT integration, and to outline items like documentation, tagging standards, etc.
Can anyone comment on where we are with this effort? @JeffKeeneVAGov @BillChapmanUSDS
LOE
Large -- there are a lot of unknowns, and SNOW seems like a very complex system that relies on information being input in very specific ways (based on our call and demo with a power user of SNOW). We will have to start with discovery that gives us more information about how to integrate. We also need some information from OCTODE about when escalations are necessary.
Problem Statement
User Impact
This project will impact the platform team's overall response to any major incidents on the VA.gov platform.
Where was this problem reported?
How well do we understand the problem?
From Rod Kearns
Additional discovery is needed with the IPM and Platform teams about process integration, dependencies, and handoffs between the MIM process and VA.gov Incident Response Playbook.
Current documentation:
What is the acceptance criteria?
How should we measure success?
TODOs