Open zachgoldfine44 opened 1 year ago
Adrian will start with copy-paste of platform process and share out for review.
< redacted older version >
Next up: Adrian to create word doc and share out for async review.
Next step: at least Steve, Zach, and Patrick V (and others) review doc
Adrian: A few more folks looked through and commented. Seems like we have something to start with.
Biggest point of discussion: what does it mean for an app team to declare an incident?
I've edited the doc for clarification on incident declaration, at least for our initial pass.
Adrian: Awaiting additional feedback, specifically regarding the priority matrix. Steve: The 526 team has circulated a draft of their process for handling queue exhaustion. Can we plug what they've created into this process? Adrian: That might be too team/code specific, whereas this document will be a higher-level document to be used by Officers of the Watch. Zach: The document addresses the purpose of the 526 doc.
@va-albers will share the 526 team's draft for awareness and use by other teams.
Timeline for next review: Next Friday
The doc looks great. Per @zachgoldfine44's comment, I would like health and benefits to follow the same process for declaring an incident, for sanity's sake.
I reviewed and left several comments, mainly around the area of "when do we trigger a CPI?" and "what looks different here if the MIM team is already involved?" Overall, within the scope of the OCTO/VA.gov team, this looks like a solid process.
For discussion today:
Adrian: Should we use the datadog incident functionality for "declaring incidents"?
From Patrick: Should we declare CPIs/HPIs for external service problems? Charles: Yes! Our documentation (Adrian's draft) should explain when to do this and how. Chris: Propose that the standard operating procedure we come up with leans towards more transparency. @BillChapmanUSDS will ask for guidance from ECC on how to determine a major incident and when to trigger the process.
Meeting scheduled Thursday, 11/9 @ 10am ET.
Meeting on Thursday is a kick-off. (Bill will attend)
Getting guidance from ECC on what a major incident is and how to integrate with the big VA MIM process.
Decision made: We will use Datadog service incidents.
Check back in Wednesday 11/22
@acrollet Just an FYI - I added a new item to the success criteria for announcing the new process at the Team of Teams meeting. I know we had talked about making this its own separate issue, but I thought it might fit more cleanly into this issue. If you think we need to break this out into its own ticket, just let me know.
I've submitted the document for application teams to be added to the platform documentation site, issue here: https://github.com/department-of-veterans-affairs/va.gov-team/issues/72087
Implementation ticket here: https://github.com/department-of-veterans-affairs/va.gov-team/issues/72111
The doc will be deployed this afternoon.
Charles to announce at the upcoming State of OCTO event and at Team of Teams. Rescheduling in process.
CY success criterion 3
Background
How do we set clear expectations for people (gov’t employees, contract teams, and perhaps others) regarding load testing, monitoring, and acting when something is behaving outside the bounds of normal?
Relevant Slack conversation
Definition of Done
Draft process doc
https://dvagov-my.sharepoint.com/:w:/g/personal/adrian_rollett_va_gov/EY-JhFIzFOFOkBSvo5QE81ABtgVMxunyeS5f03ebynT49w?e=UDokDS