department-of-veterans-affairs / va.gov-team


Process when abnormalities detected (success criterion 3 / stretch goal) #68095

Open zachgoldfine44 opened 1 year ago

zachgoldfine44 commented 1 year ago

CY success criterion 3

When issues are detected, the team responsible for the application or feature causing them has a defined and documented process for prioritizing and resolving the issues, and for refining monitoring and alerting to improve the signal-to-noise ratio.

Background

How do we set clear expectations for people (government employees, contract teams, and perhaps others) regarding load testing, monitoring, and acting when something is behaving outside the bounds of normal?

Relevant slack conversation

Definition of Done

Draft process doc

https://dvagov-my.sharepoint.com/:w:/g/personal/adrian_rollett_va_gov/EY-JhFIzFOFOkBSvo5QE81ABtgVMxunyeS5f03ebynT49w?e=UDokDS

zachgoldfine44 commented 1 year ago

Consider involving sitewide

zachgoldfine44 commented 1 year ago

do we need to change contractors for contractor responsibilities?

zachgoldfine44 commented 1 year ago

Adrian will start with a copy-paste of the platform process and share it out for review.

acrollet commented 12 months ago

< redacted older version >

zachgoldfine44 commented 12 months ago

Next up: Adrian to create a Word doc and share it out for async review.

acrollet commented 12 months ago

Word doc is here: https://dvagov-my.sharepoint.com/:w:/g/personal/adrian_rollett_va_gov/EY-JhFIzFOFOkBSvo5QE81ABtgVMxunyeS5f03ebynT49w?e=UDokDS

zachgoldfine44 commented 12 months ago

Next step: at least Steve, Zach, and Patrick V (and others) review doc

zachgoldfine44 commented 11 months ago

Adrian: A few more folks looked through and commented. Seems like we have something to start with.

Biggest point of discussion: what does it mean for an app team to declare an incident?

acrollet commented 11 months ago

I've edited the doc for clarification on incident declaration, at least for our initial pass.

mattpointzxer0 commented 11 months ago

Adrian: Awaiting additional feedback, specifically regarding the priority matrix.

Steve: The 526 team has circulated a draft of their process for handling queue exhaustion. Can we plug what they've created into this process?

Adrian: That might be too team/code specific, whereas this document will be a higher-level document to be used by Officers of the Watch.

Zach: The document addresses the purpose of the 526 doc.

@va-albers will share the 526 team's draft for awareness and use by other teams.

Timeline for next review: Next Friday

lalexanderson-dsva commented 11 months ago

The doc looks great. Per @zachgoldfine44's comment, I would like health and benefits to follow the same process for declaring an incident, for sanity's sake.

patrickvinograd commented 11 months ago

I reviewed and left several comments, mainly around the area of "when do we trigger a CPI?" and "what looks different here if the MIM team is already involved?" Overall, within the scope of the OCTO/VA.gov team, this looks like a solid process.

acrollet commented 11 months ago

For discussion today:

mattpointzxer0 commented 11 months ago

Adrian: Should we use the datadog incident functionality for "declaring incidents"?

From Patrick: Should we declare CPIs/HPIs for external service problems?

Charles: Yes! Our documentation (Adrian's draft) should explain when to do this and how.

Chris: Propose that the standard operating procedure we come up with lean toward more transparency.

@BillChapmanUSDS will ask for guidance from ECC on how to determine a major incident and when to trigger the process.

mattpointzxer0 commented 11 months ago

Meeting scheduled Thursday, 11/9 @ 10am ET.

zachgoldfine44 commented 11 months ago

Meeting on Thursday is a kick-off. (Bill will attend)

Getting guidance from ECC on what a major incident is, and on how to integrate with the big VA MIM process.

Decision made: We will use Datadog service incidents.

zachgoldfine44 commented 11 months ago

Check back in Wednesday 11/22

mheadd commented 10 months ago

@acrollet Just an FYI - I added a new item to the success criteria for announcing the new process at the Team of Teams meeting. I know we had talked about making this its own separate issue, but I thought it might fit more cleanly into this issue. If you think we need to break this out into its own ticket, just let me know.

acrollet commented 10 months ago

I've submitted the document for application teams to be added to the platform documentation site, issue here: https://github.com/department-of-veterans-affairs/va.gov-team/issues/72087

acrollet commented 10 months ago

Implementation ticket here: https://github.com/department-of-veterans-affairs/va.gov-team/issues/72111

acrollet commented 10 months ago

The doc will be deployed this afternoon.

mheadd commented 9 months ago

Charles to announce at the upcoming State of OCTO event and at Team of Teams. Rescheduling is in progress.