department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Create Oncall Documentation #2549

Closed by agile-josiah 1 month ago

agile-josiah commented 5 months ago

User Story

As a VRO engineer, I want oncall responsibilities to be documented so that partner teams that require support and oncall VRO engineers have a reference for how to get and give support.

Work with Bianca on the comms aspect of this, including how we roll out the comms, as well as the cadence for getting partner team feedback.

Acceptance Criteria

  1. Updated documentation defining the oncall cadence and structure in the wiki.
  2. Details of comms rollout TBD
  3. Determine cadence of getting partner team feedback

Not included in this work

Fully defining the oncall process. This should be an MVP that is iterated upon as unknown or new support issues become relevant (e.g., communicating with LHDI or triaging a pod in the k8s cluster).

OUT OF SCOPE: Incident response process, anything to do with monitoring and alerting, disaster recovery plan. All of these are important but not in scope for this.

Notes about work

A spike could help define the SLO/SLA, and research from @bianca-rivera could help clarify what our partner teams need and what VRO engineers are willing to support, so that we can meet somewhere in the middle.

We fully expect that this documentation will need to be fleshed out as we develop an incident response plan, a disaster recovery plan, etc.

Tech Spec reference

bianca-rivera commented 5 months ago

Drafting the following documentation:

1. Service Blueprint: On-Call Use Case - documents the on-call support process by outlining the internal and external (partner team) actions and steps, the technology used, and the information shared (TBD: could also collect our ideas for SLOs/SLAs). Ticket #2612

Collaboration: working session after daily sync (date TBD)

2. Comms Strategy - generate copy for rollout announcement Ticket #2586

Collaboration: async editing and commenting

Note: This documentation is a first iteration and is meant to be edited and updated as changes are made, especially as we develop an incident response plan and disaster recovery plan.

meganhicks commented 1 month ago

We decided to try out our new incident response plan for a sprint and prioritize the communication piece next sprint.