department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Create demo materials for Incident Report workflow #3218

Closed bianca-rivera closed 1 month ago

bianca-rivera commented 2 months ago

Description

As a best practice when implementing new processes, there is a need to create a demo video for our new Incident Report workflow

Why

So that partner teams and new VRO engineers can reference this along with documentation as part of their onboarding materials

AC

lisac commented 1 month ago

video 1 of 2: video for partner teams

(experience of filing a report and getting updates from VRO)

---START RECORDING---

  1. Screen shows Slack window with #benefits-vro-support channel selected
  2. Cursor moves to the bookmark section at the top of the channel and clicks on the Incident Report button
  3. Pop-up window of the Incident Report form is populated with these values:

Name:

[whoever is running the demo] (drawback: this won't be "evergreen")

Team:

VRO

Title:

Merge jobs are not being processed

Description:

Typically the app processes 5-10 merge jobs per hour. Since this morning, we've been unable to find any indication that the app has processed any merge jobs.

Explanation:

We monitor the number of merge jobs that are submitted and the number that are subsequently processed, and we have set up an alert if the volume is below a certain threshold. Here is a link to the monitor: https://monitor-url/

Is your team's application non-functional?

yes

Is this problem blocking your team?

yes

Add any files or images:

[upload a file named monitor-snapshot.png] (candidate file: monitor-snapshot.png)

---PAUSE RECORDING--- (VRO on-call script starts, ~5 seconds)

  4. Click on Submit button.
  5. Form window disappears, and channel #benefits-vro-support is visible again
  6. Workflow automated message of incident report is shown as the latest message
  7. Keep focus on the incident report message (Step 6 of VRO on-call script initiates in the background)
  8. Reply to the thread:

"in case this is relevant: we updated the version of setuptools. that was about two weeks ago."

  9. Keep focus on the incident report message (Step 7 of VRO on-call script initiates in the background)

---PAUSE RECORDING--- (~2 minutes to allow for background activities)

---RESUME RECORDING---

  10. Goes to another channel to simulate time passing (~5 seconds)
  11. Gets notification of reply to thread on incident report message on #benefits-vro-support
  12. Goes to message, and sees an automated message that the incident has been resolved

---END RECORDING---
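
The Explanation field in the demo form mentions a monitor that alerts when the volume of processed merge jobs drops below a threshold. As a rough illustration only, here is a minimal Python sketch of that kind of check. The names (`MergeJobWindow`, `should_alert`, `backlog`) and the default threshold of 5 jobs per window are assumptions made up for this example, loosely based on the "5-10 merge jobs per hour" figure in the demo text; this is not the actual VRO monitor.

```python
from dataclasses import dataclass


@dataclass
class MergeJobWindow:
    """Counts for one monitoring window (hypothetical shape, e.g. one hour)."""
    submitted: int
    processed: int


def should_alert(window: MergeJobWindow, min_processed: int = 5) -> bool:
    """Return True when processed volume is below the assumed threshold."""
    return window.processed < min_processed


def backlog(window: MergeJobWindow) -> int:
    """Jobs submitted in the window that have not been processed yet."""
    return max(window.submitted - window.processed, 0)


# Scenario from the demo report: jobs are submitted but none are processed.
stalled = MergeJobWindow(submitted=8, processed=0)
print(should_alert(stalled), backlog(stalled))
```

A real monitor would likely live in the team's observability tooling rather than in application code; the sketch only shows the comparison the demo report describes.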

lisac commented 1 month ago

video 2 of 2: video for VRO team

(shows how VRO processes an Incident Report)

prep

so that the team knows that any reports created as part of this script are not real incidents:

draft script

---START RECORDING---

  1. start on #benefits-vro
  2. 🗣️ (voiceover): Incident Reports are routed to the #benefits-vro-on-call channel. The reported details are used to create a GitHub Issue and PagerDuty incident. The Incident Report post lists immediate action items.

---PAUSE RECORDING--- (hold for signal from Script 1 actor that they're about to hit the Submit button)

---RESUME RECORDING---

  3. observe Slack notification of new messages in #benefits-vro-on-call. open channel #benefits-vro-on-call.

  4. click on the Incident Report's Actions link to the message in #benefits-vro-support - this brings up the #benefits-vro-support channel

  5. add :eyes: to the message in #benefits-vro-support channel

  6. post to the thread: "We will provide detailed updates on our findings and actions taken every ~30 minutes until the issue is resolved."

  7. on PagerDuty message in Slack (this will be posted immediately above the Incident Report post): Acknowledge the incident. (button click)

  8. 🗣️ (voiceover): use the support channel to communicate with the incident reporter. Beware of oversharing in this channel - we want to minimize "noise". for discussion that might be more granular, use the #benefits-vro-on-call channel.

  9. change to channel #benefits-vro-on-call and post debugging messages. (why: illustrate selectively using #benefits-vro-on-call vs #benefits-vro-support.)

    cpu utilization on rabbitmq is 100%
    it's usually 4%
    in the past few days, we have had only one deployment across VRO apps: last night, for rabbitmq.
    that deployment for rabbitmq was for a minor patch; it's not a big deal if we roll back.
    going to try a rollback to 23eadf3, which was running in prod yesterday.

  10. in the #benefits-vro-support channel message, post as a threaded comment:

    • current status: we have traced the issue to a recent patch to the messaging queue. We are verifying that the issue is contained and will proceed with rolling back the patch.
    • no changes to the estimated resolution time
    • next update in 30 minutes if not sooner

  11. return to channel #benefits-vro-on-call and post:

    deployment to prod-test is complete. logs look good. deploying to prod.
    deployment is complete. logs look improved.
    logging shows that merge jobs are being processed.

  12. Now that the incident has been remediated, post an update to #benefits-vro-support:

    • current status: Merge jobs are now being processed. We believe the app is back to normal operating status and will be monitoring closely.
    • since the last update, we have reverted the recent patch to the messaging queue and have been monitoring the app logs.
    • next update in 30 minutes

  13. 🗣️ (voiceover): As the action items are completed, click the 'Next Step' button.

  14. in channel #benefits-vro-on-call click on the Incident Report's Next Step button

  15. Incident Report displays new message, re: updating the GH Issue

  16. Open the GH Issue

---PAUSE RECORDING---

    • add labels vro-team, RC VRO
    • add the issue to VRO Sprint board
    • assign the engineers
    • add note about the SEV level: "Severity Level: SEV 1. This issue caused a delay in processing merge jobs, which is a core feature of the EE app. No data was corrupted."

---RESUME RECORDING--- (to show updated GH page)

---PAUSE RECORDING---

---RESUME RECORDING--- (to show updated Incident epic)

  18. in channel #benefits-vro-on-call click on the Incident Report's Next Step button
  19. click the link to the Incident Report wiki page
  20. add entry: 07-31-2024 | SEV 1 | VRO Team | Merge jobs are not being processed | #[github issue] and

    <details>
    07-31-2024: VRO Team - Merge jobs are not being processed

    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    </details>

---PAUSE RECORDING--- (flag to Script 1 actor that you're about to click the `Complete` button)

---RESUME RECORDING---

  21. in channel #benefits-vro-on-call click on the Incident Report's `Complete` button

lisac commented 1 month ago

videos have been added to the Process section of the Incident Report wiki page

BerniXiongA6 commented 1 month ago

Hi @lisac is this considered done and can it be closed yet?