Closed bianca-rivera closed 1 month ago
(experience of filing a report and getting updates from VRO)
---START RECORDING---
Name:
[whoever is running the demo] (drawback: this won't be "evergreen")
Team:
VRO
Title:
Merge jobs are not being processed
Description:
Typically the app processes 5-10 merge jobs per hour. Since this morning, we've been unable to find any indication that the app has processed any merge jobs.
Explanation:
We monitor the number of merge jobs that are submitted and the number that are subsequently processed, and we have set up an alert if the volume is below a certain threshold. Here is a link to the monitor: https://monitor-url/
Is your team's application non-functional?
yes
Is this problem blocking your team?
yes
Add any files or images:
[upload a file named monitor-snapshot.png] <-- candidate file
---PAUSE RECORDING--- (VRO on-call script starts, ~5 seconds)
"in case this is relevant: we updated the version of setuptools. that was about two weeks ago."
---PAUSE RECORDING --- (~2 minutes to allow for background activities )
Complete button in the workflow
---RESUME RECORDING---
---END RECORDING---
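The monitor described in the Explanation field of the sample report (compare submitted vs. processed merge jobs, alert when volume drops below a threshold) could be sketched roughly as follows. This is an illustration only, not the actual monitor behind the link: the names `MergeJobWindow` and `should_alert` are hypothetical, and the default threshold of 5 is an assumption taken from the "5-10 merge jobs per hour" figure in the Description.

```python
from dataclasses import dataclass


@dataclass
class MergeJobWindow:
    """Counts for one monitoring window (assumed here to be one hour)."""
    submitted: int
    processed: int


def should_alert(window: MergeJobWindow, min_processed: int = 5) -> bool:
    """Return True when processed volume is below the threshold.

    min_processed is an assumed value; the real threshold is whatever
    the team configured in its monitor.
    """
    return window.processed < min_processed


def backlog(window: MergeJobWindow) -> int:
    """Jobs submitted in the window that were not processed."""
    return max(window.submitted - window.processed, 0)


if __name__ == "__main__":
    # Scenario from the sample report: jobs submitted, none processed.
    stalled = MergeJobWindow(submitted=8, processed=0)
    print(should_alert(stalled), backlog(stalled))
```

In the demo scenario the processed count is zero, so any positive threshold would trigger the alert.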
(shows how VRO processes an Incident Report)
so that the team knows that any reports created as part of this script are not real incidents:
---START RECORDING---
---PAUSE RECORDING --- (hold for signal from Script 1 actor that they're about to hit the Submit button)
---RESUME RECORDING---
observe slack notification of new messages in #benefits-vro-on-call. open channel #benefits-vro-on-call.
click on the Incident Report's Actions link to the message in #benefits-vro-support - this brings up the #benefits-vro-support channel
add :eyes: to the message in #benefits-vro-support channel
post to the thread: "We will provide detailed updates on our findings and actions taken every ~30 minutes until the issue is resolved."
on PagerDuty message in slack (this will be posted immediately above the Incident Report post): Acknowledge the incident. (button click)
🗣️ (voiceover): use the support channel to communicate with the incident reporter. Beware of oversharing in this channel; we want to minimize "noise". For more granular discussion, use the #benefits-vro-on-call channel.
change to channel #benefits-vro-on-call and post debugging messages. (why: illustrate selectively using #benefits-vro-on-call vs #benefits-vro-support. )
cpu utilization on rabbitmq is 100%
it's usually 4%
in the past few days, we have had only one deployment across VRO apps: last night, for rabbitmq.
that deployment for rabbitmq was for a minor patch; it's not a big deal if we roll back.
going to try a rollback to 23eadf3, which was running in prod yesterday.
in the #benefits-vro-support channel message, post as a threaded comment:
- current status: we have traced the issue to a recent patch to the messaging queue. We are verifying that the issue is contained and will proceed with rolling back the patch.
- no changes to the estimated resolution time
- next update in 30 minutes if not sooner
return to channel #benefits-vro-on-call and post:
deployment to prod-test is complete. logs look good. deploying to prod.
deployment is complete. logs look improved.
logging shows that merge jobs are being processed.
Now that the incident has been remediated, post an update to #benefits-vro-support:
- current status: Merge jobs are now being processed. We believe the app is back to normal operating status and will be monitoring closely.
- since the last update, we have reverted the recent patch to the messaging queue and have been monitoring the app logs.
- next update in 30 minutes
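The threaded updates in this script all follow the same shape: current status, an optional note on what changed, and the time of the next update. A minimal sketch of a helper that builds such a message is below. The function name `format_status_update` and its parameters are hypothetical and not part of the Incident Report workflow; it only assembles text to paste into the thread.

```python
from typing import Optional


def format_status_update(
    current_status: str,
    since_last_update: Optional[str] = None,
    next_update_minutes: int = 30,
) -> str:
    """Build a three-part status update as a bulleted plain-text message."""
    lines = [f"- current status: {current_status}"]
    if since_last_update:
        lines.append(f"- since the last update, {since_last_update}")
    lines.append(f"- next update in {next_update_minutes} minutes")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_status_update(
        "Merge jobs are now being processed.",
        since_last_update="we have reverted the recent patch.",
    ))
```

Keeping the structure fixed makes it easier for the incident reporter to scan successive updates in the support channel.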
🗣️ : As the action items are completed, click the 'Next Step' button.
in channel #benefits-vro-on-call click on the Incident Report's Next Step button
Incident Report displays new message, re: updating the GH Issue
Open the GH Issue
---PAUSE RECORDING ---
vro-team, RC VRO
SEV level: "Severity Level: SEV 1. This issue caused a delay in processing merge jobs, which is a core feature of the EE app. No data was corrupted."
---RESUME RECORDING--- (to show updated GH page)
---PAUSE RECORDING ---
Incidents epic
---RESUME RECORDING--- (to show updated Incident epic)
Next Step button
07-31-2024 | SEV 1 | VRO Team | Merge jobs are not being processed | #[github issue]
and
<details>
videos have been added to the Process section of the Incident Report wiki page
Hi @lisac is this considered done and can it be closed yet?
Description
As a best practice when implementing new processes, there is a need to create a demo video for our new Incident Report workflow
Why
So that partner teams and new VRO engineers can reference this along with documentation as part of their onboarding materials
AC