department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Establish monitoring of nightly Find-a-form Sidekiq jobs #13216

Open wesrowe opened 1 year ago

wesrowe commented 1 year ago

Description

User story

AS A PM/PO I WANT to be alerted when the nightly Sidekiq job fails to correctly refresh the form endpoint on vets-api/Lighthouse SO THAT we can quickly take manual steps to recover.

Engineering notes / background

The migration of Sidekiq jobs to EKS on 4/6 may be the culprit this time.

Slack thread with a Platform team (Kristen Brown) about what's going on.

Analytics considerations

Quality / testing notes

Acceptance criteria

Consider

  • Design / Accessibility reviews
  • Collab cycle requirements
  • Device sizes (mobile first)
  • Documentation updates / Change management <- Content model documentation
  • Testing notes

wesrowe commented 1 year ago

@jilladams, do you agree that we've determined that this Sidekiq monitoring is not our problem to solve?

jilladams commented 1 year ago

I think so.

Kristen Brown: https://dsva.slack.com/archives/CBU0KDSB1/p1681764035317609?thread_ts=1680891382.607399&cid=CBU0KDSB1

I'm a member of the team that supports the Lighthouse Forms API. Our PM is @Michael Hobson, and our Technical Lead is @Matt Kelly. Future improvements to the nightly job would be the domain of my team rather than the Platform. (We only needed the Platform's help in this issue because it was a problem stemming from an in-progress infrastructure migration.) Our engineers met today to make a technical plan for improvements to the job. Next we'll create tickets that will be prioritized and worked in our team.

Drew Fisher: https://dsva.slack.com/archives/CBU0KDSB1/p1682463423128589?thread_ts=1680891382.607399&cid=CBU0KDSB1

https://vajira.max.gov/browse/API-25991, https://vajira.max.gov/browse/API-25993 and https://vajira.max.gov/browse/API-25996 are all related to improvements we've got planned:

  • API-25991: Add Task Duration Logging to Nightly FormReloader Job
  • API-25993: Split FormReloader Job in Multiple Smaller Jobs
  • API-25996: Add Slack Alerts for FormReloader Failure Trend Tracking
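For readers unfamiliar with what those three tickets amount to, here is a rough sketch of the general pattern: run the reload as several small steps, log how long each one takes, and send a single alert if any step fails. This is not the Lighthouse team's code. It uses plain Ruby with no Sidekiq or Slack dependency, and every name in it (`run_step`, `run_nightly_reload`, `notifier`, the step names) is hypothetical.

```ruby
# Hypothetical sketch only: plain Ruby, not the actual FormReloader job.

# Run one small step, timing it and capturing any error instead of raising.
def run_step(name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  { step: name, ok: true,
    seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started }
rescue StandardError => e
  { step: name, ok: false, error: e.message,
    seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started }
end

# steps:    Hash of step name => zero-argument lambda doing the work
# notifier: callable that receives one alert message (assumed here to post
#           to Slack in a real setup; that integration is not shown)
# logger:   callable that receives one log line per step
def run_nightly_reload(steps, notifier:, logger: ->(line) { puts line })
  # Each step runs even if an earlier one failed. A real job might need to
  # stop early when later steps depend on earlier ones.
  results = steps.map { |name, work| run_step(name, &work) }

  results.each do |r|
    logger.call(format("%s finished in %.3fs (ok=%s)", r[:step], r[:seconds], r[:ok]))
  end

  failures = results.reject { |r| r[:ok] }
  unless failures.empty?
    summary = failures.map { |r| "#{r[:step]}: #{r[:error]}" }.join("; ")
    notifier.call("Nightly form reload failed: #{summary}")
  end

  results
end
```

Passing the notifier in as a callable keeps the alerting channel swappable, so the same job body could alert a PM/PO channel or be exercised in a test with an in-memory array.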

wesrowe commented 1 year ago

Matt Kelly posted an update in the PW Slack today saying that they are taking this on:

Hail friends! I wanted to drop y'all an update on Banana Peels work to identify and ameliorate long running jobs. We currently have a pull request up with work that will break our most egregious Sidekiq worker up into more wieldy and atomic jobs. This will also provide the added benefit of increasing the granularity of our optics into job duration. Once these are up in prod, we will be patching up the Slack notifications as well. I'll keep dropping breadcrumb updates here for anyone following the matter.