Open wesrowe opened 1 year ago
@jilladams, do you agree that we've determined that this Sidekiq monitoring is not our problem to solve?
I think so.
Kristen Brown: https://dsva.slack.com/archives/CBU0KDSB1/p1681764035317609?thread_ts=1680891382.607399&cid=CBU0KDSB1
I'm a member of the team that supports the Lighthouse Forms API. Our PM is @Michael Hobson, and our Technical Lead is @Matt Kelly. Future improvements to the nightly job would be the domain of my team rather than the Platform. (We only needed the Platform's help in this issue because it was a problem stemming from an in-progress infrastructure migration.) Our engineers met today to make a technical plan for improvements to the job. Next we'll create tickets that will be prioritized and worked in our team.
Drew Fisher: https://dsva.slack.com/archives/CBU0KDSB1/p1682463423128589?thread_ts=1680891382.607399&cid=CBU0KDSB1
https://vajira.max.gov/browse/API-25991, https://vajira.max.gov/browse/API-25993 and https://vajira.max.gov/browse/API-25996 are all related to improvements we've got planned:
API-25991: Add Task Duration Logging to Nightly FormReloader Job
API-25993: Split FormReloader Job in Multiple Smaller Jobs
API-25996: Add Slack Alerts for FormReloader Failure Trend Tracking
Matt Kelly posted an update in the PW Slack today saying that their team is taking this on:
Hail friends! I wanted to drop y'all an update on Banana Peels work to identify and ameliorate long running jobs. We currently have a PR up with work that will break our most egregious Sidekiq worker up into more wieldy and atomic jobs. This will also provide the added benefit of increasing the granularity of our optics into job duration. Once these are up in prod, we will be patching up the Slack notifications as well. I'll keep dropping breadcrumb updates here for anyone following the matter.
Description
User story
AS A PM/PO I WANT to be alerted when the nightly Sidekiq job fails to correctly refresh the form endpoint on vets-api/Lighthouse SO THAT we can quickly take manual steps to recover.
Engineering notes / background
The migration of Sidekiq jobs to EKS on 4/6 may be the culprit this time.
Slack thread with a Platform team (Kristen Brown) about what's going on.
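For anyone following along, here is a rough sketch of how the ideas in the planned tickets (per-task duration logging and an alert on failure) might fit together. This is plain Ruby, not code from vets-api: `TaskResult`, `run_timed`, the `notifier` lambda, and the task names are all hypothetical, and a real implementation would post to Slack rather than append to an array.

```ruby
# Hypothetical sketch only: none of these names come from vets-api or Sidekiq.
TaskResult = Struct.new(:name, :success, :seconds, :error)

# Runs the given block, measures its wall-clock duration with a monotonic
# clock, and calls the notifier with a message if the block raises.
def run_timed(name, notifier:)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  seconds = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  TaskResult.new(name, true, seconds, nil)
rescue StandardError => e
  seconds = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  notifier.call("#{name} failed after #{seconds.round(2)}s: #{e.message}")
  TaskResult.new(name, false, seconds, e)
end

# Stand-in for a Slack notifier (assumption): collects alert messages.
alerts = []
notifier = ->(message) { alerts << message }

# Two smaller tasks instead of one large job; the second one fails.
first_result = run_timed("reload_forms_batch_1", notifier: notifier) { :done }
second_result = run_timed("reload_forms_batch_2", notifier: notifier) { raise "upstream timeout" }

puts "#{first_result.name}: success=#{first_result.success}"
puts "#{second_result.name}: success=#{second_result.success}"
puts alerts
```

Splitting the work into smaller timed tasks means each duration and failure is attributed to a specific step, which is the granularity the planned tickets are after.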
Analytics considerations
Quality / testing notes
Acceptance criteria