department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0
96 stars 70 forks source link

Forms API: improve nightly update job to leave data in previous day's state if interrupted #13241

Open wesrowe opened 1 year ago

wesrowe commented 1 year ago

Description

User story

AS A Veteran I WANT any failure or interruption of the nightly sidekiq job to leave data for each non-updated form the same as yesterday SO THAT I'm not told forms don't exist erroneously.

The nightly sidekiq job seems to be engineered such that it flags numerous PDFs as invalidPDF: true if it is stopped before completion. We need to investigate why this is possible – i.e., naively, one would expect it to check forms and update the invalidPDF field one form at a time, such that interruption would only affect one form.

Engineering notes / background

Analytics considerations

Quality / testing notes

Acceptance criteria

wesrowe commented 1 year ago

@dsasser, would you be interested in looking at the level of effort for this ticket? DaveC said first step is "could we do this, and how easily?", and then we could ask Platform team for permission. Pre-refinement here would probably mean looking at the code to see how it works, why it's overwriting yesterday's validPDF field, etc.

It's not a high priority for DaveC, so definitely only do this if you feel like it and it's pretty easy.

dsasser commented 1 year ago

@wesrowe as this is Ruby on Rails code, I'm not ideally equipped to answer this question with confidence. That said, some thoughts:

      if form.url.present? && (content = URI.parse(form.url).open)
        form.sha256 = get_sha256(content)
        form.valid_pdf = true
      else

In the first line, URI.parse(form.url).open performs an http get request on the Form PDF url, and fills the content variable with the response. If this fails for any reason, the following is performed:

      message = "#{self.class.name} failed to get SHA-256 hash from form"
      form_data = { form_name: form.form_name, form_url: form.url }.to_s

      Rails.logger.error("#{message}: #{form_data}", e)
      VAForms::Slack::Messenger.new({ class: self.class.name, message:, exception: e, form_data: }).notify!

      form.valid_pdf = false

Nightly Process Flow

Sidekiq nightly forms job runs at mightnight every night, calling VAForms::FormReloader

VAForms::FormReloader performs the graphql query, and for each form returned calls build_and_save_form. build_and_save_form calls update_sha256, which is where the form pdf fetch is attempted. If this fails, the validPdf flag will be set to TRUE.

Should the FormReloader swagger script be halted prematurely however, the script would not process any further forms, and therefore unprocessed forms will have the previous day's validPdf value set.

Further, I'm not sure how to improve this, as we can't know based solely off of the fetch response if the request was interrupted by networking glitch, or the form is really not present. In any case, if the fetch was not successful, Rails attempts to post a Slack message (slack channel destination is being investigated), and a Rails error is logged to the Rails logger service with the message "failed to get SHA-256 hash from form".

wesrowe commented 1 year ago

It sounds like you're confident the overall structure of this checking is correct.

I was assuming that the problem is an interrupted job based on the slack threads with the Lighthouse team as they troubleshot. But I don't trust my read of that thread. I thought they ruled out a network glitch though?

dsasser commented 1 year ago

It sounds like you're confident the overall structure of this checking is correct.

Yes fairly confident.

I was assuming that the problem is an interrupted job based on the slack threads with the Lighthouse team as they troubleshot. But I don't trust my read of that thread. I thought they ruled out a network glitch though?

I am basing this on conversations I had with Lighthouse developers who were troubleshooting the issue in a Zoom call. Our conclusion was that there was a networking glitch during the nightly forms processing which caused forms status to change.

wesrowe commented 1 year ago

Re the glitch – after one night it could have been a glitch. But after multiple nights of problems, I think the conclusion moved on to EKS/sidekiq interruptions. @jilladams, can you get us to certainty here?

jilladams commented 1 year ago

I'm not certain on my own. My next step would be to ask Kristen Brown, whose team was working on the Sidekiq issue with plans to further stabilize it, in that original thread with them, if that seems right? Also: if we're not totally sure about the Ruby code, time to bring in @jtmst .

wesrowe commented 1 year ago

Matt Kelly dropped into the PW slack channel with an update today:

Hail friends! I wanted to drop y'all an update on Banana Peels work to identify and ameliorate long running jobs. We currently have a :git-pull-request: up with work that will break our most egregious Sidekiq worker up into more wieldy and atomic jobs. This will also provide the added benefit of increasing the granularity of our optics into job duration. Once these are up in prod, we will be patching up the :slack: notifications as well. I'll keep dropping breadcrumb updates here for anyone following the matter.

wesrowe commented 1 year ago

Iceboxing for now, pending any reason to thaw it.