department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Document Spool File Debugging & Cleanup Process #34451

Open LindseySaari opened 2 years ago

LindseySaari commented 2 years ago

Issue Description

See Related Issue. Document how to debug and clean up records related to the spool file job when data is corrupted, as seen on 12/21 for the 12/17 & 12/20 spool files.

  1. Document the scenario that spans more than one day (this requires multiple spool files).
  2. Document the duplicate record scenario and how to filter out which records should be processed.
  3. Document how to filter "legit" records from "duplicate" records.
  4. Document how this happened in the first place (the validation code).
  5. Clean up the code so this scenario doesn't happen in the future. We run the record re-encryption yearly, so this will likely happen again. (The validation on the SavedClaim::EducationBenefits model)

Slack thread with EDU team; VSP internal discussion


Tasks

Acceptance Criteria

LindseySaari commented 2 years ago

Spool File / EducationBenefitsClaim Reprocessing

Background

The CreateDailySpoolFiles job runs daily to gather EducationBenefitsClaim records to be sent for processing. The records are formatted and grouped by region. The formatted files are then sent via SFTP to a TIMS team server for further processing.
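The grouping step can be pictured with a minimal plain-Ruby sketch. The `Claim` struct, the region names, and the `group_for_spool` helper are hypothetical stand-ins for illustration; this is not the real vets-api formatter or SFTP code.

```ruby
# Hypothetical stand-in for an EducationBenefitsClaim row; not the real model.
Claim = Struct.new(:id, :region, :processed_at, keyword_init: true)

# Select claims that have not been processed yet and batch them by region.
def group_for_spool(claims)
  claims.select { |c| c.processed_at.nil? }.group_by(&:region)
end

claims = [
  Claim.new(id: 1, region: :eastern, processed_at: nil),
  Claim.new(id: 2, region: :western, processed_at: nil),
  Claim.new(id: 3, region: :eastern, processed_at: Time.now) # already sent
]

group_for_spool(claims).each do |region, rows|
  puts "#{region}: #{rows.map(&:id).join(', ')}"
end
```

Each region's batch would then become one spool file; the already-processed claim is left out.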

What went wrong

At the beginning of the week of 12/21, the EDU & TIMS team reported that the spool files received on their end included claims that had already been processed.

On 12/17/21 & 12/20/21, the KmsEncryptionVerificationJob was run to verify record decryption via KMS (so that we could fully remove the lockbox key). While it was running, the job "marked" some of the EducationBenefitsClaim records for reprocessing.

Under the hood, there's a daily Sidekiq job that cleans up EducationBenefitsClaim records older than 2 months. EducationBenefitsClaim records are the records processed by the spool file job, and each has a belongs_to relationship to SavedClaim::EducationBenefits.

When we ran the KmsEncryptionVerificationJob to update the verified_decryptable_at date, it recreated the EducationBenefitsClaim records belonging to SavedClaim::EducationBenefits (records that had already been cleaned up) when this validation ran. The newly created records had a processed_at date of nil, which is what caused the CreateDailySpoolFiles job to pick up the illegitimate records and process them.
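The sequence above can be modeled in a few lines of plain Ruby. This is only a sketch of the failure mode: `ParentClaim`, `ChildClaim`, and the rebuild-on-save behavior are hypothetical stand-ins, assumed for illustration, and not the actual SavedClaim::EducationBenefits code.

```ruby
# Hypothetical stand-in for EducationBenefitsClaim.
ChildClaim = Struct.new(:processed_at)

# Hypothetical stand-in for SavedClaim::EducationBenefits.
class ParentClaim
  attr_accessor :child, :verified_decryptable_at

  def initialize(child)
    @child = child
  end

  # Assumed behavior: saving rebuilds a missing child record.
  def save
    @child ||= ChildClaim.new(nil) # a rebuilt child starts unprocessed
    true
  end
end

parent = ParentClaim.new(ChildClaim.new(Time.now)) # child already sent in a spool file
parent.child = nil                                 # daily cleanup removed the old child
parent.verified_decryptable_at = Time.now          # verification job touches the parent
parent.save

puts parent.child.processed_at.nil? # prints "true": the spool job would pick it up again
```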

In order to solve this issue, we had to query for the legitimate records that should have been processed on 12/17 and 12/20 and separate them into their respective spool files.

We used the updated_at date on the EducationBenefitsClaim records to find the legitimate records for each day. We then set the processed_at date to nil, so that they'd be picked up/marked for reprocessing during the next (manual) spool file run.

Once the above was complete, we ran the CreateDailySpoolFiles job manually to send the correct spool files over to the EDU/TIMS team.

How to mark for reprocessing

If the incorrect records are collected and placed into the spool file, do the following:

To find legitimate records for a given day

TODO: verified_decryptable_at probably shouldn't be here because it was specific to our use case

records = EducationBenefitsClaim.where(updated_at: DateTime.new(2021, 12, 17).all_day).joins(:saved_claim).where("saved_claims.form_id IN (?) AND saved_claims.verified_decryptable_at IS NOT NULL AND saved_claims.created_at IS NOT NULL", LIVE_FORM_TYPES)

Note: You may want to add .count to the query to check that the number of records is legitimate (typically between 1000 and 3000 records per day).

Update the processed_at date:

records.update_all(processed_at: nil)
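Because update_all is applied in one shot, it may be worth checking the count before resetting anything. The sketch below shows that idea in plain Ruby, without Rails: the `Row` struct and the `rows_for_day` and `mark_for_reprocessing` helpers are hypothetical, and the default range simply reuses the typical 1000-3000 volume mentioned above.

```ruby
require 'date'

# Hypothetical stand-in for an EducationBenefitsClaim row.
Row = Struct.new(:id, :updated_at, :processed_at, keyword_init: true)

EXPECTED_RANGE = (1000..3000) # typical daily volume noted above

# Rows whose updated_at falls on the given calendar day.
def rows_for_day(rows, day)
  rows.select { |r| r.updated_at.to_date == day }
end

# Reset processed_at only when the match count looks plausible.
def mark_for_reprocessing(rows, day, expected: EXPECTED_RANGE)
  matches = rows_for_day(rows, day)
  unless expected.cover?(matches.size)
    raise ArgumentError, "#{matches.size} rows for #{day} is outside #{expected}"
  end
  matches.each { |r| r.processed_at = nil }
  matches.size
end

rows = [
  Row.new(id: 1, updated_at: Time.new(2021, 12, 17, 9, 0, 0), processed_at: Time.new(2021, 12, 17, 10, 0, 0)),
  Row.new(id: 2, updated_at: Time.new(2021, 12, 20, 9, 0, 0), processed_at: Time.new(2021, 12, 20, 10, 0, 0))
]

# A tiny range is passed here only so the two-row demo passes the guard.
puts mark_for_reprocessing(rows, Date.new(2021, 12, 17), expected: (1..3000))
```

With the default range, the same call would raise instead of touching any rows, which is the point of the guard.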

Edit the spool file code

Open a shell in the container

docker exec -it vets-api bash
vi app/workers/education_form/create_daily_spool_files.rb

Edit line 35 to the following:

records = EducationBenefitsClaim.where.not(created_at: DateTime.new(YOUR_YEAR, YOUR_MONTH, YOUR_DAY).all_day)

Update the spool file name here to the current date with 000000 as the timestamp

Comment out the if statement on lines 91-93 and line 117

After the above, re-run the spool file job: CreateDailySpoolFiles.new.perform

If everything checks out, undo your changes to app/workers/education_form/create_daily_spool_files.rb

Verify with TIMS team

Verify that the TIMS team received the spool files. We believe that the files land on an SFTP server that developers have access to before the processing center can see them.

You can also check the #vsa-education-logs Slack channel to verify that the spool files have been written. Any errors should also be recorded there.

Future Iterations

We may want the CreateDailySpoolFiles job to accept a date argument so that we can filter by date if this scenario happens again.
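One possible shape for that change is sketched below in plain Ruby. Everything here is hypothetical: `SpoolFileJobSketch` and `Record` are illustrative names, filtering on updated_at is an assumption borrowed from the manual query above, and the real job would use ActiveRecord scopes rather than in-memory arrays.

```ruby
require 'date'

# Hypothetical stand-in for an EducationBenefitsClaim row.
Record = Struct.new(:id, :updated_at, :processed_at, keyword_init: true)

# Hypothetical sketch of a spool job that takes an optional date.
class SpoolFileJobSketch
  def initialize(records)
    @records = records
  end

  # date: nil for the normal daily run, or a Date / ISO 8601 string
  # (a string is convenient because background job arguments are
  # usually serialized as simple types).
  def perform(date = nil)
    day = date && Date.parse(date.to_s)
    pending = @records.select { |r| r.processed_at.nil? }
    pending = pending.select { |r| r.updated_at.to_date == day } if day
    pending.map(&:id)
  end
end

records = [
  Record.new(id: 1, updated_at: Time.new(2021, 12, 17, 8, 0, 0), processed_at: nil),
  Record.new(id: 2, updated_at: Time.new(2021, 12, 20, 8, 0, 0), processed_at: nil)
]
job = SpoolFileJobSketch.new(records)
p job.perform               # [1, 2]: every unprocessed record
p job.perform('2021-12-17') # [1]: only records updated on that day
```

Keeping the argument optional means the scheduled daily run would behave as it does today, while a manual rerun could target a single day without editing the worker file.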

LindseySaari commented 2 years ago

@rileyanderson @thilton-oddball Could you please review the docs I wrote up here? Feel free to add/edit as you see fit. Thanks!