[bug spike] SC upload failures (1/15/24)

anniebtran commented 5 months ago

What happened?

On Jan 15th, we got an alert in our notifications channel about some evidence upload jobs (DecisionReview::SubmitUpload) failing permanently (retries exhausted). We were able to find the job ids for the evidence uploads that failed and some logs related to the failures (linked below).

Upon investigation, we identified the uploads that were not submitted to Lighthouse. We should be able to resubmit them by re-running the jobs for the AppealSubmissionUpload records that are missing Lighthouse ID references. These records belong to the user's AppealSubmission (the user has only one; it is for a Supplemental Claim (SC) and is from Jan 15).
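
For reference, the resubmission idea above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual vets-api code: `UploadRecord`, the `lighthouse_upload_id` field name, and the `enqueue` callable are all hypothetical stand-ins for the AppealSubmissionUpload record and the DecisionReview::SubmitUpload job.

```ruby
# Hypothetical stand-in for an AppealSubmissionUpload record.
# The field names are assumptions for illustration, not the real schema.
UploadRecord = Struct.new(:id, :lighthouse_upload_id, keyword_init: true)

# Returns the uploads that never received a Lighthouse ID,
# i.e. the ones whose job failed before the record was updated.
def uploads_missing_lighthouse_id(uploads)
  uploads.select { |upload| upload.lighthouse_upload_id.nil? }
end

# Re-queues each failed upload and returns the ids that were queued.
# `enqueue` is any callable that accepts an upload id; in a real console
# session it would presumably wrap the job's enqueue call (an assumption).
def requeue_failed_uploads(uploads, enqueue)
  uploads_missing_lighthouse_id(uploads).map do |upload|
    enqueue.call(upload.id)
    upload.id
  end
end

# Made-up sample data: one upload that succeeded, two that did not.
uploads = [
  UploadRecord.new(id: 1, lighthouse_upload_id: "abc-123"),
  UploadRecord.new(id: 2, lighthouse_upload_id: nil),
  UploadRecord.new(id: 3, lighthouse_upload_id: nil)
]

queued = []
requeued_ids = requeue_failed_uploads(uploads, ->(id) { queued << id })
```

Passing the enqueue step in as a callable keeps the selection logic easy to dry-run before anything is actually queued on production.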

What do we need to know?

What caused the errors, and how can we make them more visible for triaging?

What we've tried in order to debug the error / what we do know

We tried to re-run the job manually for one of the AppealSubmissionUploads, but we were interrupted by a production incident that restarted the pods and may have prevented the job from running properly. We will likely need to queue the job manually again and then finish the remaining uploads.

Action items

[ ] Timebox investigation

Slack threads and background context

Definition of Ready

[ ] Timebox is defined
[ ] Nature of the issue defined

Definition of Done

[ ] Figure out what caused the error
[ ] Determine ways to make the error more easily visible (not silent)

Out of scope

Coding solution

anniebtran commented 5 months ago

Slack thread started with Lighthouse here

What we know after some investigation:

There are AppealSubmissionUploads without LH IDs. This makes sense, because the error happened before we would have updated these records with those IDs.
AppealsApi::EvidenceSubmission and VBADocuments::UploadSubmission records both get created on every job run/retry, so we at least know we're reaching this point in the Lighthouse code. We are still trying to track down whether there are errors that aren't getting bubbled up before the ensure block that calls .close runs.

Update based on conversation with LH in the thread linked above:

The VBADocuments::UploadSubmission records likely have an expired status because we failed to upload the file successfully within the 15 min time limit set for the upload URL
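
To make the expiry reasoning concrete, here is a small sketch of the check. The 15 minute limit comes from the Lighthouse conversation; the constant name, the method name, and the timestamps are made up for illustration and do not reflect how Lighthouse actually computes the expired status.

```ruby
# 15 minute window for the upload URL, expressed in seconds.
UPLOAD_URL_TTL_SECONDS = 15 * 60

# True when more than 15 minutes have passed since the upload URL was issued.
# `created_at` and `now` are Time objects; the names are hypothetical.
def upload_url_expired?(created_at, now = Time.now)
  (now - created_at) > UPLOAD_URL_TTL_SECONDS
end

# Illustrative timestamps only, not taken from the real records.
created_at = Time.utc(2024, 1, 15, 12, 0, 0)

within_window = upload_url_expired?(created_at, Time.utc(2024, 1, 15, 12, 10, 0))
past_window   = upload_url_expired?(created_at, Time.utc(2024, 1, 15, 12, 20, 0))
```

If each retry takes longer than this window to push the file, every attempt would end in an expired record, which would be consistent with what we saw.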

anniebtran commented 5 months ago

Next step: ask the Platform team to check whether the failed upload files exist in S3, to share general info about them (e.g. size), and to look into why they might fail to upload to Lighthouse. They can find the AppealSubmission via the user_uuid in the "Evidence upload to Lighthouse job failures" logs, and from there get the AppealSubmissionUploads with nil Lighthouse IDs to figure out which uploads failed.

anniebtran commented 5 months ago

Update: Eugene got their AWS S3 access back, and we checked whether the files for those failed uploads still existed in the bucket. Unfortunately we didn't see them in there, so they must have gotten deleted somehow 🥴 Not sure what the next steps should be on this.

anniebtran commented 5 months ago

@saderagsdale not sure if you wanna follow up on this, but just wanted to make sure you're aware of this update ^

saderagsdale commented 5 months ago

Sade will add this to the mini-remediation list and share with VBA

saderagsdale commented 4 months ago

Discuss auditing process in bug triage.

anniebtran commented 4 months ago

Just confirming the status of this based on the question from standup — these files were not orphaned because there are AppealSubmissionUpload records that tie the files back to the appeal itself. The issue is that the files don't exist in our S3 bucket (from comment above) so we can't manually try to re-upload to Lighthouse.

anniebtran commented 4 months ago

@saderagsdale I know there was a note on Jan 29 that says you added it to the mini-remediation list — I wasn't sure where that was so just in case, I added the Supplemental Claim UUID for this incident to the Sharepoint file that had the SCs affected by the AWS outage (https://github.com/department-of-veterans-affairs/va.gov-team/issues/72199) and renamed the file to "Supplemental Claims with evidence upload issues"
