department-of-veterans-affairs / va.gov-team


[bug spike] SC upload failures (1/15/24) #74017

Open anniebtran opened 5 months ago

anniebtran commented 5 months ago

What happened?

On Jan 15th, we got an alert in our notifications channel about some evidence upload jobs (DecisionReview::SubmitUpload) failing permanently (retries exhausted). We found the job IDs for the failed evidence uploads and some logs related to the failures (linked below).

Upon investigation, we were able to identify the uploads that were not submitted to Lighthouse. We should be able to resubmit the evidence by re-running the jobs for the AppealSubmissionUpload records that are missing Lighthouse ID references. Those records belong to the user's AppealSubmission (they only have one; it's for SC and is from Jan 15).
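A minimal sketch of that resubmission idea, in plain Ruby so it runs outside the app. The `Upload` and `Submission` structs are stand-ins for the real records, and the `lighthouse_upload_id` field name and the injected `enqueue` callable are assumptions for illustration, not the confirmed model or job API.

```ruby
# Stand-ins for the AppealSubmission / AppealSubmissionUpload records.
# Field names are assumptions, not the confirmed schema.
Upload = Struct.new(:id, :lighthouse_upload_id, keyword_init: true)
Submission = Struct.new(:user_uuid, :type_of_appeal, :uploads, keyword_init: true)

# Returns the uploads that never received a Lighthouse ID.
def uploads_missing_lighthouse_id(submission)
  submission.uploads.select { |upload| upload.lighthouse_upload_id.nil? }
end

# Queues one job per missing upload and returns the queued upload ids.
# `enqueue` is injected so the sketch runs without Sidekiq; in the app it
# would presumably wrap the upload job's async call (assumption).
def requeue_missing_uploads(submission, enqueue:)
  uploads_missing_lighthouse_id(submission).map do |upload|
    enqueue.call(upload.id)
    upload.id
  end
end

submission = Submission.new(
  user_uuid: 'example-user-uuid', # hypothetical value
  type_of_appeal: 'SC',
  uploads: [
    Upload.new(id: 1, lighthouse_upload_id: 'abc-123'),
    Upload.new(id: 2, lighthouse_upload_id: nil),
    Upload.new(id: 3, lighthouse_upload_id: nil)
  ]
)

queued = []
requeued_ids = requeue_missing_uploads(submission, enqueue: ->(id) { queued << id })
puts "Requeued upload ids: #{requeued_ids.inspect}" # prints "Requeued upload ids: [2, 3]"
```

Filtering on the missing Lighthouse ID, rather than on the failed job IDs, keeps a rerun from touching uploads that already succeeded.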

What do we need to know?

What caused the errors, and how can we make these errors more easily visible for triaging?

What we've tried in order to debug the error / what we do know

We tried to re-run the job manually for one of the AppealSubmissionUploads, but we were interrupted by a production incident that restarted the pods and may have kept the job from running properly. We will likely need to manually queue up the job again and then finish the remaining uploads.

Action items

Slack threads and background context

Definition of Ready

Definition of Done

Out of scope

anniebtran commented 5 months ago

Slack thread started with Lighthouse here

What we know after some investigation:

Update based on conversation with LH in the thread linked above:

anniebtran commented 5 months ago

Next step: ask the Platform team to check whether the failed upload files exist in S3, gather general info about them (e.g. size), and find out why they might have failed to upload to Lighthouse. They can find the AppealSubmission via the user_uuid in the "Evidence upload to Lighthouse job failures" logs, and from there get the AppealSubmissionUploads with nil Lighthouse IDs to figure out which uploads failed.
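A rough sketch of that check, assuming each upload record carries some key that matches its file in the bucket. `FailedUpload`, `attachment_guid`, and the in-memory `bucket_listing` hash (object key to size in bytes) are hypothetical stand-ins for the real records and an actual S3 listing.

```ruby
# Stand-in for an AppealSubmissionUpload; field names are assumptions.
FailedUpload = Struct.new(:id, :attachment_guid, :lighthouse_upload_id, keyword_init: true)

# For every upload with a nil Lighthouse ID, report whether its file is in
# the bucket and how large it is.
# bucket_listing: Hash of object key => size in bytes (stand-in for S3).
def triage_report(uploads, bucket_listing)
  uploads.select { |upload| upload.lighthouse_upload_id.nil? }.map do |upload|
    size = bucket_listing[upload.attachment_guid]
    { upload_id: upload.id, in_bucket: !size.nil?, size_bytes: size }
  end
end

uploads = [
  FailedUpload.new(id: 1, attachment_guid: 'guid-a', lighthouse_upload_id: nil),
  FailedUpload.new(id: 2, attachment_guid: 'guid-b', lighthouse_upload_id: nil),
  FailedUpload.new(id: 3, attachment_guid: 'guid-c', lighthouse_upload_id: 'lh-789')
]
bucket_listing = { 'guid-a' => 204_800 } # only one file still present

report = triage_report(uploads, bucket_listing)
report.each do |row|
  puts "upload #{row[:upload_id]}: in_bucket=#{row[:in_bucket]} size_bytes=#{row[:size_bytes].inspect}"
end
```

A row with `in_bucket` false means the file cannot be re-sent from storage, so that upload needs a different remediation path.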

anniebtran commented 5 months ago

Update: Eugene got their AWS S3 access back, and we checked whether the files for those failed uploads still exist in the bucket. Unfortunately we didn't see them in there, so they must have gotten deleted somehow 🥴 Not sure what the next steps should be on this.

anniebtran commented 5 months ago

@saderagsdale not sure if you wanna follow up on this, but just wanted to make sure you're aware of this update ^

saderagsdale commented 5 months ago

Sade will add this to the mini-remediation list and share it with VBA.

saderagsdale commented 4 months ago

Discuss auditing process in bug triage.

anniebtran commented 4 months ago

Just confirming the status of this based on the question from standup: these files were not orphaned, because there are AppealSubmissionUpload records that tie the files back to the appeal itself. The issue is that the files don't exist in our S3 bucket (see the comment above), so we can't manually re-upload them to Lighthouse.

anniebtran commented 4 months ago

@saderagsdale I know there was a note on Jan 29 that says you added it to the mini-remediation list. I wasn't sure where that was, so just in case, I added the Supplemental Claim UUID for this incident to the SharePoint file that had the SCs affected by the AWS outage (https://github.com/department-of-veterans-affairs/va.gov-team/issues/72199) and renamed the file to "Supplemental Claims with evidence upload issues".