Open anniebtran opened 5 months ago
Slack thread started with Lighthouse here
What we know after some investigation:
AppealSubmissionUploads without LH IDs: this makes sense because the error happened before we update these records with those IDs.
AppealsApi::EvidenceSubmission and VBADocuments::UploadSubmission records both get created on every job run/retry, so we at least know we're reaching this point in the Lighthouse code, but we are still trying to track down whether there are errors that aren't getting bubbled up before the ensure block with the .close gets called.
Update based on conversation with LH in the thread linked above:
VBADocuments::UploadSubmission records likely have an expired status because we failed to upload the file successfully within the 15 min time limit set for the upload URL.
Next step: ask Platform team to check whether the failed upload files exist in S3, to share general info about them (e.g. size), and to look into why they might fail to upload to Lighthouse. They can find the AppealSubmission via the user_uuid in the "Evidence upload to Lighthouse job failures" logs, and from there get the AppealSubmissionUploads with nil Lighthouse IDs to figure out which uploads failed.
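The lookup described above (user_uuid to AppealSubmission to uploads with nil Lighthouse IDs) can be sketched roughly as follows. This is a minimal illustration that uses plain Structs in place of the real ActiveRecord models; the attribute name lighthouse_upload_id and the helper failed_uploads_for are assumptions made for the example, not confirmed names from the codebase.

```ruby
# Plain-Ruby stand-ins for the real models. Field names are assumptions.
AppealSubmissionUpload = Struct.new(:id, :lighthouse_upload_id, keyword_init: true)
AppealSubmission = Struct.new(:user_uuid, :type_of_appeal, :uploads, keyword_init: true)

# Hypothetical helper: returns the uploads belonging to a user's submissions
# that never received a Lighthouse ID, i.e. the uploads that failed.
def failed_uploads_for(submissions, user_uuid)
  submissions
    .select { |submission| submission.user_uuid == user_uuid }
    .flat_map(&:uploads)
    .select { |upload| upload.lighthouse_upload_id.nil? }
end

submissions = [
  AppealSubmission.new(
    user_uuid: 'user-123',
    type_of_appeal: 'SC',
    uploads: [
      AppealSubmissionUpload.new(id: 1, lighthouse_upload_id: 'lh-abc'),
      AppealSubmissionUpload.new(id: 2, lighthouse_upload_id: nil)
    ]
  )
]

failed = failed_uploads_for(submissions, 'user-123')
puts failed.map(&:id).inspect
```

In a Rails console the same filter would presumably be a where clause on the nil column, but the exact association and column names would need to be checked against the schema first.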
Update: Eugene got their AWS S3 access back and we checked to see if those failed uploads had files that still existed in the bucket, and unfortunately we didn't see them in there, so they must have gotten deleted somehow 🥴 Not sure what next steps should be on this
@saderagsdale not sure if you wanna follow up on this, but just wanted to make sure you're aware of this update ^
Sade will add this to the mini-remediation list and share with VBA
Discuss auditing process in bug triage.
Just confirming the status of this based on the question from standup: these files were not orphaned because there are AppealSubmissionUpload records that tie the files back to the appeal itself. The issue is that the files don't exist in our S3 bucket (from comment above), so we can't manually try to re-upload to Lighthouse.
@saderagsdale I know there was a note on Jan 29 that says you added it to the mini-remediation list — I wasn't sure where that was so just in case, I added the Supplemental Claim UUID for this incident to the Sharepoint file that had the SCs affected by the AWS outage (https://github.com/department-of-veterans-affairs/va.gov-team/issues/72199) and renamed the file to "Supplemental Claims with evidence upload issues"
What happened?
On Jan 15th, we got an alert in our notifications channel about some evidence upload jobs (DecisionReview::SubmitUpload) failing permanently (retries exhausted). We were able to find the job ids for the evidence uploads that failed and some logs related to the failures (linked below).
Upon investigation, we were able to find the uploads that were not submitted to Lighthouse. We should be able to resubmit the evidence uploads by re-running the jobs on the AppealSubmissionUpload records associated with the user's AppealSubmission (they only have one, it's for SC, and is from Jan 15) that are missing Lighthouse ID references.
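As a rough sketch of that resubmission plan: select the upload records that are missing a Lighthouse ID and enqueue one job per record. FakeSubmitUpload below is a hypothetical stand-in for DecisionReview::SubmitUpload that only records what would be enqueued. The perform_async call shape, the lighthouse_upload_id attribute, and the requeue_missing helper are all assumptions for illustration. Re-running only helps if the source file is still available to upload.

```ruby
# Stand-in for an AppealSubmissionUpload row. Field names are assumptions.
UploadRecord = Struct.new(:id, :lighthouse_upload_id, keyword_init: true)

# Hypothetical stand-in for the real job class. It records the IDs it is
# asked to enqueue instead of talking to a job queue.
class FakeSubmitUpload
  @enqueued = []

  class << self
    attr_reader :enqueued

    def perform_async(upload_id)
      @enqueued << upload_id
    end
  end
end

# Enqueues one job per upload that has no Lighthouse ID yet and returns
# the IDs that were re-queued, so the operator can record what was retried.
def requeue_missing(uploads, job: FakeSubmitUpload)
  missing = uploads.select { |upload| upload.lighthouse_upload_id.nil? }
  missing.each { |upload| job.perform_async(upload.id) }
  missing.map(&:id)
end

uploads = [
  UploadRecord.new(id: 10, lighthouse_upload_id: 'lh-1'),
  UploadRecord.new(id: 11, lighthouse_upload_id: nil),
  UploadRecord.new(id: 12, lighthouse_upload_id: nil)
]

requeued = requeue_missing(uploads)
puts requeued.inspect
```

Returning the re-queued IDs makes it easy to note in the issue exactly which uploads were retried.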
What do we need to know?
What caused the errors, and how we can make these errors more easily visible for triaging.
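One possible way to make such failures visible, sketched under assumptions: log the exception in a rescue clause and re-raise it, so the error is recorded before the ensure block closes the file. The method name upload_with_visibility and the log message are hypothetical, and StringIO objects stand in for both the evidence file and the log destination.

```ruby
require 'logger'
require 'stringio'

# Hypothetical wrapper around an upload step. The rescue clause logs and
# re-raises, so the failure is both recorded and still propagated to the
# caller (for example, to a job framework's retry handling).
def upload_with_visibility(file, logger)
  yield file
rescue StandardError => e
  logger.error("Evidence upload failed: #{e.class}: #{e.message}")
  raise
ensure
  # Runs whether or not the block raised. Avoid an explicit `return` here:
  # in Ruby, returning from an ensure block discards the in-flight exception.
  file.close
end

log_output = StringIO.new
logger = Logger.new(log_output)
file = StringIO.new('evidence bytes')

begin
  upload_with_visibility(file, logger) { |_f| raise IOError, 'upload URL expired' }
rescue IOError => e
  puts "re-raised: #{e.message}"
end

puts file.closed?
puts log_output.string.include?('upload URL expired')
```

If the close call itself can raise, that new exception would replace the original one, which is another way an underlying error could go unseen.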
What we've tried in order to debug the error / what we do know
We tried to re-run the job manually for one of the AppealSubmissionUploads, but we were interrupted by an incident on production that restarted the pods and possibly kept the job from running properly. We will likely need to manually queue up the job again and finish the remaining uploads.
Action items
Slack threads and background context
Definition of Ready
Definition of Done
Out of scope