department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
283 stars 204 forks source link

[ISSUE] EVSS_15004 file type errors from EVSS #60157

Closed va-albers closed 7 months ago

va-albers commented 1 year ago

Instructions

Please fill out the form below, providing as much detail as possible.

Issue Description

EVSS Responds to EVSS::DisabilityCompensationForm::SubmitUploads calls with message: EVSS::ErrorMiddleware::EVSSError { "key"=>"EVSS_15004", "text"=>"We were unable to complete your request because your file type is not a supported format. Try using a different file type such as PDF (unlocked), JPEG, BMP or TXT. If you still have problems, call the VA at 1-800-827-1000 (711, if you use TDD), Monday - Friday. 8:00 am - 9:00 pm ET. When you call, provide the following error code: EVSS_15004.", "severity"=>"ERROR" }

Although the EVSS system suggests using extensions like "TXT", there are examples of file names that end with "txt" that fail (see file upload error Sentry log

It appears that the system will repeatedly retry these pages, although the error isn't one that can be fixed by replay? For instance the issue listed above notes that it is on retry_count 13.

Information needed to assess the severity of the issue

Information that aids in the research and troubleshooting of the issue

Any Additional Information:

tblackwe commented 1 year ago

The errors show that txt files were attempted to upload. We will want to research if the Backend prevents various txt file encodings from being submitted. We have what we need to research further.

RakshindaAslam commented 1 year ago

cc @tblackwe - need the following for review Impact summary

tblackwe commented 1 year ago

This is a silent failure for claims and needs to be a high priority fix.

We need to determine a mechanism to catch this whenever possible inside the vets-api submission. even if we can catch a percentage of the occurrences it will be an improvement.

@austinjprice please add it to this sprint's backlog

austinjprice commented 1 year ago

@tblackwe added to next items on board for discussion tomorrow refinement

NB28VT commented 1 year ago

Ok took a close look at the ticket, I think I have plenty to dig into this. Basic plan of attack will be:

Only question I would have now would be:

kylesoskin commented 1 year ago

and why the retry count is so high (14) 14 is the number of retires that most closely gets to a 24 hour period, with sidekiq's default retry logic. It actually may be 13, which is what the Submission retry is set to, but was originally set to 14 for some reason and never changed.

The submission job has logic to detect if the error falls into certain types of known failure cases that are known to not be able to succeed, no matter how many retries it does, and not retry in those cases (immediately send backup submission).

The retries I think are or were meant to smooth out transient issues such as network issues or outages, that are assumed to be external and resolved eventually (like downstream systems being down).

However I do not think the upload job has this same logic.

Track down any vets-api validation of the document's doc type, and if we have none locate where we would add them.

@Mottie would know where this is happening probably.

do we know who we would talk to on the EVSS team that could share the validation on their side and what doc types they disallow that will throw this error?

@kyle-drewno is who I was told was taking over the EVSS position from the developer I used to most closely work with at EVSS.

Is this potentially documented anywhere?

There is a EVSS 526 endpoint validation schema that I was sent at some point. It may have this info on it.

Mottie commented 1 year ago

The frontend checks that the file extension matches the file signature in the FileField component, but txt files don't have any signature (except for Windows BOM, see screenshot - src), so they are not checked; but the txt file MIME-type is checked on the frontend.

Screenshot 2023-07-14 at 1 54 49 PM

This screenshot shows a hex editor examining a tiny_file.txt file (4 bytes total size) on my Mac

Screenshot 2023-07-14 at 2 00 20 PM

The backend is using mimemagic, so maybe there's a bug in that gem?

NB28VT commented 1 year ago

Wow thanks @kylesoskin and @Mottie this is all very helpful

NB28VT commented 1 year ago

Documenting notes from a discussion with @SamStuckey on this, since his logging work meant he had some insights/theories on how the documents flow works globally on our side.

Insights:

Suggestions:

NB28VT commented 1 year ago

We ended up closing this related ticket around rejections for files with too much text, as it is having a relatively small impact. But noting it here as it's related.

RakshindaAslam commented 1 year ago

cc @aurora-a-k-a-lightning @sethdarragile6 @navaeddie @mchae-nava @christinec-fftc - fyi

NB28VT commented 1 year ago

PS may be worth looking at the mimemagic logic in the claim document uploader, as I think that is obscuring our validation strategy to a certain extent.

NB28VT commented 1 year ago

@kylesoskin I forgot to ask - if we can't track down documentation/a subject matter expert on this EVSS endpoint, is it possible for us to inspect the validation code on their side ourselves? Is it in a public repo?

NB28VT commented 1 year ago

Posting some additional new findings/an update from a very helpful discussion with @va-albers and @SamStuckey at the end of last week.

Main Questions to Answer -How does our file type validation differ in detail from the EVSS side? -Do we want to be retrying these failures 14 times or is this causing issues on the EVSS side? -Is it possible the issue with large file uploads is potentially throwing this error as well? -Why did certain specific files get rejected?

TODO/My play of attack

Findings Summary

-We have a few upload issues between our system and EVSS, including an unrelated issue with different file size limits. Steve said we accept 150 MB files but anything larger than 50 MB on the EVSS side causes them to be split up, a process that can fail/become overloaded if we are constantly trying and retrying to upload. The 150MB size limit on our side is meant to make it easier for vets who may not know how to compress files. -According to Kyle Drewno from Lighthouse, "Here's how eBenefits describes the allowable file types: PDF (unlocked), GIF, TIFF, TIF, JPEG, JPG, BMP, and TXT. What they don't mention is that TXT files must be UTF-8 encoded otherwise you get the EVSS_15004 error."

NB28VT commented 1 year ago

Providing an update here:

@SamStuckey and I ended up discussing this during an unrelated Slack call this morning, and there are a couple potential courses of action (Sam correct me if I misunderstood):

  1. We manually re-encode txt files before sending them to EVSS. Sam had some concerns around this: we are potentially risking manipulating the contents of that personal data, we’re taking on some liability in reading and re-writing it, and we may need to require a user check a “release” permission type box to get this past the collab cycle.

  2. We can detect an invalid encoding immediately when the user attempts to upload the file and raise an error. This wouldn’t be a great UX but we could explain to them how to either re-encode the file properly, or more likely, export and upload it as a PDF with clear instructions on how to do so. Sam was leaning more strongly in this direction and I agree.

NB28VT commented 1 year ago

Noting there are two roughly adjacent errors we may want to consider addressing in a single epic alongside this one:

  1. Rejections for corrupted files

  2. Large files that EVSS needs to split potentially overwhelming their system. We currently cap uploads at 50MB as EVSS does, but for PDFs we allow files as large as 150MB because EVSS has a system to split them. However, this can overload their system and if some of them fail and we attempt to retry too many times, it can even bring that part of the system down (per @va-albers , not aware of a ticket but feel free to link)

sortizsh commented 1 year ago

Austin will make an epic for the above additional bugs today. Need to sync with Julie and Nichole to understand how this will affect other UX.

sortizsh commented 7 months ago

this was broken into smaller tickets