department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit the complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[bug] 526ez | Embedded multibyte characters in form data received can cause EVSS field length validation claim submission rejections #42476

Open michelpmcdonald opened 2 years ago

michelpmcdonald commented 2 years ago

Value Statement

As a developer, I want to make the web forms as easy to use as possible for vets and bridge the gap between the modern va.gov website and the established VA backend systems,
so that the vet's form-field validation is as simple and straightforward as possible and does not require knowledge of text encodings.


Background Context

Ticket 41746 hit a subtle issue where the vet entered 500 characters into a 526 form field that has a max length of 500 characters.

When the user hits this issue, the claim is rejected by EVSS, so the claim is never established with the VA.

The core issue is that the vet used multibyte, non-ascii utf-8 single quotes in the field, in the words "I'm" and "don't". Visually, the text is 500 characters long, but since the quotes were utf-8 smart quotes, they take up multiple bytes in the underlying raw data.
Ascii characters (a, b, c, d, ...) use one byte of underlying raw data.

The visual length of the field is 500, so it passed the website and vets-api visual character length checks, but EVSS must check the underlying raw byte length, and the non-ascii quotes the vet put in the form brought the byte length of the field to 504, 4 bytes too many as far as EVSS is concerned:

{
  "messages" : [ {
    "key" : "form526.veteran.homelessness.homelessnessRisk.otherLivingSituation.Size",
    "severity" : "ERROR",
    "text" : "size must be between 1 and 500"
  } ]
}

Special, generally unknown keyboard sequences are required to "type" non-ascii characters using a keyboard ([Smart Quotes](https://smartquotesforsmartpeople.com/)), so I don't think the source is the vet typing directly from their keyboard into the web form.

Impact: We have seen this type of error ~8 times since June

Acceptance Criteria

Tasks

Definition of Ready



saderagsdale commented 2 years ago

@Mottie to look into FE options and make a recommendation.

saderagsdale commented 2 years ago

Background context:

The most likely source of the vets' multibyte characters is a word processor or text editor.
It's common for text editors and word processors to use utf-8 by default and automatically convert ascii typed quotes into utf-8 multibyte left and right quotes (smart quotes, see above). I think the users are copying the text from their editor application and pasting it into the web form, and the copy-paste brings in the non-ascii characters. It's a known general issue referred to as "smart quotes", usually accompanied by an eye roll from programmers....

The fix should not involve attempting to inform the user of the issue and directing them to self-fix. I think that sort of approach is not going to be understood by a lot of users.

Currently, on the website, the field character length limit is enforced by not allowing the user to type any additional characters into the form field beyond 500; any characters the user attempts to type past that are ignored.

I do think our fix should be automatic and unobtrusive, and most likely would not involve the front end.

The vets-api might be the place, and the fix would involve converting the multibyte characters into roughly equivalent ascii characters that take up one byte each. The issue is twofold, and there is no standard, well-known, single-approach solution.

The first issue is identifying the current encoding, which is tough. For the most part, text bytes that represent characters don't carry any standard embedded metadata that specifies the text encoding (ascii, utf-8, windows-1252, and mac roman are some common end-user encodings I've seen here at the VA). Text is just an underlying sequence of bytes, and the specific encoding used to interpret the bytes into visual characters needs to be "known" by creators and users of the file. So, in theory, somebody supplying a text file would need to "tell" the recipient the encoding. In practice, most applications that deal with text files have embedded logic that tries to figure out the encoding, and some users might experience an application's wrong encoding guess as odd-looking visual characters like Â’.
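As a quick illustration of why detection is required (a sketch in plain Ruby, not vets-api code): the same raw byte means different characters under different encodings, and nothing in the byte itself says which one is right.

byte = "\x92".b  # one raw byte, 0x92, with no encoding information attached

byte.force_encoding('Windows-1252').encode('UTF-8')  # => "’" (right single quote)
byte.force_encoding('macRoman').encode('UTF-8')      # => "í"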

Once the encoding of the source is known, it is somewhat easy to convert common multibyte characters (for the most part, it's punctuation where I see this) down into single byte ascii. For example, the utf-8 left single quote and right single quote can be downgraded to the ascii single quote (ascii does not have left-right single quotes). But not all utf-8 characters map down that cleanly; some don't have a single byte ascii equivalent. An example I've hit here at the VA is the utf-8 section sign § (it appears in VA documents that vets copy-paste into text files and fields). It has no ascii equivalent, and the ruby re-encoding functionality available in vets-api converts it to ascii "SS". That won't work for us in this case, because it takes one visual character and converts it into two ascii characters, and so does not actually reduce the size of the underlying byte data.
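To make the clean and unclean cases concrete, here is a sketch using plain Ruby's String#encode (not whatever vets-api helper produces the "SS" behavior mentioned above):

"’".encode('US-ASCII', fallback: { "’" => "'" })  # => "'"  clean: 1 char, 1 byte
"§".encode('US-ASCII')                            # raises Encoding::UndefinedConversionError
"§".encode('US-ASCII', undef: :replace)           # => "?"  lossy, but still 1 byte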

So, there is no silver bullet solution here, going to be some work on this one.

Converting multibyte encodings down to single byte ascii is already done in spots in vets-api. Here is an example; I don't think this will work in some cases, thus the Encoding::UndefinedConversionError rescue: normalize_text

EVSS_TEXT_ENCODING = 'ascii' # EVSS only accepts text files written in ASCII
.......
.......
def normalize_text
    return unless file_name.match?(/\.txt$/i) && file_obj

    text = file_obj.read
    text = text.encode(EVSS_TEXT_ENCODING)
    file_obj.tempfile = Tempfile.new(encoding: EVSS_TEXT_ENCODING)
    file_obj.tempfile.write text
    file_obj.tempfile.rewind
  rescue Encoding::UndefinedConversionError
    errors.add(:base, I18n.t('errors.messages.uploads.ascii_encoded'))
  end

The issue with this code is that the initial file read assumes the text file is utf-8 by default, since the encoding is not explicitly provided to the read call:

text = file_obj.read

The next line could fail or re-encode the text incorrectly, since the encoding of the file is assumed to be utf-8. In some cases the utf-8 assumption is correct and the "encode" to ascii will work, but if the text is NOT utf-8, the resulting text could be garbled with odd characters like the black diamond of death: �
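A sketch of that failure mode (hypothetical bytes, plain Ruby): a Windows-1252 smart quote read under the default utf-8 assumption is simply an invalid byte sequence.

data = "don\x92t".b           # bytes as they'd arrive from a Windows-1252 file
data.force_encoding('UTF-8')  # what an unqualified read effectively does

data.valid_encoding?          # => false (0x92 is not valid utf-8)
data.encode('ascii')          # raises Encoding::InvalidByteSequenceError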

So far, all the code examples I've run across in vets-api seem to have the same potential problem of not attempting to identify the current text encoding before re-encoding the text down to single byte ascii.

Here is some code I've been working on. It identifies the current file encoding using a heuristic approach (interpret the file in different encodings, count valid encoding-specific punctuation characters, and the encoding with the most valid punctuation characters wins) and then re-encodes the file to ascii (EVSS only accepts ascii text file uploads):

# macRoman quotes
MR_LEFT_DOUBLE_QUOTE = 0xD2
MR_RIGHT_DOUBLE_QUOTE = 0xD3
MR_LEFT_SINGLE_QUOTE = 0xD4
MR_RIGHT_SINGLE_QUOTE = 0xD5
MR_SECTION_SIGN = 0xA4
MAC_ROMAN_QUOTES = [MR_LEFT_DOUBLE_QUOTE, MR_RIGHT_DOUBLE_QUOTE, MR_LEFT_SINGLE_QUOTE, MR_RIGHT_SINGLE_QUOTE, MR_SECTION_SIGN]

# Windows-1252 quotes
W1252_LEFT_SINGLE_QUOTE = 0x91
W1252_RIGHT_SINGLE_QUOTE = 0x92
W1252_LEFT_DOUBLE_QUOTE = 0x93
W1252_RIGHT_DOUBLE_QUOTE = 0x94
W1252_SECTION_SIGN = 0xA7
W1252_QUOTES = [W1252_LEFT_SINGLE_QUOTE, W1252_RIGHT_SINGLE_QUOTE, W1252_LEFT_DOUBLE_QUOTE, W1252_RIGHT_DOUBLE_QUOTE, W1252_SECTION_SIGN]

# Single byte encodings
SINGLE_BYTE_ENCODINGS = { 'Windows-1252': W1252_QUOTES, 'macRoman': MAC_ROMAN_QUOTES }

# UTF-XXXX bom marker
UTF_BOM = "\uFEFF"

# UTF encodings with bom
BOM_ENCODINGS = ["UTF-32BE", "UTF-32LE", "UTF-16BE", "UTF-16LE"]

def get_score(encoding_quotes, file_tally)
    encoding_quotes.reduce(0) { |sum, element| sum + file_tally.fetch(element, 0) }
end

# Checks for common BOM-encoded text files
# If a BOM is detected, returns the specific BOM encoding, otherwise nil
def get_bom_encoding(file_data)
    BOM_ENCODINGS.find do |utf_encoding|
        bom = UTF_BOM.encode(utf_encoding)
        # grab the first unicode char, check if it's a BOM
        file_data.bytes.first(bom.bytesize).pack('C*').force_encoding(utf_encoding) == bom
    end
end

# Checks for common single byte encoded text files
# Sums quotes in commonly seen single byte encodings, most quotes win
# If no quotes, nil
def get_single_byte_encoding(file_data)
    # Get count of each distinct byte
    byte_counts = file_data.bytes.tally

    encoding = nil
    quote_count = 0
    SINGLE_BYTE_ENCODINGS.stringify_keys.each do |target_encoding, encoding_quotes|
        score = get_score(encoding_quotes, byte_counts)
        if score > quote_count
            quote_count = score
            encoding = target_encoding
        end
    end
    encoding
end

def detect_set_encoding(file_data)
    # Detect current encoding
    detected_encoding = nil
    if (detected_encoding = get_bom_encoding(file_data))
        # set encoding and get rid of leading bom marker
        file_data.force_encoding(detected_encoding).slice!(0)
    elsif (detected_encoding = get_single_byte_encoding(file_data))
        file_data.force_encoding(detected_encoding)
    end
    file_data
end

def convert_to_ascii(file_data) 
    # start with common utf-8
    file_data.force_encoding("UTF-8")

    # if all bytes are already ascii only, standardize newlines and we are done
    return file_data.encode('US-ASCII', universal_newline: true) if file_data.ascii_only?

    # Detect current encoding
    file_data = detect_set_encoding(file_data)

    # now that the file encoding is known and set, convert to utf-8 because that
    # is the encoding the final ascii conversion below expects
    file_data.encode!("UTF-8", universal_newline: true, invalid: :replace, undef: :replace, replace: "*")

    # encode does not convert Unicode quotes to ascii quotes,
    # since unicode quotes are the most common issue we find in txt file uploads
    # manually convert
    fallback = {"\u2019" => "'", "\u2018" => "'", "\u201C" => '"', "\u201D" => '"'}
    file_data.gsub!( /[\u2019\u2018\u201C\u201D]/, fallback)

    file_data.encode("US-ASCII", universal_newline: true, invalid: :replace, undef: :replace, replace: "*")
end

def test(file_name, expected_encoding)
    # binread avoids tagging the data with a default encoding and leaking a file handle
    d = File.binread(file_name)
    x = detect_set_encoding(d)
    raise "**************\nError, expected #{expected_encoding}, got #{x.encoding}" if x.encoding.to_s != expected_encoding
    x = convert_to_ascii(d)
    # puts x
end

test('mac_roman_test_file.txt', 'macRoman')
test('windows_1252_test_file.txt', 'Windows-1252')
test('utf_8_test_file.txt', 'UTF-8')

I think any text form field with a max length limit that is enforced in EVSS has the potential to hit this issue.

I'm kind of wondering how many EVSS claim rejects are caused by this. For this specific field, I'm aware that it was hit 24 times in 2022 so far.....

scottymeyers commented 2 years ago

I believe we circled back and confirmed that this would be better handled on the back end? Sound right @Mottie ?

Mottie commented 2 years ago

Yes, @michelpmcdonald mentioned that the tools for the backend are much better at handling this situation

michelpmcdonald commented 2 years ago

This seems to be a rare issue; however, a small number of vets will most likely continue to hit it until we address it. I suspect we should apply the fix to all 526 text fields that users can edit.

scottymeyers commented 2 years ago

I've got a script going using the popular rchardet gem, which is utilized by other gems used in vets-api, but not directly. This gem allows us to determine a string's encoding, with a certain confidence. So we could check if a supplied string isn't utf-8, and if our confidence is high enough, say 50% or higher, we could then convert it to utf-8. We'd likely need to seek permission to add this gem if we take this approach.

We're concerned about this particular field on this particular form, but where does it end? Should we apply this check to all fields within this form that have a max length, since this error comes back from EVSS? Hopefully we can just concern ourselves with this particular field for the time being. But if this occurs again for other fields, we'll likely want to seriously consider similar treatment for all max-length fields sent to EVSS.

michelpmcdonald commented 2 years ago

I suspect we should apply the fix to all 526 text fields that the users can edit

saderagsdale commented 2 years ago

@scottymeyers @michelpmcdonald Hi both! This issue isn't listed in the last sprint planning doc or the current one. Does this require our attention now or are there other priority bugs/feature work that should take priority?

michelpmcdonald commented 2 years ago

So we could check if a supplied string isn't utf-8, and if our confidence is high enough, say 50% or higher, we could then convert it to utf-8.

To properly convert it, it's not enough to detect that it's not utf-8; we have to figure out its current encoding. Also, we need to convert the text to ASCII before sending it to EVSS.

Utf-8 is still multibyte: since EVSS counts bytes, a single utf-8 character can count as 2 (or more) bytes when EVSS determines the text length. That is the core problem. Visually, the user sees 1 character on the screen in the text field, but the data underlying the screen representation is two bytes, and since EVSS counts bytes, not "screen length", EVSS calculates a length of 2. To align the two different ways of counting text length (the visible screen length of 1 vs. the underlying byte count of 2), we ultimately need to force the text back into single byte ASCII, so that the screen length of 1 will align with the byte count of 1.
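To put that in code (a sketch, plain Ruby): even perfectly valid utf-8 fails the byte-count check, which is why ascii is the only target encoding where the two counts are guaranteed to agree.

s = "“"                  # one character on screen, valid utf-8
s.valid_encoding?        # => true
s.length                 # => 1 (what the front end counts)
s.bytesize               # => 3 (what EVSS counts)

a = s.encode('US-ASCII', fallback: { "“" => '"' })
a.length == a.bytesize   # => true (in ascii, 1 character is always 1 byte)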

scottymeyers commented 2 years ago

To properly convert it, it's not enough to detect that it's not utf-8 encoding, we have to figure out it's current encoding.

That's what I meant. Like, if this gem says it's more than, say, 50% confident that the string is some particular encoding, we can go ahead and treat it as that encoding when we convert it to utf-8.

scottymeyers commented 2 years ago

@michelpmcdonald w/ rchardet:

require 'rchardet'

str = '“'
# str = 'some string' # ascii

result = str

# encode to something other than utf-8
new_str = str.encode('windows-1254')
char_det = CharDet.detect(new_str) # {"encoding"=>"windows-1252", "confidence"=>0.5}

if char_det['confidence'] >= 0.5 && char_det['encoding'] != 'utf-8'
  result = str.encode!(char_det['encoding']).encode!('utf-8')
end

puts "result: #{result}"
puts CharDet.detect(result)

Output:

{"encoding"=>"windows-1252", "confidence"=>0.5}
result: “
{"encoding"=>"utf-8", "confidence"=>0.505}
michelpmcdonald commented 2 years ago

Although this issue is pretty rare as far as I can tell, it is closely related to the much more common text file uploads issue: 37947

In that ticket, the users are uploading text files with multibyte character set encodings. There is no length limit on text file uploads, but EVSS rejects text file uploads with multibyte character sets, so the underlying solution of detecting the current encoding and downgrading to ascii-only is common to both the text file upload issue and this form field issue...

michelpmcdonald commented 2 years ago

we can go ahead and consider it that encoding when we convert it to utf-8.

Just to be clear, our final target encoding is not utf-8, it's ascii. Ascii is a subset of utf-8, so another way to state our target is "utf-8 that only contains ascii characters".
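In other words (a one-line sanity check in Ruby):

"that's all folks".ascii_only?  # => true: already valid ascii AND valid utf-8
"that’s all folks".ascii_only?  # => false: valid utf-8, but not ascii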

saderagsdale commented 2 years ago

@scottymeyers @michelpmcdonald Pinging on this again:

Hi both! This issue isn't listed in the last sprint planning doc or the current one. Does this require our attention now or are there other priority bugs/feature work that should take priority?

michelpmcdonald commented 2 years ago

@saderagsdale detecting and then downgrading multibyte characters into ASCII that EVSS will accept comes into play in two known situations, in the forms and in supporting evidence text file uploads.

Multibyte characters in forms causing EVSS field length validation issues are pretty rare; it's come up as a production issue fewer than ~10 times that I'm aware of.

However, multibyte characters in supporting evidence uploads are a huge problem; to date, we have hit this around 1850 times.

The reason I'm associating these two issues is that they share the same underlying problem, so it might be appropriate to share the code that we end up creating to solve it (detect current encoding, downgrade to ascii-only encoding).

tblackwe commented 1 year ago

There has been a lot of discussion in a Slack thread about this issue; copying in @joshkimux's summary. The question of how to handle this scenario is being forwarded to design for additional input.

In the 526ez flow, there are inputs with character limits, like a text area where Veterans can provide supporting evidence. (Multibyte characters in supporting evidence uploads are a problem as well.) Take an input that has a 500 character limit:

- It will appear as though the Veteran is within the 500 character limit, but
- when submitted to the backend, the text will go over that character limit and get rejected,
- e.g. a smart quote (“) takes up more bytes than a regular ascii quote (").

How can we fix this? We've thought of:

- Adjusting the submission in the backend, e.g. automatically swapping the smart quote with a regular ascii quote after the user submits
- Adjusting the submission in the frontend, e.g. as the user types, the character is replaced
- Providing error validation in the frontend, e.g. when the user attempts to submit, they're told why they can't use X characters and asked to replace them (or maybe we could provide a method to automatically fix it, but that would involve more work I'm guessing)

Why this gets tricky:

- Automatically adjusting submissions could lead to language-related changes, e.g. ÿ -> y or ñ -> n
- If we update a submission in the backend, users will not know this change has been made
- If we update a submission in the frontend and it's done live as they type, screen reader users will not be alerted of the change (unless we put together a really tedious contraption to do so)
- If we provide error validation explaining why we can't accept certain characters, it may create a ton of friction to go back and fix all the issues
- Even if we fixed it a little more upstream in Lighthouse, the multi-byte issue will persist since the upstream service (BGS) only accepts ASCII characters

Possible questions:

- If we were to red team this or think of all the worst possible unintended outcomes, what are we looking at? e.g. could small character changes within a name or details in supporting evidence hurt a Veteran's claim or cause confusion? Could small character changes to language, e.g. ñ -> n, give the impression of an English-speaking bias to Veterans for whom English isn't their first language?
- Is this a binary between correcting all characters or none? Or can we write in exceptions that may be more sensitive?

joshkimux commented 1 year ago

I've created a ticket for the sitewide content team to review ❤️

lydiahooper commented 1 year ago

Note that in the 526ez flow, these are the pages with text areas where there are lengthier character limits (not supporting evidence)...

If the user is filing a new claim or adding a new condition: https://www.va.gov/disability/file-disability-claim-form-21-526ez/disability/file-disability-claim-form-21-526ez/new-disabilities/follow-up (This is part of a loop pattern, so there are several URLs depending on how many new conditions)

If the user indicates they want to also apply for Individual Unemployability (Form 21-8940 and Form 21-4192) while in the 526 flow:
https://www.va.gov/disability/file-disability-claim-form-21-526ez/disability/file-disability-claim-form-21-526ez/hospitalization-history
https://www.va.gov/disability/file-disability-claim-form-21-526ez/disability/file-disability-claim-form-21-526ez/doctor-care
https://www.va.gov/disability/file-disability-claim-form-21-526ez/disability/file-disability-claim-form-21-526ez/recent-earnings
https://www.va.gov/disability/file-disability-claim-form-21-526ez/disability/file-disability-claim-form-21-526ez/unemployability-additional-information

Mottie commented 1 year ago

The page asking for the Veteran's address also includes inputs with pretty severe character limits.

RakshindaAslam commented 1 year ago

cc @tblackwe - need the following for review: Impact summary

joshkimux commented 1 year ago

@RakshindaAslam @tblackwe quick update! I followed up with @DanielleThierryUSDSVA on this, who drafted some recommendations for how we can start working on this now and later ❤️

Now:

Next:

Later:

RakshindaAslam commented 1 year ago

@sethdarragile6 @mchae-nava - can we check if

mchae-nava commented 1 year ago

LH Thread: https://dsva.slack.com/archives/C02CQP3RFFX/p1688739944968979

Impact summary

Description: Text fields pass FE validation but fail once they hit EVSS, due to the encoding of special characters. This is applicable to any text field with a character limit.

How many users: Can't find great metrics to zero in on a number of affected users; however, it appears to be a rare issue, very easy to reproduce, with the potential to impact any user.

Is there a workaround? If the submission is failing, is it getting submitted via the paper form? Communicating with LH to see if we can have it resolved on their end.

Severity assigned (sev-critical / sev-major / sev-moderate / sev-minor): sev-minor. I vote somewhere between moderate and minor. Metrics-wise, absolutely minor, but it has high potential to affect any user, and it will cause a submission to be rejected with no good feedback.

austinjprice commented 1 year ago

@mchae-nava to follow up with EVSS to get logs to determine number of times this occurs for assessment

mchae-nava commented 1 year ago

Thread opened here: https://dsva.slack.com/archives/C8R3JS8BU/p1689088601774989 Will update my above assessment as it progresses

mchae-nava commented 1 year ago

Assessment updated with EVSS findings. Can follow up with more re: Kyle's error aggregation

austinjprice commented 1 year ago

@mchae-nava can you confirm if this error/failure is silent to the backup process? E.g. can you confirm that when this occurs, a pdf is generated and sent through central mail?

sethdarragile6 commented 1 year ago

Michael Harlow replied on the above thread with a potential fix (for form submission). It should be a fairly trivial fix on our side, but we would need him (or the current EVSS devs) to work with us to ensure that changing the encoding doesn't have unexpected side effects on their side.

RakshindaAslam commented 1 year ago

If there is a potential testing issue with EVSS, consider getting this change included/tested as part of the Lighthouse migration.

pacerwow commented 4 months ago

@sethdarragile6 do you know if this will be solved as a result of migrating to Lighthouse endpoints?

sethdarragile6 commented 4 months ago

@sethdarragile6 do you know if this will be solved as a result of migrating to Lighthouse endpoints?

@pacerwow I do not, and it would take some digging to determine. The first question I have (maybe @mchae-nava or @tblackwe can confirm) is whether we ever attempted to fix this issue against EVSS. Probably not, since this ticket is still open, but you neva know. But in light of the fact that LH has encountered issues with us sending unicode/special characters during TE dev testing, I would guess that it could well be an issue for them as well.

Mottie commented 4 months ago

comment from above:

Even if we fixed it a little more upstream in Lighthouse, the multi-byte issue will persist since the upstream service (BGS) only accepts ASCII characters

michelpmcdonald commented 4 months ago

Rough issue.

For the forms on va.gov, I've speculated that it's hard to actually type multibyte characters (for example, on my mac keyboard, to get a smart quote like this: “ it's the option key + [). I don't think many users actually know that, or actually do that.

I think that users are copying and pasting from apps like Word, where it's common to have auto smart quotes enabled: the user types ", but Word auto-converts it to “.

Disabling pasting into text boxes was a (bad) thought I had???

^That won't help: 526 evidence uploads suffer from the same issue, probably from the same source (copy/pasting or exporting from Word and similar apps).

For evidence uploads, these are the encodings that I've commonly encountered: utf-8, windows-1252, macRoman. These were around too but pretty rare: UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE.

Mottie commented 4 months ago

On a mac, you can press and hold a letter, e.g. e, to open a popup with the letter + diacritics, so it's not that hard.

[Image: letter e with a popup showing 9 diacritic variations]
aurora-a-k-a-lightning commented 3 months ago

An update on this from the EVSS-to-Lighthouse migration perspective: using the Lighthouse submit endpoint(s), we are able to send “smart” quotes into the aforementioned field: homeless.currentlyHomeless.otherDescription

We will follow-up with other fields in our testing moving forward.

"currentlyHomeless": {
          "homelessSituationOptions": "FLEEING_CURRENT_RESIDENCE",
          "otherDescription": "‘“that’s” all folks!"
        },
