abtassociates / eva

Eva is a HUD application to aid HMIS Leads with data analysis. It is an open-source project intended for local use by HMIS Administrators in Continuums of Care (CoCs) around the U.S. and its territories.
GNU Affero General Public License v3.0

Impermissible Character Update - 12/14/2023 #464

Closed LaurenBianchi closed 7 months ago

LaurenBianchi commented 8 months ago

Hello and Happy Holidays!

I saw in the change log there was a push on 12/14 changing how impermissible characters are checked/flagged during submissions. We are getting reports of some characters that are permitted (' and `) that are getting flagged as impermissible characters.

I have submitted an AAQ to inquire about guidance for when there are responses that utilize characters with diacritical marks as they are not explicitly permitted but there is no further guidance on stripping/transforming the characters either.

Is it possible for this flag to be a warning rather than an error in the interim, so folks can push their CSV through for DQ checks, or would that cause the tool to crash again?

Thank you for your time and insight, LB

alex-silverman commented 8 months ago

Hi Lauren,

We have heard this feedback from a few users, so we are going to, at least temporarily, not reject files due to these impermissible characters. Eva has been updated accordingly.

I will note, I'm surprised Eva flagged a normal apostrophe (') as one such character. Right- and left-single quotation marks (` and ’), on the other hand, are intentionally flagged, as these can cause Eva to crash. Can you confirm that Eva was flagging normal apostrophes?

Alex

LaurenBianchi commented 8 months ago

Hi, Alex! Just wanted to reach out - I am waiting for word back from the customer for report parameters/project IDs to confirm the reported issue. My apologies for the delay.

alex-silverman commented 8 months ago

No problem. Thank you for following up!

LaurenBianchi commented 7 months ago

Hi @alex-silverman! I am seeing some odd behaviors in the file structure analysis. These are the impermissible characters that are being flagged: image

I have been reviewing the flagged lines within the CLS file, and some are valid (e.g., "…"), but others are less straightforward to review (e.g., " ,,,,,").

alex-silverman commented 7 months ago

Hi Lauren,

Eva is definitely correctly flagging the right-single quotation mark (e.g. the last 2 rows in your screenshot). And, after some investigation, I'm guessing all those "commas" are actually "single low-9 quotation marks" (U+201A).

My colleague, @kiadso, pointed me to a really helpful ASCII-converter website: https://www.branah.com/ascii-converter

When I type a comma into that I see this: image

But I found a comma-like character called the "Single low-9 quotation mark" that shows this: image

You can see the Hex/Unicode and Decimal codes are different. You can try copying one of the "commas" into that website and confirm that they're returning the same Hex and Decimal codes as the second screenshot.

I'm going to guess that the other characters being flagged are also non-ASCII and therefore marked as "impermissible". For example, the "non-breaking space" (U+00A0) looks just like a regular space, but isn't. That would probably explain those first 2 rows in your screenshot. And a horizontal ellipsis (U+2026) looks like three regular periods, but isn't.
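For anyone who would rather check locally than paste into the converter website, here is a minimal Python sketch that lists every non-ASCII character in a string along with its code point and Unicode name. The function name `describe_non_ascii` and the sample text are made up for illustration; this is not how Eva itself performs the check.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127  # ASCII covers code points 0 through 127
    ]


# Hypothetical sample containing look-alike characters discussed in this thread.
sample = "Stayed with friend\u00a0\u2026 client\u2019s choice\u201a"

for _, code, name in describe_non_ascii(sample):
    print(code, name)
```

A plain comma or a plain space produces no output here, so anything that does show up is a candidate for what Eva is flagging.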

LaurenBianchi commented 7 months ago

Hello, I used the character pulled directly and received the following in the converter: image

Additionally, there are instances when we receive ",,," errors and there are no characters that mimic the appearance of a comma within the column/line reference for the error. When I review the detail in the file structure analysis the "," characters also appear as being an expected comma in the ASCII converter interface. These commas appear to be separated on different lines - could this be caused by folks hitting enter while entering their data? image

alex-silverman commented 7 months ago

Oh my gosh, I'm being so silly. So Eva reports all impermissible characters within a given cell, separated by a comma. Based on your screenshots, I'm now guessing that the impermissible character is a non-breaking space, a non-ASCII carriage return (U+21B5), or a zero-width non-breaking space character (U+FEFF). Can you try copying the characters between the commas or, better yet, the entire string from Eva's output (so we get the impermissible characters and the commas separating them), into that converter and pasting the Hex results here? We should see a bunch of 0x2c with some other stuff in between. I'm guessing/hoping those other things will turn out to be these non-breaking spaces or the like.
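The same inspection can be sketched in Python. The snippet below assumes the copied output is a list of characters separated by an ASCII comma, optionally followed by an ASCII space; that separator format, the helper names `hex_codes` and `flagged_characters`, and the sample string are all assumptions for illustration, not Eva's actual code.

```python
def hex_codes(text):
    """Return the hex code point of every character, separators included."""
    return [hex(ord(ch)) for ch in text]


def flagged_characters(eva_output):
    """Split on ASCII commas and drop only ASCII spaces around each piece."""
    pieces = (part.strip(" ") for part in eva_output.split(","))
    return [piece for piece in pieces if piece]


# Hypothetical copied output: non-breaking space, zero-width no-break space,
# and another non-breaking space, separated by ", ".
pasted = "\u00a0, \ufeff, \u00a0"

print(hex_codes(pasted))
print([hex_codes(piece) for piece in flagged_characters(pasted)])
```

Note that `strip(" ")` is used rather than the default `strip()`: Python treats the non-breaking space as whitespace, so the default would silently remove the very character being investigated.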

LaurenBianchi commented 7 months ago

From: Found impermissible character(s) in CurrentLivingSituation.csv, column 13, line 770: , , , image

Can we allow these characters (a non-breaking space, a non-ASCII carriage return (U+21B5), or a zero-width non-breaking space character (U+FEFF))? In my experience across data entry, system administration, and vendor data analysis, these characters are used often enough that transforming the data in code, and so veering from a SSOT, would not be ideal. Asking users to change how data is entered, or requesting that they fix it, is also not ideal. If these characters don't cause Eva to crash, I would love to advocate for them to be accepted for data review.

alex-silverman commented 7 months ago

Okay, so just to confirm, it does look like these are non-breaking spaces, zero-width spaces, and maybe non-ASCII carriage returns. So at least we know what's happening!

I do appreciate your request for allowing these characters. This is why we made these "errors" rather than "High Priority", the latter of which would prevent you from using Eva. In this way, you can use Eva as normal, despite these issues. We do still recommend keeping to ASCII characters as these are more standardized across various platforms. And I should warn you that certain non-ASCII characters can cause problems for Eva. In our experience right-single quotation marks have crashed Eva, and these may be in 101 of your records, based on your first screenshot, though it's hard to say without seeing the character codes.

I hope this helps!

Given that we do still prefer ASCII characters but Eva should still allow many non-ASCII characters, I'm going to close this issue.

LaurenBianchi commented 7 months ago

For reference purposes down the line if/when CSV Specifications are updated, I would like to note that the "`" character (ASCII Code 96/ Unicode U+0060) and spaces (which can be interpreted in my opinion to allow the characters we discussed yesterday) are permitted per the CSV Specifications. image The only characters that are explicitly impermissible requiring data transformation are < > [ ] { }
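To make the distinction concrete, here is a rough Python sketch that separates the two groups discussed in this thread: the bracket characters named above as explicitly impermissible, and non-ASCII characters that the specifications do not explicitly address. The `classify` helper and its return shape are hypothetical and do not reflect Eva's implementation.

```python
# Characters named in this thread as explicitly impermissible.
EXPLICITLY_IMPERMISSIBLE = set("<>[]{}")


def classify(text):
    """Group the distinct problem characters in text into two buckets."""
    return {
        "explicit": sorted({ch for ch in text if ch in EXPLICITLY_IMPERMISSIBLE}),
        "non_ascii": sorted({ch for ch in text if ord(ch) > 127}),
    }


# The backquote (U+0060) and the ASCII space land in neither bucket.
print(classify("Bob`s <note>\u00a0"))
```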

kiadso commented 7 months ago

Hi Lauren,

The Data Standards team has this on its agenda for the FY2026 specifications, and we really appreciate your feedback and help in breaking this down. While the current specs do not explicitly state that the allowable characters are the ASCII versions, I have confirmed that that was the intent.

Eva should not be flagging the backquote (`), nor should it be flagging ASCII spaces and any other ASCII characters. Just to confirm: is Eva flagging ASCII characters?

If Eva is flagging ASCII characters that are explicitly allowed, that is an issue we will work on right away. If not, this is a chance for the Data Standards team to reassess what standards are reasonable going forward and to define them clearly in FY 2026.

Thanks again for all your help with this, @LaurenBianchi!

kiadso commented 7 months ago

Hi, I'm going to close this, but please re-open if you feel Eva is flagging any ASCII characters that are allowed. In the meantime, the Errors that Eva is throwing will not cause anyone to lose the ability to use Eva.

Thanks again!