USEPA / EPA_Environmental_Dataset_Gateway

U.S. EPA’s Metadata Catalog
https://edg.epa.gov

Non-ASCII character problems #25

Closed torrin47 closed 6 years ago

torrin47 commented 6 years ago

See the long thread below. Oddly, special characters are handled correctly on the details page but not via the REST API. Can this be fixed so that the API uses UTF-8 appropriately?

From: Felsher, Maxwell (CGI Federal)
Sent: Tuesday, June 27, 2017 10:52 PM
To: Hultgren, Torrin
Cc: Montilla, Alex; Greene, Ana; Holochwost, Bill (CGI Federal); Aguilar, Katherine R (CGI Federal); Lewis, John E (CGI Federal); Sun, Tian (CGI Federal); Petriccione, Nicholas M (CGI Federal); Pierson, Suzanne
Subject: RE: Display of non-ASCII characters in titles on EDG pages

Hi Torrin,

I'm not sure exactly what you mean, but if you're saying these characters can't be expressed as UTF-8, I think that's incorrect. U+201C (http://www.fileformat.info/info/unicode/char/201C/index.htm) can be expressed as the bytes 0xe2 0x80 0x9c, and if you make sure a client is expecting UTF-8, it'll display that as an open curly quote. It's only if the client treats the bytes as a different encoding that it'll appear weird. The bytes 0xe2 0x80 0x9c are interpreted as â€œ in Windows-1252, which is what's shown on the EDG page. So I'm pretty sure we're outputting the character as the correct UTF-8 bytes and something is deciding to read it as Windows-1252 or a similar encoding somewhere along the way.
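The byte-level behavior described here can be checked with a few lines of Python. This is a minimal sketch, not code from the EDG or the publishing system; the variable names are illustrative, and `cp1252` is Python's name for the Windows-1252 codec.

```python
# U+201C LEFT DOUBLE QUOTATION MARK, the character discussed in the thread.
open_quote = "\u201c"

# Encoding it as UTF-8 yields the three bytes 0xe2 0x80 0x9c.
utf8_bytes = open_quote.encode("utf-8")

# A consumer that wrongly assumes Windows-1252 decodes each byte separately:
# 0xe2 -> U+00E2 (a with circumflex), 0x80 -> U+20AC (euro sign),
# 0x9c -> U+0153 (oe ligature).
mojibake = utf8_bytes.decode("cp1252")

# If no bytes were lost, the damage can be reversed by re-encoding with the
# wrong codec and decoding with the right one.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "))
print(mojibake == "\u00e2\u20ac\u0153")
print(repaired == open_quote)
```

The round trip only works when every byte maps to a character in Windows-1252. Some bytes, such as 0x9d in the UTF-8 form of the closing quote U+201D, are undefined there, so a repair along these lines is not guaranteed for every string.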

The abstract issue is actually unrelated. We're importing a "citation" field that happens to have the same characters (because the user is using the title of an article for the title of their dataset) from a different ORD system (STICS). Unfortunately they HTML-encode their values in their database, which means the string itself is HTML-encoded even in our JSON. We could try to unencode their strings but so far haven't done so.
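If the imported strings were to be decoded, Python's standard `html.unescape` is one way to do it. The sketch below is an assumption about what such a step could look like; the function name is hypothetical, and the sample entities (`&ldquo;`, `&rdquo;`, `&amp;`) are guesses, since the thread does not show the exact encoding STICS uses.

```python
import html


def unescape_citation(text: str) -> str:
    """Decode HTML character references in an imported citation string.

    Hypothetical helper: the real import code is not shown in the thread.
    """
    return html.unescape(text)


# Assumed sample; the actual entity forms stored by the source system may differ.
sample = "propanoate (&ldquo;GenX&rdquo;) in C57BL/6 Mice. John Wiley &amp; Sons, Ltd."
decoded = unescape_citation(sample)

print(decoded == "propanoate (\u201cGenX\u201d) in C57BL/6 Mice. John Wiley & Sons, Ltd.")
```

Note that `html.unescape` decodes one level only, so a value that was encoded twice (for example `&amp;ldquo;`) would need a second pass.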

From: Hultgren, Torrin [Hultgren.Torrin@epa.gov]
Sent: Tuesday, June 27, 2017 8:14 PM
To: Felsher, Maxwell (CGI Federal)
Cc: Montilla, Alex; Greene, Ana; Holochwost, Bill (CGI Federal); Aguilar, Katherine R (CGI Federal); Lewis, John E (CGI Federal); Sun, Tian (CGI Federal); Petriccione, Nicholas M (CGI Federal); Pierson, Suzanne
Subject: RE: Display of non-ASCII characters in titles on EDG pages

Hi Max,

Those are classic Microsoft-encoded quotes that aren’t really even UTF-8. This page explains more: https://stackoverflow.com/questions/3224427/python-sanitize-a-string-for-unicode

We can try to fix the different appearances across the EDG, but note that it’s not just the title; it’s also problematic in the abstract:

EDG: Raw data file outputs of serum and urine measurements of GenX in dosed rodents. This dataset is associated with the following publication: Rushing, B., Q. Hu, J. Franklin, R. McMahen, S. Dagnino, C. Higgins, M. Strynar, and J. DeWitt. Evaluation of the Immunomodulatory Effects of 2,3,3,3-tetrafluoro-2-(heptafluoropropoxy)-propanoate (“GenX”) in C57BL/6 Mice. ENVIRONMENTAL TOXICOLOGY. John Wiley & Sons, Ltd., Indianapolis, IN, USA, 156(1): 179-189, (2017).

Data.gov: This dataset is associated with the following publication: Rushing, B., Q. Hu, J. Franklin, R. McMahen, S. Dagnino, C. Higgins, M. Strynar, and J. DeWitt. Evaluation of the Immunomodulatory Effects of 2,3,3,3-tetrafluoro-2-(heptafluoropropoxy)-propanoate ([HTML_REMOVED]GenX[HTML_REMOVED]) in C57BL/6 Mice. ENVIRONMENTAL TOXICOLOGY. John Wiley [HTML_REMOVED] Sons, Ltd., Indianapolis, IN, USA, 156(1): 179-189, (2017).

If there’s any way you can sanitize these quotes at the source, I think it’d ensure the best compatibility all the way down the chain.

Torrin
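Sanitizing at the source, as suggested in the message above, could be as simple as mapping typographic quotes to their ASCII equivalents before publishing. The following is a sketch under that assumption; the mapping table and function name are illustrative and do not come from either system in the thread.

```python
# Map common typographic quotes to plain ASCII quotes.
# Illustrative table only; extend it if other characters need the same treatment.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(SMART_QUOTES)


title = "propanoate (\u201cGenX\u201d) in C57BL/6 Mice"
print(straighten_quotes(title))
```

This trades typographic fidelity for compatibility: the output survives consumers that mishandle encodings, but the original characters are lost.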

From: Felsher, Maxwell (CGI Federal) [mailto:maxwell.felsher@cgifederal.com]
Sent: Tuesday, June 27, 2017 10:00 AM
To: Hultgren, Torrin Hultgren.Torrin@epa.gov
Cc: Montilla, Alex Montilla.Alex@epa.gov; Greene, Ana Greene.ana@epa.gov; Holochwost, Bill (CGI Federal) Bill.Holochwost@cgifederal.com; Katie.French@cgifederal.com; Lewis, John E (CGI Federal) john.e.lewis@cgifederal.com; Sun, Tian (CGI Federal) tian.sun@cgifederal.com; Petriccione, Nicholas M (CGI Federal) nicholas.petriccione@cgifederal.com
Subject: Display of non-ASCII characters in titles on EDG pages

Hi Torrin,

I've noticed that at least some non-ASCII characters in titles of datasets aren't displaying properly in some places on their "details" pages. For example, there are some "curly quotes" toward the end of the title at https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=%7BC6AE0507-D98E-45CB-8B8B-B421731BA330%7D. Interestingly, the quotes display correctly in the "Title" field in the "Identification Information" box but not in the heading right above it or the final box a little above the "Open"/"Details"/"Metadata" links. It looks like what should be UTF-8 bytes are being interpreted as a different encoding.

I did a bit of digging to see if the issue was on our end. These characters also display incorrectly in https://pasteur.epa.gov/metadata.json when viewed directly in Firefox and Chrome, and I noticed that the JSON file is being served with a content-type of "application/json" without any charset. It should be possible to add the charset if we need to, and that might fix the browser issue, but the big issue is whether it fixes things for you all. I also discovered that the RFC for JSON says UTF-8 is the default and a "charset" specification is superfluous (https://tools.ietf.org/html/rfc7158#section-11). I think we're OK with stepping outside the spec if it fixes the issue, but we don't want to do it if it doesn't fix it. The fact that it looks OK in one of the fields on the page makes me suspect that it doesn't have anything to do with the content-type we're using.
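The two server-side options touched on here can be sketched in Python: declaring the charset explicitly in the Content-Type header, or emitting ASCII-only JSON so that the consumer's guess about encoding no longer matters. Both function names are hypothetical, and nothing below reflects how metadata.json is actually generated.

```python
import json


def build_utf8_response(payload):
    """Serialize payload as UTF-8 JSON and declare the charset explicitly."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return headers, body


def build_ascii_body(payload):
    """Serialize payload as ASCII-only JSON.

    Non-ASCII characters become \\uXXXX escapes, which any JSON parser
    decodes the same way regardless of the encoding it assumes.
    """
    return json.dumps(payload, ensure_ascii=True).encode("ascii")


record = {"title": "Effects of \u201cGenX\u201d in C57BL/6 Mice"}
headers, utf8_body = build_utf8_response(record)
ascii_body = build_ascii_body(record)

print(headers["Content-Type"])
print(json.loads(utf8_body.decode("utf-8")) == json.loads(ascii_body.decode("ascii")))
```

Both bodies parse to the same data. The ASCII-escaped form is slightly larger but is unaffected by a consumer that misreads the bytes as Windows-1252.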

Let us know if you could use any help with testing this.

Thanks, Max

torrin47 commented 6 years ago

Fixed!