Innovate-Inc / EDG_metadata

EDG metadata on staging created for Innovate-Inc

Can REST output be UTF-8 instead of ISO-8859-1 #76

Closed: torrin47 closed 7 years ago

torrin47 commented 7 years ago

Per the email chain below, it appears that the output of the REST API is returning ISO-8859-1 even though the raw metadata records are being stored as UTF-8, which does funky things with some characters. It's not clear where this encoding switch occurs - is it just the HTTP header setting, or is there some constraint in Java? Is it an easy fix or something major? Let's investigate and/or ask Esri.
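To make the two possibilities concrete, here is a minimal Python sketch of what each would do to the text. It does not call the EDG; the function names are made up for illustration, and which of the two cases actually applies to the REST API is not established in this thread.

```python
def decode_mislabeled(text: str) -> str:
    """Case 1: the server sends UTF-8 bytes, but the header says ISO-8859-1
    and the client trusts the header when decoding."""
    return text.encode("utf-8").decode("iso-8859-1")


def transcode_lossy(text: str) -> str:
    """Case 2: the server really encodes the body as ISO-8859-1.
    Characters outside Latin-1 cannot be represented and become '?'."""
    return text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")


def repair_mislabeled(garbled: str) -> str:
    """Undo case 1 by recovering the original bytes and decoding them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    print(decode_mislabeled("é"))                      # two Latin-1 characters per accented letter
    print(transcode_lossy("5 µg – daily"))             # the en dash is lost
    print(repair_mislabeled(decode_mislabeled("é")))   # round-trips back to the original
```

Case 1 is reversible on the client side, while case 2 loses information before the response leaves the server, so a test record containing a character outside Latin-1 (an en dash or curly quote, for example) would help tell them apart.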

From: Hultgren, Torrin [mailto:Hultgren.Torrin@epa.gov]
Sent: Wednesday, March 08, 2017 4:38 PM
To: Felsher, Maxwell (CGI Federal)
Cc: Greene, Ana; Suma Malothu; spierson@innovateteam.com; Harness, Catherine
Subject: RE: Character encoding of DCAT output?

Hi Max,

I can’t think of any reason the charset of the response should be restricted to ISO-8859-1 rather than the full domain of UTF-8, and it only seems to be applying to the REST API (https://edg.epa.gov/metadata/rest) rather than other URLs. I believe we should be able to fix it, but would you mind sharing an example of one of your records that had an encoding issue that we can use for testing?

The approach you’re working with is fine – it’s conducting a full search across all indexed fields, but seems to respond very quickly. To limit the search to just the fileIdentifier field, you could use this syntax: https://edg.epa.gov/metadata/rest/find/document?f=dcat&searchText=fileIdentifier:A-280j-22 but if there’s a performance improvement, it’s all but impossible to tell. But actually, if all you’re looking for is a way to directly reference your own records, you may also use your own identifiers – the EDG will respect them: https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=A-280j-22 Might simplify things on your end?

Torrin

From: Felsher, Maxwell (CGI Federal) [mailto:maxwell.felsher@cgifederal.com]
Sent: Wednesday, March 08, 2017 2:55 PM
To: Hultgren, Torrin <Hultgren.Torrin@epa.gov>
Cc: Greene, Ana <Greene.ana@epa.gov>
Subject: Character encoding of DCAT output?

Hi Torrin,

We were trying to search for some EDG records in the DCAT JSON-LD format (e.g., https://edg.epa.gov/metadata/rest/find/document?searchText=A-280j-22&f=dcat), and we ran into an issue with character encoding; our code was assuming it was in UTF-8, but now we see that the HTTP response specifies ISO-8859-1 in the Content-Type header. We’re fixing our code to not assume UTF-8, but I was wondering whether it was intentional to use ISO-8859-1?
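For reference, a client-side fix along the lines Max describes might look like the Python sketch below: read the charset from the Content-Type header rather than assuming UTF-8. The helper names and the header values in the usage lines are assumptions for illustration; the exact Content-Type string the EDG returns is not quoted in this thread.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type header value,
    or `default` when the header does not declare one."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode raw response bytes using the charset the server declared."""
    return body.decode(charset_from_content_type(content_type))


if __name__ == "__main__":
    # Hypothetical header value and body, not captured from the EDG.
    header = "application/json;charset=ISO-8859-1"
    print(charset_from_content_type(header))
    print(decode_body("Québec".encode("iso-8859-1"), header))
```

This only helps if the declared charset matches the bytes actually sent. If the body is really UTF-8 under an ISO-8859-1 label, trusting the header produces the garbling described in this issue.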

(As an aside, we’re doing this in order to be able to retrieve the corresponding EDG URL for a particular dataset we put in our metadata. We search for our identifiers using URLs like the above and then parse the response and extract the landingPage property. That was the best option we could figure out, but if you have other suggestions, let us know.)
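The lookup described in that aside can be sketched in Python as follows. The endpoint and the `f` and `searchText` parameters come from the URLs in this thread; the response shape (a top-level `dataset` list whose entries carry `identifier` and `landingPage`) is an assumption based on the general DCAT JSON layout, and the sample document is invented for illustration rather than taken from an EDG response.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the URLs quoted in this thread.
FIND_DOCUMENT_URL = "https://edg.epa.gov/metadata/rest/find/document"


def build_search_url(identifier, field=None):
    """Build a DCAT search URL; pass field="fileIdentifier" to limit the search."""
    term = f"{field}:{identifier}" if field else identifier
    return f"{FIND_DOCUMENT_URL}?{urlencode({'f': 'dcat', 'searchText': term})}"


def landing_pages(dcat_text, identifier):
    """Return landingPage values of datasets whose identifier matches exactly.

    Assumes a top-level "dataset" list (hypothetical for the EDG response).
    """
    catalog = json.loads(dcat_text)
    return [
        d["landingPage"]
        for d in catalog.get("dataset", [])
        if d.get("identifier") == identifier and "landingPage" in d
    ]


if __name__ == "__main__":
    # Invented sample response, for illustration only.
    sample = json.dumps({
        "dataset": [
            {"identifier": "A-280j-22",
             "landingPage": "https://edg.epa.gov/metadata/catalog/search/resource/details.page?uuid=A-280j-22"},
            {"identifier": "something-else", "landingPage": "https://example.invalid/other"},
        ]
    })
    print(build_search_url("A-280j-22"))
    print(landing_pages(sample, "A-280j-22"))
```

Matching on the identifier after parsing guards against the full-text search returning unrelated records that merely mention the same string.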

Best,
Max Felsher
Consultant, CGI Federal
Contractor to ORD (ScienceHub team)

torrin47 commented 7 years ago

This issue was moved to USEPA/EPA_Environmental_Dataset_Gateway#6