CDRH / earlywashingtondc

OSCYS Rails site
http://earlywashingtondc.org

Case file character encoding #221

Open techgique opened 6 years ago

techgique commented 6 years ago

Rails started throwing the error: incompatible character encodings: ASCII-8BIT and UTF-8 for some case file pages.

E.g.

Karin removed some text from the generated HTML file and the page worked again, but we haven't identified the offending character yet.

For now I applied a quick-and-dirty fix from Stack Overflow (https://stackoverflow.com/a/9278713) in https://github.com/CDRH/earlywashingtondc/commit/292e018c1a4ef549d3256c0f14399adee9b737af
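For anyone landing here later, a minimal sketch of how this error arises in plain Ruby, and what a `force_encoding`-style workaround generally looks like. This is an illustration under assumptions, not the code in the commit above: the helper name `as_utf8` and the sample strings are made up, and it assumes the bytes in the generated HTML really are UTF-8.

```ruby
# Bytes of "summons" plus an em dash (U+2014), tagged as binary (ASCII-8BIT).
# A string can end up like this after a binary-mode file read.
binary_html = "summons\xE2\x80\x94".b

# A UTF-8 string that also contains a non-ASCII character.
utf8_template = "Caf\u00E9: "

begin
  # Both sides hold non-ASCII bytes under different encodings, so this raises.
  utf8_template + binary_html
rescue Encoding::CompatibilityError => e
  puts "raised: #{e.class}"
end

# Hypothetical helper: relabel a copy of the bytes as UTF-8 without changing them.
def as_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

combined = utf8_template + as_utf8(binary_html)
puts combined.encoding
```

Relabeling only hides the problem if the bytes are not actually UTF-8, which is why the sketch checks `valid_encoding?` before returning.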

karindalziel commented 6 years ago

I have identified the character causing the issue in oscys.caseid.0105 (an em dash):

https://cdrhdev1.unl.edu/earlywashingtondc/cases/oscys.caseid.0105

That doesn't really help, since we can't remove all em dashes.

So, idea #1: is there a way to change the HTML creation script to explicitly set the encoding to UTF-8? (Maybe this would be useful, though I'm not quite sure how to implement it: https://gist.github.com/arpith20/4fcf7682a9154bc777dfcd2199edecf4)

If that does not work, idea #2 will be to re-encode special characters as HTML entities, but I am hoping not to have to do that.
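A rough sketch of what idea #2 could look like in Ruby, in case it comes to that. The method name `encode_non_ascii` is hypothetical, and it assumes the input string is valid UTF-8; it replaces every non-ASCII character with a decimal numeric character reference so the output is pure ASCII.

```ruby
# Hypothetical helper: replace each non-ASCII character with a decimal
# numeric character reference, e.g. U+2014 becomes "&#8212;".
# Assumes `text` is a valid UTF-8 string.
def encode_non_ascii(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

puts encode_non_ascii("summons\u2014")
```

An ASCII-only string is compatible with both UTF-8 and ASCII-8BIT strings, which is why this would sidestep the error, at the cost of rewriting the content.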

Update: I checked the XSLT file creating the HTML for oscys (scripts/tei.p5) and I think it is setting encoding correctly:

<xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>

The XML files also correctly set the encoding as UTF-8, though it is possible that the original file is using a non-UTF-8 encoding of the em dash.
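If the XSLT output really is UTF-8, the other place an encoding can be set explicitly is wherever the generated HTML file is read back in. The sketch below is only an assumption about where the problem might sit, not a description of how this app loads case files; `read_html_utf8` and the sample file name are hypothetical.

```ruby
require "tmpdir"

# Hypothetical helper: read a file as UTF-8 and fail loudly if the bytes
# are not valid UTF-8, instead of failing later during rendering.
def read_html_utf8(path)
  html = File.read(path, encoding: "UTF-8")
  raise ArgumentError, "#{path} is not valid UTF-8" unless html.valid_encoding?
  html
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample_case.html")
  File.binwrite(path, "<p>summons\u2014</p>")

  # A binary read tags the string as ASCII-8BIT (shown as BINARY in newer Rubies).
  puts File.binread(path).encoding
  # An explicit encoding tags the same bytes as UTF-8.
  puts read_html_utf8(path).encoding
end
```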

karindalziel commented 6 years ago

A little more investigation:

I opened the file the HTML is transformed from, and the encoding of the em dash looks like this:

summons&#8212;

I believe 8212 is the decimal HTML character reference for the em dash, but it doesn't work if I change it to &#2014; either. So, we're back to having to try one of the ideas above.
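A quick check of the numbers involved, as a sketch: decimal 8212 and hexadecimal 2014 are the same code point, so the hex form of the reference needs an `x` (`&#x2014;`), while `&#2014;` is read as decimal 2014, which is a different character (U+07DE). The helper `decode_numeric_refs` is hypothetical and only handles numeric references, not named ones.

```ruby
# Decimal 8212 and hex 0x2014 name the same code point.
puts 8212 == 0x2014
puts format("U+%04X", 8212)
# Decimal 2014 is a different code point entirely.
puts format("U+%04X", 2014)

# Hypothetical helper: decode decimal (&#8212;) and hex (&#x2014;) numeric
# character references into UTF-8 characters.
def decode_numeric_refs(text)
  text.gsub(/&#(?:x(\h+)|(\d+));/i) do
    code = $1 ? $1.to_i(16) : $2.to_i
    [code].pack("U")
  end
end

decoded = decode_numeric_refs("summons&#8212;")
puts decoded.bytes.map { |byte| format("%02X", byte) }.join(" ")
```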

kacinash commented 6 years ago

Should we just change all of them to two minuses (--)? Any idea why it only seems to be a problem on the case files and not the documents?

karindalziel commented 5 years ago

I'm not sure if this is still an issue, but it would be good to find out.

jduss4 commented 5 years ago

@kacinash do you know if this is resolved, or if you have a workaround?

kacinash commented 5 years ago

I don't know. I'm not sure how to replicate the steps Greg took that produced the error.

techgique commented 5 years ago

I think we'd have to revert the change I added in https://github.com/CDRH/earlywashingtondc/commit/292e018c1a4ef549d3256c0f14399adee9b737af and review pages with the suspect characters.
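To narrow down which pages to review, something like the following could list generated HTML files that contain non-ASCII characters or invalid UTF-8. It is a sketch under assumptions: the method names are hypothetical, and the directory in the example call is a placeholder rather than a real path in this repo.

```ruby
# Hypothetical helper: report whether the bytes are valid UTF-8 and which
# non-ASCII code points they contain.
def non_ascii_report(bytes)
  text = bytes.dup.force_encoding("UTF-8")
  return { valid_utf8: false, code_points: [] } unless text.valid_encoding?

  suspects = text.each_char.reject(&:ascii_only?).uniq
  { valid_utf8: true, code_points: suspects.map { |char| format("U+%04X", char.ord) } }
end

# Hypothetical helper: scan a directory of .html files and keep only those
# that are invalid UTF-8 or contain at least one non-ASCII character.
def scan_html_dir(html_dir)
  Dir.glob(File.join(html_dir, "*.html")).sort.each_with_object({}) do |path, found|
    report = non_ascii_report(File.binread(path))
    next if report[:valid_utf8] && report[:code_points].empty?

    found[File.basename(path)] = report
  end
end

# Example call (the directory is a placeholder, not a path from this repo):
# scan_html_dir("path/to/generated/html").each { |name, report| puts "#{name}: #{report}" }
```

Files that come back with `valid_utf8: false` would point at a source encoding problem, while files that only list code points such as U+2014 are valid UTF-8 and would point back at how the app reads or concatenates them.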