WGBH-MLA / openvault3

Apostrophes not coming through in Vietnam summaries #168

Open mccalluc opened 8 years ago

mccalluc commented 8 years ago

For example: /catalog/V_2758B13C7014459498315BDF568EC3BC

He describes U.S. presidents’ different stances toward Vietnam

Character encoding issue. See if the source data is correct, and possibly pass to Kevin, or maybe do an ad hoc fix on this end.

Muraszko-wgbh commented 8 years ago

@mccalluc I'll investigate.

foo4thought commented 8 years ago

I tested a workflow that ran data through TextWrangler, using its "zap gremlins" function. This would be done to the XML payload bound for the ingestion process using Chuck's code, with options to process (or not) these things:

- Non-ASCII characters
- Control characters
- NULL (ASCII 0) characters

to become either ASCII equivalent or HTML entity, e.g., Ò

Muraszko-wgbh commented 8 years ago

@foo4thought Can you do this to the data and have the corrections make their way back to Filemaker itself? I'd rather the workflow be able to go from Database -> export -> server

mccalluc commented 8 years ago

Before we get too far into this, I can't swear that I'm not getting it wrong on my end. If it is upstream, I'd be a little concerned about:

to become either ASCII equivalent or HTML entity, e.g., Ò

That might not be sufficiently general: Consider a Vietnamese character with diacritics: Not going to have a named entity, I don't think, but we don't want to just strip diacritics either.
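To illustrate the concern, here is a minimal sketch (not part of the openvault3 code; the function name `to_numeric_entities` is made up for this example). It assumes that if entities are used at all, numeric character references are the general option, since they cover any code point, including Vietnamese letters that have no named entity:

```python
import html


def to_numeric_entities(text: str) -> str:
    """Replace each non-ASCII character with a decimal character reference."""
    # 'xmlcharrefreplace' is a standard Python codec error handler.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "Ng\u00f4 \u0110\u00ecnh Di\u1ec7m"  # "Ngô Đình Diệm"
escaped = to_numeric_entities(sample)

# The escaped form is pure ASCII, and unescaping restores the diacritics.
assert escaped.isascii()
assert html.unescape(escaped) == sample
```

Nothing is stripped here: the diacritics survive the round trip, which a named-entity or ASCII-equivalent mapping would not guarantee.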

Muraszko-wgbh commented 8 years ago

Well, the character is still present but not displaying on the current OV record page. Is it showing up on the actual page in the staging server? It doesn't show up in the Filemaker field for that record, only when pasting into a text editor.

Should we still try and process the actual encoding issues?

mccalluc commented 8 years ago

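Here's a better diagnosis: The character in question is U+0092. Byte x92 in Windows-1252 is the curly apostrophe (U+2019). Most of Windows-1252 is identical to Latin-1, and Latin-1 in turn maps byte-for-byte onto the first 256 Unicode code points... but x80-x9F in 1252 is an exception: In Latin-1 and in Unicode those points are C1 control characters, not printable ones (U+0092 happens to be the control named "Private Use Two"), so they generally have no visible glyph.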
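A small sketch of that diagnosis, using only the Python standard library. The byte string is a made-up fragment modeled on the example summary, not the actual export:

```python
import unicodedata

# Hypothetical fragment: byte 0x92 where the apostrophe should be.
raw = b"presidents\x92 different stances"

as_latin1 = raw.decode("latin-1")  # every byte maps to the same code point
as_cp1252 = raw.decode("cp1252")   # 0x92 maps to a curly apostrophe

bad = as_latin1[10]
good = as_cp1252[10]

assert bad == "\u0092"
assert unicodedata.category(bad) == "Cc"  # a control character
assert good == "\u2019"
assert unicodedata.name(good) == "RIGHT SINGLE QUOTATION MARK"
```

So data written as Windows-1252 but read as Latin-1 ends up with a control character in place of the apostrophe.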

@Muraszko : just do a search and replace in FM? Might also look for other characters in this range while you're at it? ... but if that's hard for some reason, not impossible to do ad hoc cleaning downstream, though it wouldn't be my preference.
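If downstream cleaning did turn out to be necessary, it could look roughly like this. This is a sketch only, not the ingest code; `repair_c1` is a hypothetical helper, and it assumes every character in U+0080 to U+009F is really a misread Windows-1252 byte:

```python
import re

# Matches the C1 range U+0080..U+009F.
C1_RANGE = re.compile("[\u0080-\u009f]")


def _as_cp1252(match: re.Match) -> str:
    ch = match.group(0)
    try:
        # Reinterpret the code point as a Windows-1252 byte.
        return bytes([ord(ch)]).decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are undefined in Python's cp1252 codec;
        # leave those characters untouched.
        return ch


def repair_c1(text: str) -> str:
    """Replace C1 control characters with their Windows-1252 meaning."""
    return C1_RANGE.sub(_as_cp1252, text)


assert repair_c1("presidents\u0092 different") == "presidents\u2019 different"
```

Characters outside that range, including Vietnamese letters with diacritics, pass through unchanged.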

Muraszko-wgbh commented 8 years ago

@mccalluc Thanks Chuck. I agree, clean it at the source.

mccalluc commented 8 years ago

@Muraszko : If you want another example: V_9E5411CEBD4447CE9D9DACFCAFAC0667

protesting against the forced labor of their husbands in Ngo Dinh Diem’s agroville or “Strategic Hamlet” program

There is an x93 / x94 pair on either side of "Strategic Hamlet".
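For finding other records with the same problem, something along these lines might help. It is a sketch run against an exported text string; `find_c1` is a hypothetical name, and the sample string is modeled on the summary above rather than copied from the database:

```python
def find_c1(text: str) -> list[tuple[int, str]]:
    """Return (index, hex label) for each character in U+0080..U+009F."""
    return [
        (i, f"x{ord(ch):02X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]


sample = "or \u0093Strategic Hamlet\u0094 program"
hits = find_c1(sample)
assert hits == [(3, "x93"), (20, "x94")]
```

An empty list means the text has no characters in the problem range.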

Muraszko-wgbh commented 8 years ago

Thanks Chuck. I'll try and correct them in Filemaker.