BowdoinOrient / bonus

Bowdoin Orient Network Update System v2, the bowdoinorient.com frontend and backend since August 2012.
http://bowdoinorient.com/

Character encoding issues with articles from previous-gen site #118

Open bjacobel opened 9 years ago

bjacobel commented 9 years ago

Example here. Fairly common with pre-2010 articles.

This may be unfixable. Have to look and see if they're actually saved that way in the DB, or if this is just a presentation layer thing.

bjacobel commented 9 years ago

May be related to #21

tophtucker commented 9 years ago

i think this is basically unfixable at this point. even by the time i first saw the db, they were all straight-up ascii question marks. if the browser were just rendering special chars as ?s we'd be in better shape but that's not it. :-/

i flirted with whipping up a data cleaning mini-app - i think a simple regex could spot many of the bad ?s and then a human could choose from likely possible corrections - prob not worth it, idk
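A rough sketch of what that "spot the bad ?s, let a human pick" pass might look like, in Python. Everything here is hypothetical: the `SUSPECT` pattern, the `find_suspects` name, and the candidate lists are illustrative guesses, not anything from the bonus codebase or the actual DB contents.

```python
import re

# Hypothetical heuristic: a "?" is suspicious when it is glued to the start
# of a word (likely a lost opening quote) or sits between two word
# characters (likely a lost apostrophe or accented letter).
SUSPECT = re.compile(r"(?<!\S)\?(?=\w)|(?<=\w)\?(?=\w)")


def find_suspects(text):
    """Return (offset, context, candidates) for each suspicious '?'."""
    results = []
    for match in SUSPECT.finditer(text):
        i = match.start()
        before = text[i - 1] if i > 0 else ""
        if before.isalnum():
            # Mid-word: guesses only; a human reviewer makes the final call.
            candidates = ["’", "é", "–"]
        else:
            # Start of a word: probably an opening quote of some kind.
            candidates = ["‘", "“"]
        context = text[max(0, i - 15): i + 16]
        results.append((i, context, candidates))
    return results


if __name__ == "__main__":
    sample = "She said ?hello? and it?s fine. Really?"
    for offset, context, candidates in find_suspects(sample):
        print(offset, repr(context), candidates)
```

A "?" that ends a word (like the one in `Really?`) is deliberately left out, since it can't be told apart from a real question mark without a person looking at it.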


bjacobel commented 9 years ago

s/\?(\b\w+\b)/‘$1/g and s/(\b\w+\b)\?/$1’/g would fix a large number of the issues, probably.

bjacobel commented 9 years ago

alternately: s/\?(\b\w+\b)\?/“$1”/g

I love regex golf 💃
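For anyone who wants to try these substitutions outside of sed/perl, here is a minimal Python sketch in the same spirit. The `restore_quotes` name, the ordering of the rules, and the lookaround guards are assumptions added for this sketch, not part of the regexes above; the guards are there so a genuine sentence-ending `?` is left alone.

```python
import re


def restore_quotes(text):
    """Best-effort repair of '?' characters that replaced curly quotes."""
    # 1. Paired: ?word? -> “word” (run first so both marks are handled together).
    text = re.sub(r"(?<!\w)\?(\w+)\?(?!\w)", r"“\1”", text)
    # 2. Mid-word: it?s -> it’s (assumed to be a lost apostrophe).
    text = re.sub(r"(?<=\w)\?(?=\w)", "’", text)
    # 3. Leading: ?Tis -> ‘Tis (assumed to be a lost opening single quote).
    text = re.sub(r"(?<!\w)\?(?=\w)", "‘", text)
    # A "?" at the end of a word is ambiguous (closing quote vs. real
    # question mark), so this sketch does not touch it.
    return text


if __name__ == "__main__":
    print(restore_quotes("She said ?hello? and it?s fine. Really?"))
```

This only guesses at quotes and apostrophes. Anything else that was flattened to `?` (accented letters, dashes) would still need a human to sort out.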