Open bjacobel opened 9 years ago
May be related to #21
i think this is basically unfixable at this point. even by the time i first saw the db, they were all straight-up ascii question marks. if the browser were just rendering special chars as ?s we'd be in better shape but that's not it. :-/
i flirted with whipping up a data cleaning mini-app - i think a simple regex could spot many of the bad ?s and then a human could choose from likely possible corrections - prob not worth it, idk
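For what it's worth, the "regex spots it, human picks the fix" idea could be sketched roughly like this. Everything here is an assumption for illustration: the three heuristics, the candidate lists, and the names `SUSPECT_PATTERNS`, `find_suspects`, and `apply_choice` are hypothetical, and nothing is tied to the actual schema.

```python
import re

# Hypothetical heuristics: each pattern flags a "?" that is unlikely to be a
# real question mark, paired with characters a reviewer could choose from.
SUSPECT_PATTERNS = [
    # "?" between two word characters, e.g. "don?t": probably an apostrophe
    (re.compile(r"(?<=\w)\?(?=\w)"), ["’"]),
    # "?" at the start of a word, e.g. "?hello": probably an opening quote
    (re.compile(r"(?<!\S)\?(?=\w)"), ["“", "‘"]),
    # "?" right after other punctuation, e.g. "go.?": probably a closing quote
    (re.compile(r"(?<=[.,!])\?"), ["”", "’"]),
]


def find_suspects(text):
    """Return (offset, candidates) for each '?' that looks mangled."""
    hits = []
    for pattern, candidates in SUSPECT_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), candidates))
    return sorted(hits, key=lambda hit: hit[0])


def apply_choice(text, offset, replacement):
    """Swap the single '?' at offset for the reviewer's chosen character."""
    return text[:offset] + replacement + text[offset + 1:]
```

A plain `word?` followed by a space is deliberately not flagged, since it can't be told apart from a real question. Each fix swaps one character for one character, so offsets from `find_suspects` stay valid while choices are applied.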
s/\?(\b\w+\b)/‘$1/g
and s/(\b\w+\b)\?/$1’/g
would fix a large number of the issues, probably.
alternately: s/\?(\b\w+\b)\?/“$1”/g
I love regex golf 💃
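A rough Python take on the substitutions above, as a sketch only. It assumes the paired rule should run first so `?word?` isn't eaten by the single-sided rules, and it swaps the trailing `word?` rule for a mid-word apostrophe rule, because a trailing `?` can't be distinguished from a real question mark. `guess_quotes` is a hypothetical name.

```python
import re


def guess_quotes(text):
    """Best-effort guess at curly quotes that were flattened to '?'."""
    # Paired case first: ?word? -> “word”
    text = re.sub(r"\?(\b\w+\b)\?", r"“\1”", text)
    # Leading case: ?word at the start of a token -> ‘word
    text = re.sub(r"(?<!\S)\?(\w+)", r"‘\1", text)
    # Mid-word case (assumption, replaces the trailing rule): don?t -> don’t
    text = re.sub(r"(?<=\w)\?(?=\w)", "’", text)
    return text
```

These are heuristics and will misfire on some inputs (for example `can?t?`), so the output would still want a human pass before anything is written back to the DB.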
Example here. Fairly common with pre-2010 articles.
This may be unfixable. Have to look and see if they're actually saved that way in the DB, or if this is just a presentation layer thing.