Closed by ericaryan 5 years ago
A few more tricky-looking ones that the scraper might already handle:
SHAILENDRA KUMAR, DOCTOR, URDULA HOSPITAL (through translator):
P.C. VYAS, CHIEF ENGINEER, STATUE OF UNITY PROJECT (via translator)
[02:50:05] TEDROS ADHANOM GHEBREYESUS, DIRECTOR-GENERAL, WORLD HEALTH ORGANIZATION
Both from http://transcripts.cnn.com/TRANSCRIPTS/1810/31/cnr.19.html. We've got multiple commas, parentheticals, a timestamp, and two translator prefixes ("through" and "via").
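In case it helps when testing these, here is a minimal Python sketch of how labels like the ones above could be split into fields. It is not the scraper's actual code: `LABEL_RE` and `parse_speaker_label` are hypothetical names, and it assumes the label has already been separated from the speech that follows it.

```python
import re

# Hypothetical pattern (an assumption, not the scraper's real one):
# optional [HH:MM:SS] timestamp, the label text, an optional trailing
# parenthetical, and an optional colon.
LABEL_RE = re.compile(
    r"""
    ^\s*
    (?:\[(?P<timestamp>\d{2}:\d{2}:\d{2})\]\s*)?   # e.g. [02:50:05]
    (?P<label>[^()\[\]:]+?)                        # name plus comma-separated titles
    \s*
    (?:\((?P<note>[^)]*)\))?                       # e.g. (through translator)
    \s*:?\s*$
    """,
    re.VERBOSE,
)

# Matches both "through translator" and "via translator".
TRANSLATOR_RE = re.compile(r"\b(?:through|via)\s+(?:a\s+)?translator\b", re.IGNORECASE)


def parse_speaker_label(text):
    """Split a speaker label into timestamp, name, affiliation, and note."""
    m = LABEL_RE.match(text)
    if m is None:
        return None
    # Only the first comma separates the name; later commas stay in the affiliation.
    name, _, affiliation = m.group("label").partition(",")
    note = m.group("note")
    return {
        "timestamp": m.group("timestamp"),
        "name": name.strip(),
        "affiliation": affiliation.strip() or None,
        "note": note,
        "translated": bool(note and TRANSLATOR_RE.search(note)),
    }
```

Splitting on the first comma only is a deliberate choice here: it keeps "DOCTOR, URDULA HOSPITAL" together as one affiliation string instead of guessing which part is the title.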
Example from Tech & Check Alerts: CNN 08/31/19
JOHANNA HAMILTON on CNN'S AMANPOUR (CNN): And then most of them are between $100 and $500 million each fund and there are sometimes multiple funds that are raised over a few years.
http://transcripts.cnn.com/TRANSCRIPTS/1908/29/ampr.01.html
Should be: ARLAN HAMILTON, FOUNDER AND MANAGING PARTNER, BACKSTAGE CAPITAL
Not: JOHANNA HAMILTON, DIRECTOR, "THE TRIAL" (She was the first HAMILTON in the transcript)
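One possible heuristic for this case, sketched in Python: resolve a bare surname to the most recently introduced speaker with that surname instead of the first one in the transcript. This is only an illustration. `resolve_speakers` is a hypothetical helper, it assumes full introductions contain a comma (name, then title) while later labels are surname-only, and I haven't checked that "most recent" is right for every transcript.

```python
def resolve_speakers(labels):
    """Map each label to a full introduction.

    Bare surnames resolve to the MOST RECENT full introduction with that
    surname (an assumed heuristic), not the first one seen.
    """
    latest_by_surname = {}
    resolved = []
    for label in labels:
        name = label.split(",", 1)[0].strip()
        if "," in label:
            # Full introduction: remember it under the speaker's surname,
            # overwriting any earlier speaker with the same surname.
            latest_by_surname[name.split()[-1]] = label
            resolved.append(label)
        else:
            # Bare surname: fall back to the label itself if never introduced.
            resolved.append(latest_by_surname.get(name, label))
    return resolved
```

With the two introductions above in transcript order, a bare `HAMILTON` that appears after the second introduction would resolve to ARLAN HAMILTON, and one that appears between the two would still resolve to JOHANNA HAMILTON.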
Over in #167 we're tracking attribution errors that have been noted in the wild with the new scraper, so I think this issue has served its porpoise.
Per some Slack convo, gonna re-open this since it's a different / higher class of problems than other attribution cleanup.
Hoo boy, apologies to historians having to follow along here, but this actually does belong in another issue, just not #167. It belongs in #65. Done and done.
Here are a couple of quirks with CNN transcripts that have tripped up our system in the past (which Asa may have implemented fixes for, and which you may be handling fine already, but just in case it's helpful...)
Issues with parentheticals after source name:
JOE LHOTA, MTA CHAIRMAN (ph): http://transcripts.cnn.com/TRANSCRIPTS/1712/11/cnr.02.html
LAH (voice-over): http://transcripts.cnn.com/TRANSCRIPTS/1712/12/cnr.02.html
Parenthetical attribution for “(INAUDIBLE)”: http://transcripts.cnn.com/TRANSCRIPTS/1802/02/cnr.20.html
It looks like short last names have also caused issues:
SY: http://transcripts.cnn.com/TRANSCRIPTS/1801/30/qmb.01.html
RAI: http://transcripts.cnn.com/TRANSCRIPTS/1802/24/vssg.01.html
I can dig up the associated emails where the attributions were messed up, if you need them!
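For reference, a rough Python sketch of a line splitter that would cover the quirks listed above: a parenthetical between the name and the colon, two- or three-letter surnames, and a line that opens with a parenthetical such as "(INAUDIBLE)" and so has no speaker. `LINE_RE` and `split_line` are hypothetical names, and the pattern is an assumption about how labels look, not the system's actual rule.

```python
import re

# Hypothetical pattern: an all-caps label (commas allowed for titles, no
# minimum length), an optional parenthetical, a colon, then the speech.
LINE_RE = re.compile(
    r"^(?P<label>[A-Z][A-Z0-9 .,'\"&-]*?)"   # e.g. SY, LAH, JOE LHOTA, MTA CHAIRMAN
    r"\s*(?:\((?P<note>[^)]*)\))?"           # e.g. (ph), (voice-over)
    r"\s*:\s*(?P<speech>.*)$"
)


def split_line(line):
    """Return (label, note, speech), or None if the line has no speaker label."""
    m = LINE_RE.match(line.strip())
    if m is None:
        # Covers lines that start with a parenthetical like "(INAUDIBLE)",
        # which should not be treated as an attribution.
        return None
    return m.group("label"), m.group("note"), m.group("speech")
```

The label is matched lazily up to the first colon, so a later colon inside the speech (a time of day, say) stays in the speech rather than extending the label.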