TechAndCheck / tech-and-check-alerts

Daily tip sheet for fact checkers
MIT License

Double-check handling of attributions for CNN transcript quirks #71

ericaryan closed 5 years ago

ericaryan commented 5 years ago

Here are a couple of quirks with CNN transcripts that have tripped up our system in the past (which Asa may have implemented fixes for, and which you may be handling fine already, but just in case it's helpful...)

Issues with parentheticals after source name:

JOE LHOTA, MTA CHAIRMAN (ph): http://transcripts.cnn.com/TRANSCRIPTS/1712/11/cnr.02.html

LAH (voice-over): http://transcripts.cnn.com/TRANSCRIPTS/1712/12/cnr.02.html

Parenthetical attribution for “(INAUDIBLE)”: http://transcripts.cnn.com/TRANSCRIPTS/1802/02/cnr.20.html

It looks like short last names have also caused issues:

SY: http://transcripts.cnn.com/TRANSCRIPTS/1801/30/qmb.01.html

RAI: http://transcripts.cnn.com/TRANSCRIPTS/1802/24/vssg.01.html

I can dig up the associated emails where the attributions were messed up, if you need them!
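
For anyone following along, here is a minimal sketch of how the quirks listed above could be handled with a single pattern. It is not the project's scraper code: `ATTRIBUTION` and `parse_attribution` are hypothetical names, and the sketch assumes an attribution is an all-caps label at the start of a line, optionally followed by parentheticals, then a colon.

```python
import re

# Hypothetical pattern, not taken from the project's scraper.
# label: all-caps name plus optional affiliation (any length, so "SY" is fine)
# notes: zero or more trailing parentheticals such as "(ph)" or "(voice-over)"
ATTRIBUTION = re.compile(
    r"""^(?P<label>[A-Z][A-Z0-9 .,'"&/-]*?)"""
    r"""(?P<notes>(?:\s*\([^)]*\))*)"""
    r""":\s*(?P<speech>.*)$"""
)


def parse_attribution(line):
    """Split a transcript line into speaker, affiliation, notes, and speech.

    Returns None when the line does not start with an all-caps label, which
    also covers lines that begin with a parenthetical like "(INAUDIBLE):".
    """
    match = ATTRIBUTION.match(line.strip())
    if match is None:
        return None
    name, _, affiliation = match.group("label").strip().partition(",")
    return {
        "name": name.strip(),
        "affiliation": affiliation.strip(),
        "notes": re.findall(r"\(([^)]*)\)", match.group("notes")),
        "speech": match.group("speech"),
    }
```

With this sketch, `parse_attribution("JOE LHOTA, MTA CHAIRMAN (ph): ...")` yields the name `JOE LHOTA` with `ph` as a note rather than as part of the name, and two-letter labels such as `SY` are accepted because the pattern sets no minimum length.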

reefdog commented 5 years ago

A few more tricky-looking ones that the scraper might already handle:

SHAILENDRA KUMAR, DOCTOR, URDULA HOSPITAL (through translator):

P.C. VYAS, CHIEF ENGINEER, STATUE OF UNITY PROJECT (via translator)

[02:50:05] TEDROS ADHANOM GHEBREYESUS, DIRECTOR-GENERAL, WORLD HEALTH ORGANIZATION

Both from http://transcripts.cnn.com/TRANSCRIPTS/1810/31/cnr.19.html. We've got multiple commas, parentheticals, a timestamp, and two translator prefixes ("through" and "via").
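
A rough sketch of how these labels might be normalized, in case it helps. Everything here is an assumption rather than the scraper's actual behavior: `normalize_label` is a hypothetical helper, timestamps are assumed to look like `[hh:mm:ss]` at the start of the label, and only the two translator phrasings seen above ("through" and "via") are recognized.

```python
import re

# Hypothetical helpers; none of these names come from the project.
TIMESTAMP = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]\s*")
TRAILING_NOTE = re.compile(r"\s*\(([^)]*)\)\s*:?\s*$")
TRANSLATED = re.compile(r"^(?:through|via)\s+translator$", re.IGNORECASE)


def normalize_label(raw):
    """Strip a leading timestamp and trailing parentheticals from a label."""
    label = TIMESTAMP.sub("", raw.strip())
    notes = []
    # Peel parentheticals off the end one at a time, keeping their order.
    while True:
        match = TRAILING_NOTE.search(label)
        if match is None:
            break
        notes.insert(0, match.group(1).strip())
        label = label[:match.start()]
    label = label.rstrip(": ").strip()
    # The first comma-separated part is the name; the rest are titles.
    parts = [part.strip() for part in label.split(",")]
    return {
        "name": parts[0],
        "titles": parts[1:],
        "translated": any(TRANSLATED.match(note) for note in notes),
    }
```

Splitting on every comma keeps `DOCTOR` and `URDULA HOSPITAL` as separate titles instead of letting the second comma confuse the name, and the translator flag is kept as data rather than discarded with the parenthetical.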

ericaryan commented 5 years ago

Example from Tech & Check Alerts: CNN 08/31/19

JOHANNA HAMILTON on CNN'S AMANPOUR (CNN): And then most of them are between $100 and $500 million each fund and there are sometimes multiple funds that are raised over a few years.

http://transcripts.cnn.com/TRANSCRIPTS/1908/29/ampr.01.html

Should be: ARLAN HAMILTON, FOUNDER AND MANAGING PARTNER, BACKSTAGE CAPITAL

Not: JOHANNA HAMILTON, DIRECTOR, "THE TRIAL" (She was the first HAMILTON in the transcript)
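
One way to avoid this kind of mix-up, sketched under some assumptions: resolve a bare surname to the most recent full attribution that shares it, rather than the first one in the transcript. `resolve_speakers` is a hypothetical helper, not the project's code, and the label order in the usage example is invented for illustration (it assumes the bare `HAMILTON` label appeared after Arlan Hamilton was introduced).

```python
def resolve_speakers(labels):
    """Expand bare surnames to the most recent matching full attribution.

    A label counts as "full" if it has a comma (affiliation) or more than
    one word in the name; anything else is treated as a bare surname.
    """
    latest_by_surname = {}
    resolved = []
    for label in labels:
        name = label.split(",")[0].strip()
        if "," in label or " " in name:
            # Full attribution: remember it as the latest for this surname.
            latest_by_surname[name.split()[-1]] = label
            resolved.append(label)
        else:
            # Bare surname: fall back to the label itself if never introduced.
            resolved.append(latest_by_surname.get(name, label))
    return resolved


# Hypothetical label order, for illustration only.
labels = [
    'JOHANNA HAMILTON, DIRECTOR, "THE TRIAL"',
    "HAMILTON",
    "ARLAN HAMILTON, FOUNDER AND MANAGING PARTNER, BACKSTAGE CAPITAL",
    "HAMILTON",
]
resolved = resolve_speakers(labels)
```

Here the first bare `HAMILTON` resolves to Johanna Hamilton and the second to Arlan Hamilton, because the lookup is overwritten each time a new full attribution with that surname appears.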

reefdog commented 5 years ago

Over in #167 we're tracking attribution errors that have been noted in the wild with the new scraper, so I think this issue has served its porpoise.

reefdog commented 5 years ago

Per some Slack convo, gonna re-open this since it's a different / higher class of problems than other attribution cleanup.

reefdog commented 5 years ago

Hoo boy, apologies to historians having to follow along here, but this actually does belong in another issue, just not #167. It belongs in #65. Done and done.