FullFact / health-misinfo-shared

Raphael health misinformation project, shared by Full Fact and Google
MIT License

Captions aren’t extracted for some videos #113

Closed andylolz closed 1 month ago

andylolz commented 1 month ago

Describe the bug

The transcript isn’t extracted for some videos. I think there are a few problems here:

1. The regex is a little bit wrong

Here’s the current regex, followed by a modified regex:

import re

sample = """
<?xml version="1.0" encoding="utf-8" ?>
<transcript>
    <text start="0.199" dur="6.37">Hello and welcome to my YouTube channel.\nMy name is Sara Peternell and I&amp;#39;m the</text>
    <text start="6.569" dur="3.121">owner of Family Nutrition Services in\nDenver, Colorado.</text>
</transcript>
"""

caption_re = re.compile(r'\<text start="(?P<start>[0-9\.]*?)" dur="[0-9\.]*?">(?P<sentence_text>.*?)</text>')
list(caption_re.finditer(sample))  # no matches: `.` doesn't match the newline embedded in each caption

modified_caption_re = re.compile(r'<text start="(?P<start>[0-9\.]*?)" dur="[0-9\.]*?">(?P<sentence_text>[^<]*)<\/text>')
list(modified_caption_re.finditer(sample))  # 2 matches: `[^<]*` happily spans newlines

2. We’re using regex here at all

I think it would be better to use an XML parser (e.g. lxml directly or BeautifulSoup) for parsing HTML and XML. Doing this with regex is a bit hairy.
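For illustration, here's a minimal sketch against the sample above using the stdlib's ElementTree (no new dependency; lxml or BeautifulSoup would look much the same). Note the sample text double-escapes entities (`&amp;#39;`), so the parser's own decoding still leaves `&#39;` behind and a final `html.unescape` is needed:

```python
import html
import xml.etree.ElementTree as ET

sample = """<?xml version="1.0" encoding="utf-8" ?>
<transcript>
    <text start="0.199" dur="6.37">Hello and welcome to my YouTube channel.
My name is Sara Peternell and I&amp;#39;m the</text>
    <text start="6.569" dur="3.121">owner of Family Nutrition Services in
Denver, Colorado.</text>
</transcript>"""

captions = [
    {
        "start": float(el.get("start")),
        "dur": float(el.get("dur")),  # keeps the duration too (point 4)
        # ElementTree decodes &amp; -> &; the second pass turns &#39; into '
        "text": html.unescape(el.text or ""),
    }
    for el in ET.fromstring(sample).iter("text")
]
```

Newlines inside captions are preserved with no regex flags to get wrong, and malformed XML raises a `ParseError` rather than silently matching nothing.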

3. Duplicate code

There’s still the problem here of duplicate code, which threw me off. Very very similar code for extracting captions exists in two different places (in youtube.py and in youtube_api.py) – both of which have this same bug.

4. We’re throwing duration information away

Bit of an aside, but we throw the caption duration away here, and later on we attempt to reconstruct it from the start timestamps. That reconstruction can never recover the final caption's duration, so it would make life a lot easier to capture the original `dur` data.
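To illustrate with the two captions from the sample above, reconstructing durations as the gaps between consecutive start times always comes up one element short:

```python
# Start times and durations from the sample transcript above.
starts = [0.199, 6.569]
durs = [6.37, 3.121]

# Reconstructing durations as gaps between consecutive starts:
recovered = [round(b - a, 3) for a, b in zip(starts, starts[1:])]
# recovered == [6.37] -- one element short: the final caption's
# duration (3.121) is gone, and any silence between captions would
# inflate the preceding caption's apparent duration.
```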

JamesMcMinn commented 1 month ago

I know most (possibly all) of the people reading this will be familiar with this legendary SO answer, and it's XML we're trying to parse, not HTML, but I don't like to miss an opportunity to share it.

ff-dh commented 1 month ago

I agree grepping the XML isn't the best solution, but as long as we can fix the regex I don't think it's worth migrating this MVP to parse the XML properly. I'm pretty sure there are ways to download the VTT-formatted subtitles for an arbitrary video and then iterate over them with webvtt-py (iirc it's used in audio-transcriber). So the choice is between parsing XML or VTT, properly. Python does have the built-in xml.dom.minidom, but I've never used it.
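For what it's worth, the minidom route is only a few lines. A quick sketch against a trimmed-down transcript (illustrative only, not tested against the real feed):

```python
from xml.dom import minidom

xml_str = ('<transcript>'
           '<text start="0.199" dur="6.37">Hello and welcome</text>'
           '<text start="6.569" dur="3.121">owner of Family Nutrition</text>'
           '</transcript>')

doc = minidom.parseString(xml_str)
captions = [
    # firstChild is the Text node; it's None for an empty <text/> element
    (el.getAttribute("start"), el.getAttribute("dur"),
     el.firstChild.data if el.firstChild else "")
    for el in doc.getElementsByTagName("text")
]
# captions[0] == ("0.199", "6.37", "Hello and welcome")
```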