Data4Democracy / assemble

NOT AN ACTIVE PROJECT -- Check readme for data sources

Create Jupyter Notebook for YouTube Video ID Extraction #57

josephpd3 closed this issue 7 years ago

josephpd3 commented 7 years ago

Create a notebook that uses pandas, re, and urllib to match common patterns for YouTube video URLs and extract the video ID. The function also detects share URLs and resolves them to find the final video. After extracting all the IDs, the notebook builds a column of reconstructed URLs (in the canonical youtube.com/watch form, with the extracted video ID) which can be used as needed; see the usage sketch after the function below.

It should be relatively easy to port this function into its own module or an existing one.

It currently extracts 39,992 video IDs from the 40,745 URLs, though some of these may point to videos that have since been disabled or removed. Whether that affects the metadata is unclear, but handling it is beyond the scope of simply retrieving the IDs.

As the code is in a notebook, I'll put the main function and accompanying regex here:

import re
import urllib.error
import urllib.request

# Regex pattern to grab the URLs with video IDs.
# Generally, these are 11-character strings of alphanumeric characters with _ and - mixed in.
id_pattern = r'''
    watch\?v=([\w\-]{11})               # Typical URL
    |\/v\/([\w\-]{11})                  # Typical URL Variant
    |youtu\.be\/([\w\-]{11})            # Shortened URL
    |\&v=([\w\-]{11})                   # Encoded URL
    |embed\/([\w\-]{11})                # Embedded URL
    |watch\%3Fv\%3D([\w\-]{11})         # Really nasty referral URL
    |savieo\.com\/youtube\/([\w\-]{11}) # Seems to be a site for linking to videos
'''

# Gotta be able to grab share URLs from the rejects
shared_url_pattern = r'shared\?ci=(.+)'

# Store failures to rummage through...
failed_urls = []

def grab_youtube_id(url):
    """
    Given a URL, attempt to parse out a YouTube video ID.
    Falls back to resolving a shared video URL when the URL
    matches the share pattern.
    """
    try:
        # Sift through the entire group tuple so the one non-None match is returned
        found_groups = re.search(id_pattern, url, re.VERBOSE).groups()
        return [g for g in found_groups if g is not None][0]
    except (AttributeError, IndexError):
        # AttributeError: no match at all; IndexError: nothing captured
        if re.search(shared_url_pattern, url) is not None:
            # Try to resolve a share URL to get the final video location
            try:
                resolved_share = urllib.request.urlopen(url)
                # Grab the actual video ID from the resolved share link,
                # again keeping only the group that actually matched
                found_groups = re.search(id_pattern,
                                         resolved_share.geturl(),
                                         re.VERBOSE).groups()
                return [g for g in found_groups if g is not None][0]
            except (urllib.error.URLError, AttributeError, IndexError):
                # Away with you!
                failed_urls.append(url)
                return None
        else:
            # Append the rejects to the reject list!
            failed_urls.append(url)
            return None
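
For reference, here's a rough sketch of the pandas usage described above; the DataFrame and column names (df, url, video_id, reconstructed_url) are placeholders of mine rather than the notebook's actual names:

import pandas as pd

# Hypothetical sample of scraped links
df = pd.DataFrame({'url': [
    'https://www.youtube.com/watch?v=dQw4w9WgXcQ',
    'https://youtu.be/dQw4w9WgXcQ',
]})

# Extract the IDs with the function above
df['video_id'] = df['url'].apply(grab_youtube_id)

# Rebuild canonical watch URLs from the extracted IDs
df['reconstructed_url'] = df['video_id'].map(
    lambda vid: None if vid is None else 'https://www.youtube.com/watch?v=' + vid)
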
josephpd3 commented 7 years ago

The failed_urls bit can either be excluded entirely or tied to the function with a closure so it can be used just as seamlessly with pandas as it is in the notebook.
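
A rough sketch of that closure idea, reusing id_pattern and the imports from the issue body; make_youtube_id_grabber and the df from the earlier sketch are hypothetical names, and the share-URL fallback is omitted for brevity:

def make_youtube_id_grabber():
    """Build an ID extractor that carries its own failure list."""
    failed_urls = []

    def grab(url):
        # Same matching logic as grab_youtube_id above
        match = re.search(id_pattern, url, re.VERBOSE)
        if match is not None:
            found = [g for g in match.groups() if g is not None]
            if found:
                return found[0]
        failed_urls.append(url)
        return None

    # Expose the closed-over list so callers can rummage through the rejects
    grab.failed_urls = failed_urls
    return grab

grab_id = make_youtube_id_grabber()
df['video_id'] = df['url'].apply(grab_id)
print(len(grab_id.failed_urls))  # rejects collected during the apply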

josephpd3 commented 7 years ago

I updated the notebook to export the IDs etc to a .csv file. Let me know if/where you'd like me to upload it!
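
For reference, the export is presumably a one-liner along these lines, assuming the hypothetical columns from my earlier sketch (the file name is also a placeholder):

# Write the IDs and reconstructed URLs out for downstream use
df[['url', 'video_id', 'reconstructed_url']].to_csv('youtube_video_ids.csv', index=False)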

bstarling commented 7 years ago

Great work, thank you for this PR! We will work to get this into the data cleaning pipeline we are building on top of our Eventador infrastructure. Once we get to that point, I will work with you to integrate it.

When you get a moment, can you direct-message me on Slack for details on what to do with the output file? Thanks again, great PR.

Close #54