Closed josephpd3 closed 7 years ago
The `failed_urls` bit can either be excluded entirely or tied to the function with a closure, so it can be used just as seamlessly with pandas as it is in the notebook.
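The closure suggestion above could look something like this minimal sketch; the `make_extractor` name, the regex, and the exact notebook function it replaces are assumptions, not the actual notebook code:

```python
import re

def make_extractor():
    """Return (extract, failed_urls) with failed_urls captured in a closure.

    Sketch only: instead of a notebook-level failed_urls list, the list is
    created here and closed over by the extractor, so the extractor can be
    passed straight to pandas, e.g. df["url"].apply(extract).
    """
    failed_urls = []
    # Illustrative pattern covering watch?v=, youtu.be/, and embed/ URLs;
    # the notebook's actual regex may differ.
    pattern = re.compile(r"(?:v=|youtu\.be/|embed/)([A-Za-z0-9_-]{11})")

    def extract(url):
        match = pattern.search(url)
        if match:
            return match.group(1)
        failed_urls.append(url)  # record anything we could not parse
        return None

    return extract, failed_urls
```

Because `failed_urls` travels with the function, the caller can inspect the failures after the `apply` without any notebook-global state.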
I updated the notebook to export the IDs etc to a .csv file. Let me know if/where you'd like me to upload it!
Great work, thank you for this PR! We will work to get this into our data cleaning pipeline we are building on top of our eventador infrastructure. When we get to that point I will work with you to integrate.
When you get a moment can you direct msg me on slack for details with what to do with the output file. Thanks again, great PR.
Close #54
Create notebook that uses `pandas`, `re`, and `urllib` to match common patterns for YouTube video URLs and extract the video ID. The function also takes into account whether it is a share URL and resolves it to get the final video. After extracting all the IDs, the notebook also creates a column of reconstructed URLs (of the typical variety for YouTube, with the obvious video ID) which can be used as necessary. It should be relatively easy to port this function over to either its own module or another, existing module.
It currently extracts 39,992 video IDs from the 40,745 URLs, but some of these may lead to videos that have since been disabled or removed. I have no idea if this affects the metadata, but it is likely past the scope of simply retrieving the IDs.
As the code is in a notebook, I'll put the main function and accompanying regex here:
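The notebook code itself isn't reproduced in this thread, so here is a hedged sketch of the approach it describes (regex matching of common URL forms, plus urllib-based redirect following for share links); the names `extract_video_id`, `resolve_share_url`, and the exact regex are illustrative assumptions, not the PR's actual code:

```python
import re
import urllib.request

# Illustrative regex for common YouTube URL forms (watch?v=, embed/, v/,
# youtu.be/), capturing the 11-character video ID.
YOUTUBE_ID_RE = re.compile(
    r"(?:youtube\.com/(?:watch\?.*?v=|embed/|v/)|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)

def extract_video_id(url):
    """Return the 11-character video ID, or None if no pattern matches."""
    match = YOUTUBE_ID_RE.search(url)
    return match.group(1) if match else None

def resolve_share_url(url):
    """Follow redirects on a share URL and return the final URL.

    Requires network access; the notebook presumably does something
    similar with urllib before extracting the ID.
    """
    with urllib.request.urlopen(url) as resp:
        return resp.geturl()

def reconstruct_url(video_id):
    """Rebuild a canonical watch URL from an extracted ID."""
    return "https://www.youtube.com/watch?v=" + video_id
```

In pandas terms, the notebook's per-column steps would then reduce to something like `df["video_id"] = df["url"].apply(extract_video_id)` followed by `df["clean_url"] = df["video_id"].apply(reconstruct_url)`.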