Strip youtube ID from youtube links

bstarling commented 7 years ago

The problem:

A goal of an in-progress data pipeline is to extract youtube links found in text blobs, then poll the youtube API to get additional video metadata. In order to poll the API, we need to extract the youtube video ID from the URL.

Tasks

Determine if link is actually a link to a video
Isolate youtube ID
Create output CSV which contains original_url and youtube_id
Submit a PR to this repository as a standalone file or Jupyter notebook. Even if you only have a partial solution, a PR is encouraged so others can pick up where you left off or provide suggestions.

Additional Info

In the base case, links will look like https://www.youtube.com/watch?v=DiTECkLZ8HM, where the youtube ID is DiTECkLZ8HM. Create a CSV file with two columns: original_url, youtube_id.
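As a rough starting point, the base case could be handled with the standard library alone. This is only a sketch: the function names `base_case_id` and `write_rows` are made up for illustration, and it assumes the ID is always carried in the `v` query parameter, which only holds for the base-case format.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs


def base_case_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def write_rows(urls, out):
    """Write a header plus one original_url, youtube_id row per URL with an ID."""
    writer = csv.writer(out)
    writer.writerow(["original_url", "youtube_id"])
    for url in urls:
        video_id = base_case_id(url)
        if video_id is not None:  # skip URLs with no `v` parameter
            writer.writerow([url, video_id])


# Write to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_rows(["https://www.youtube.com/watch?v=DiTECkLZ8HM"], buffer)
print(buffer.getvalue())
```

Using `csv.writer` rather than joining strings by hand keeps the output valid if a URL happens to contain a comma or a quote.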

Links will come in many formats.
Some examples:

https://youtu.be/C-XXXXXXXXXXX
https://www.youtube.com/watch?v=XXXXXXXXXXX&t=538s
https://www.youtube.com/watch?v=XXXXXXXXXXX&list=LLqm0Q-XmsHWX_Gklk-NAUAw&index=43
https://www.youtube.com/user/XXXXXX/videos (would be discarded since it is not a link to a video)
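One possible way to cover the formats listed above is to branch on the host and path after parsing the URL. The sketch below is an assumption-laden starting point, not a complete solution: `extract_youtube_id` and `YOUTUBE_HOSTS` are hypothetical names, the `m.youtube.com` host is a guess at a likely variant, and other forms that may appear in the sample data are not handled.

```python
from urllib.parse import urlparse, parse_qs

# Hosts treated as youtube.com watch pages (assumed list, extend as needed).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_youtube_id(url):
    """Return the video ID for a recognised video link, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links carry the ID as the first path segment.
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    # Watch links carry the ID in `v`; extra parameters such as
    # `t`, `list`, or `index` are ignored.
    if host in YOUTUBE_HOSTS and parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    # Anything else (user pages, channels, other sites) is discarded.
    return None


for link in [
    "https://youtu.be/C-XXXXXXXXXXX",
    "https://www.youtube.com/watch?v=XXXXXXXXXXX&t=538s",
    "https://www.youtube.com/user/XXXXXX/videos",
]:
    print(link, extract_youtube_id(link))
```

The sketch deliberately does not validate the length or character set of the ID, so whether a returned value is a real video would still need to be confirmed, for example when polling the API.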

A sample of 40,000 URLs to be used for testing purposes can be found here

Warning: this work requires you to deal with highly explicit and offensive content from the 4chan /pol/ board.

josephpd3 commented 7 years ago

@bstarling I'd love to work on this this weekend, though I'm not available until later Friday night and Saturday night. Is this reserved for full-time participants in the Hackathon, or can I take a crack at it?

bstarling commented 7 years ago

@josephpd3 that is no problem. The hackathon is remote / asynchronous, so you are welcome to tackle it this weekend. My only request is that if you end up not having time, just come back and let us know so we can free it up for someone else to work on. Feel free to drop in chat if you want to see if anyone else is interested in working with you.

bstarling commented 7 years ago

Closed with #57

Data4Democracy / assemble