URL domain extraction - Githubissues

bstarling commented 7 years ago

Problem:

In order to do analysis on types of links being shared we need a reliable way to extract & count domains that appear in a list of URLs.

Tasks:

There are libraries that do this, but none of them are perfect. (It's fine to leverage a library, but try to do your own validation on the results.)
Attempt to alias domains known to be associated and count them together: youtu.be and youtube.com are both youtube.
Make sure you're capturing the actual domain, e.g. for forums.website.com the domain is website.
Output should be a list of domain counts in descending order.
Sort out shortened links and publish them as a separate file, e.g. bit.ly, t.co.
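
As a starting point for the extraction and aliasing tasks above, here is a minimal sketch using only the standard library. The names `extract_domain`, `ALIASES`, and `TWO_LEVEL_SUFFIXES` are hypothetical, and both tables are deliberately tiny. It assumes the "actual domain" is the label just left of the public suffix, which is only a heuristic; a complete solution would need the full Public Suffix List (or a library built on it) and your own validation of the results.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny tables; a real run would need fuller ones.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}
ALIASES = {"youtu.be": "youtube", "youtube.com": "youtube"}


def extract_domain(url):
    """Return a short domain label such as 'youtube', or None if no host is found."""
    if "://" not in url:
        # urlparse only fills in the hostname when a scheme (or //) is present.
        url = "http://" + url
    try:
        host = urlparse(url).hostname
    except ValueError:
        # Raised for some malformed URLs, e.g. a broken IPv6 literal.
        return None
    if not host:
        return None

    labels = host.rstrip(".").split(".")
    # Assumption: the suffix is one label unless it is in the small table above.
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return host  # nothing left of the suffix, so return the host unchanged

    registered = ".".join(labels[-(suffix_len + 1):])
    # Known aliases win; otherwise use the label just left of the suffix.
    return ALIASES.get(registered, labels[-(suffix_len + 1)])
```

For example, `extract_domain("https://forums.website.com/thread/1")` gives `website`, and both `https://youtu.be/abc` and `www.youtube.com/watch?v=1` give `youtube`. Bare IP addresses and suffixes missing from the table are not handled sensibly, so spot-check the output.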

You do not need a full solution in order to submit a PR. If you have questions drop in to assemble chat and see if anyone else is interested in working on the problem.

You can download the data here or load directly to pandas via

import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/youtube_urls.csv')

Post-cleaning, the output should be a list of domains and their counts, as well as a separate file of all shortened links where the domain is not known. (Recommend you do not try to visit these shortened links.)

youtube, 500
facebook, 200
twitter, 150
wikipedia, 100
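
One possible way to produce output in the shape shown above, assuming the domain labels have already been extracted into a list. The function names `count_domains` and `format_counts` are made up for illustration, and `collections.Counter` is just one option for the counting step.

```python
from collections import Counter


def count_domains(domains):
    """Count domain labels and return (domain, count) pairs, largest count first."""
    # Skip None or empty entries left over from URLs that could not be parsed.
    counts = Counter(d for d in domains if d)
    # Sort by count descending, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def format_counts(pairs):
    """Render pairs as 'domain, count' lines, one per domain."""
    return "\n".join(f"{domain}, {count}" for domain, count in pairs)
```

Calling `format_counts(count_domains(labels))` on a list of labels yields one `domain, count` line per domain, ready to write to a file.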

Warning: this work requires you to deal with highly explicit and offensive content from the 4chan /pol/ board. Please do not visit the links you find, as some may contain malware or offensive content.

strongdan commented 7 years ago

I'm a beginner looking to gain a bit more experience and I'd be willing to attempt this. How quickly do you expect a PR?

bstarling commented 7 years ago

Hey @strongdan that sounds good. No time limit. The only request is that if you end up not finding time to finish, you post back here to free it up for someone else. Feel free to post a partial solution so others can collaborate, or drop by the #assemble channel with any questions.

strongdan commented 7 years ago

Sounds great! I'll try to get something completed this weekend and update you with what I have.

strongdan commented 7 years ago

I didn't see that @harish-garg already solved this one. Great job!

bstarling commented 7 years ago

Still room for improvement or alternative approaches.

strongdan commented 7 years ago

I had to post on StackOverflow about sorting out short URLs: http://stackoverflow.com/questions/43219063/detecting-a-short-url-using-python

It sounds tough to implement. I can try to come up with a list of known short urls or match on a regular expression. I will most likely need some help with the cleaning and validation of results.
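
A rough sketch of the known-list idea, matching on the parsed hostname rather than on the raw string so that a host like microsoft.co is not mistaken for t.co. Only bit.ly and t.co come from the issue; the other entries in `KNOWN_SHORTENERS` are my own additions, the list is far from complete, and the function names are hypothetical. Nothing here visits a link.

```python
from urllib.parse import urlparse

# bit.ly and t.co are from the issue; the rest are assumed additions.
KNOWN_SHORTENERS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com", "ow.ly"}


def is_short_url(url):
    """Return True if the URL's host is one of the known shortener domains."""
    if "://" not in url:
        # urlparse needs a scheme (or //) to recognise the host part.
        url = "http://" + url
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        return False  # malformed URL, treat as not shortened
    if host.startswith("www."):
        host = host[4:]
    return host in KNOWN_SHORTENERS


def split_short_urls(urls):
    """Partition URLs into (shortened, everything_else) without fetching them."""
    short, other = [], []
    for url in urls:
        (short if is_short_url(url) else other).append(url)
    return short, other
```

The `short` list could then be written to the separate file, and `other` passed on to domain counting. Shorteners missing from the list will slip through, so the results still need manual validation.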

Data4Democracy / assemble

URL domain extraction #56
