Data4Democracy / assemble

NOT AN ACTIVE PROJECT -- Check readme for data sources
MIT License

URL domain extraction #56

Closed bstarling closed 6 years ago

bstarling commented 7 years ago

Problem:

In order to do analysis on types of links being shared we need a reliable way to extract & count domains that appear in a list of URLs.

Tasks:

You do not need a full solution in order to submit a PR. If you have questions, drop in to the assemble chat and see if anyone else is interested in working on the problem.

You can download the data here or load directly to pandas via

import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/youtube_urls.csv')

After cleaning, the output should be a list of domains and their counts, plus a separate file of all shortened links where the domain is not known. (Recommend you do not try to visit these shortened links.)

youtube, 500
facebook, 200
twitter, 150
wikipedia, 100
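
A minimal sketch of the counting step, using only the standard library. The function names `extract_domain` and `count_domains` are made up for illustration, and the sketch keeps the full hostname (for example `youtube.com`) rather than the bare label shown above. Reducing hostnames to a registered domain reliably needs the Public Suffix List (a library such as `tldextract` is one option), which this sketch does not attempt.

```python
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercased hostname without a leading 'www.', or None."""
    url = url.strip()
    # urlparse only fills in the host when a scheme is present,
    # so assume http for bare entries like "youtube.com/watch?v=...".
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlparse(url).hostname
    except ValueError:
        # Malformed input (e.g. a broken IPv6 literal) is skipped.
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Count how often each domain appears in an iterable of URL strings."""
    counts = Counter()
    for url in urls:
        domain = extract_domain(url)
        if domain:
            counts[domain] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "https://www.youtube.com/watch?v=1",
        "http://youtube.com/x",
        "https://twitter.com/a",
    ]
    for domain, count in count_domains(sample).most_common():
        print(f"{domain}, {count}")
```

To run it on the real data, pass the URL column of the DataFrame loaded above to `count_domains`; the column name is not stated in this issue, so check `df.columns` first.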

Warning: this work requires you to deal with highly explicit and offensive content from the pol 4chan board. Please do not visit the links you find, as some may contain malware or offensive content.

strongdan commented 7 years ago

I'm a beginner looking to gain a bit more experience and I'd be willing to attempt this. How quickly do you expect a PR?

bstarling commented 7 years ago

Hey @strongdan that sounds good. No time limit. The only request is that if you end up not finding time to finish, you post back here to free it up for someone else. Feel free to post a partial solution so others can collaborate, or drop by the #assemble channel with any questions.

strongdan commented 7 years ago

Sounds great! I'll try to get something completed this weekend and update you with what I have.

strongdan commented 7 years ago

I didn't see that @harish-garg already solved this one. Great job!

bstarling commented 7 years ago

Still room for improvement or alternative approaches.

strongdan commented 7 years ago

I had to post on StackOverflow about sorting out short URLs: http://stackoverflow.com/questions/43219063/detecting-a-short-url-using-python

It sounds tough to implement. I can try to come up with a list of known URL shorteners or match on a regular expression. I will most likely need some help with the cleaning and validation of results.
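
One way the known-list idea could look, as a rough sketch rather than a finished solution. The `KNOWN_SHORTENERS` set is a small illustrative sample, not an exhaustive list, and the names `is_short_url` and `split_short_urls` are hypothetical. Nothing here makes a network request, so no shortened link is ever visited.

```python
from urllib.parse import urlparse

# Illustrative sample only; a real list would need to be much longer
# and maintained over time.
KNOWN_SHORTENERS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}


def _host(url):
    """Return the lowercased hostname without 'www.', or None if unparseable."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # assume http when the scheme is missing
    try:
        host = urlparse(url).hostname
    except ValueError:
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def is_short_url(url):
    """True if the URL's host is in the known-shortener list."""
    return _host(url) in KNOWN_SHORTENERS


def split_short_urls(urls):
    """Partition URLs into (shortened, everything_else) without visiting any."""
    shortened, others = [], []
    for url in urls:
        (shortened if is_short_url(url) else others).append(url)
    return shortened, others


if __name__ == "__main__":
    short, rest = split_short_urls(
        ["http://bit.ly/abc", "https://www.youtube.com/watch?v=1"]
    )
    print(short)
    print(rest)
```

The `shortened` list can be written to its own file, as the issue asks, while `others` goes on to domain counting. A list-based check will miss shorteners it has not seen, so the results would still need manual validation.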