bstarling closed this issue 6 years ago
I'm a beginner looking to gain a bit more experience and I'd be willing to attempt this. How quickly do you expect a PR?
Hey @strongdan, that sounds good. No time limit. The only request is that if you end up not finding time to finish, you post back here to free it up for someone else. Feel free to post a partial solution so others can collaborate, or drop by the #assemble channel with any questions.
Sounds great! I'll try to get something completed this weekend and update you with what I have.
I didn't see that @harish-garg already solved this one. Great job!
Still room for improvement or alternative approaches.
I had to post on StackOverflow about sorting out short URLs: http://stackoverflow.com/questions/43219063/detecting-a-short-url-using-python
It sounds tough to implement. I can try to come up with a list of known short URLs or match on a regular expression. I will most likely need some help with cleaning and validating the results.
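A rough sketch of the known-list approach mentioned above, in case it helps. The `KNOWN_SHORTENERS` set and the function names are assumptions for illustration, not part of any existing solution, and the list is deliberately incomplete.

```python
from urllib.parse import urlparse

# Assumed, incomplete list of shortener hosts; extend it as new ones turn up.
KNOWN_SHORTENERS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com", "ow.ly"}


def get_host(url):
    """Return the lowercase host of a URL, or "" if it cannot be parsed."""
    # urlparse only fills in the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        return ""
    # Drop a leading "www." so "www.bit.ly" matches "bit.ly".
    return host[4:] if host.startswith("www.") else host


def is_short_url(url):
    """True if the URL's host is in the known shortener list."""
    return get_host(url) in KNOWN_SHORTENERS
```

This only checks the host string and never requests the link, which fits the advice not to visit anything in the data.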
Problem:
In order to analyze the types of links being shared, we need a reliable way to extract and count the domains that appear in a list of URLs.
Tasks:
- youtu.be and youtube.com are both youtube
- forums.website.com: the domain is website
- bit.ly, t.co (shortened links)
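The examples in the task list can be sketched as a small normalization function. This is a minimal sketch under stated assumptions: the `ALIASES` table and the name `extract_domain` are hypothetical, and taking the second-to-last label of the host is a naive heuristic that breaks on multi-part suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse

# Assumed alias table mapping alternate hosts to one canonical name.
ALIASES = {"youtu.be": "youtube"}


def extract_domain(url):
    """Return a normalized domain name for a URL, or "" if unparseable."""
    # urlparse only fills in the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        return ""
    if host in ALIASES:
        return ALIASES[host]
    parts = host.split(".")
    # Naive: the label just before the final suffix, so subdomains are dropped.
    return parts[-2] if len(parts) >= 2 else host


print(extract_domain("https://youtu.be/abc"))
print(extract_domain("https://www.youtube.com/watch?v=1"))
print(extract_domain("http://forums.website.com/thread/1"))
```

A third-party library such as `tldextract`, which uses the Public Suffix List, is one option if the naive split turns out to be too lossy.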
You do not need a full solution in order to submit a PR. If you have questions, drop in to the assemble chat and see if anyone else is interested in working on the problem.
You can download the data here or load directly to pandas via
Post-cleaning output should be a list of domains and their counts, as well as a separate file of all shortened links where the domain is not known. (We recommend you do not try to visit these shortened links.)
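One possible shape for those two outputs is sketched below. The file names, the `SHORTENERS` set, and the function names are all assumptions. For brevity the sketch counts by raw host; whatever domain normalization you settle on would replace `host_of`.

```python
import csv
from collections import Counter
from urllib.parse import urlparse

# Assumed, incomplete list of shortener hosts.
SHORTENERS = {"bit.ly", "t.co"}


def host_of(url):
    """Return the lowercase host of a URL, or "" if it cannot be parsed."""
    if "://" not in url:
        url = "http://" + url
    try:
        return urlparse(url).hostname or ""
    except ValueError:
        return ""


def summarize(urls):
    """Split URLs into per-host counts and a list of unresolved short links."""
    counts = Counter()
    unresolved = []
    for url in urls:
        host = host_of(url)
        if host in SHORTENERS:
            unresolved.append(url)  # real domain unknown, so keep the raw link
        elif host:
            counts[host] += 1
    return counts, unresolved


def write_outputs(counts, unresolved,
                  counts_path="domain_counts.csv",
                  short_path="short_links.txt"):
    """Write the counts as CSV and the short links one per line."""
    with open(counts_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["domain", "count"])
        writer.writerows(counts.most_common())
    with open(short_path, "w") as f:
        f.write("\n".join(unresolved))
```

Keeping `summarize` free of file I/O makes it easy to check on a handful of URLs before running it over the full data set.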
Warning: this work requires you to deal with highly explicit and offensive content from the pol 4chan board. Please do not visit the links you find, as some may contain malware or offensive content.