Data4Democracy / assemble

NOT AN ACTIVE PROJECT -- Check readme for data sources
MIT License

4chan link extraction & cleanup #48

Closed bstarling closed 7 years ago

bstarling commented 7 years ago

Problem:

Example:

"com": "<a href=\"#p116190305\" class=\"quotelink\">&gt;&gt;116190305</a>
<br>redacted are simply jealous of redacted.<br>https://youtu.be/k4yXQkG2s1E"

Additional info:

After cleaning, the above should produce something along the lines of the following (use your own judgement after playing with the data):

{
    "text": "redacted are simply jealous of redacted",
    "external_links": ["https://youtu.be/k4yXQkG2s1E"]
}
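One way the cleanup described above might be approached is sketched below in Python. This is only an illustration, not the code from the linked PR: the function name `clean_comment` and the regular expressions are assumptions, and real board data will likely need more cases than this handles. It drops the quote-link anchor, turns `<br>` into whitespace, strips any remaining tags, unescapes HTML entities, and pulls out `http(s)` URLs.

```python
import html
import re

# Hypothetical helpers for illustration; none of these names come from the repo.
QUOTELINK_RE = re.compile(r'<a\b[^>]*class="quotelink"[^>]*>.*?</a>', re.DOTALL)
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')
URL_RE = re.compile(r'https?://[^\s<>"]+')


def clean_comment(com):
    """Split a raw "com" field into plain text and a list of external links."""
    # Remove reply references such as <a class="quotelink">&gt;&gt;123</a>.
    without_quotes = QUOTELINK_RE.sub(' ', com)
    # Treat line breaks as whitespace so adjacent words do not merge.
    with_breaks = BR_RE.sub('\n', without_quotes)
    # Strip any other markup, then decode entities like &gt; and &amp;.
    unescaped = html.unescape(TAG_RE.sub(' ', with_breaks))
    # Collect the URLs, then remove them from the text itself.
    links = URL_RE.findall(unescaped)
    text = ' '.join(URL_RE.sub(' ', unescaped).split())
    return {"text": text, "external_links": links}


sample = (
    '<a href="#p116190305" class="quotelink">&gt;&gt;116190305</a>'
    '<br>redacted are simply jealous of redacted.'
    '<br>https://youtu.be/k4yXQkG2s1E'
)
print(clean_comment(sample))
```

On the example post this keeps the sentence's trailing period in `text`, which differs slightly from the target output above. Whether to trim punctuation, and how to treat URLs followed directly by punctuation, is left as a judgement call.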

Warning: this work requires you to deal with highly explicit and offensive content from the 4chan /pol/ board.

carol-tonight commented 7 years ago

Starting this

bstarling commented 7 years ago

Issue is still open for anyone looking to get started.

subbuvenk commented 7 years ago

If nobody is assigned this task, I would love to try my hand at it. This will be my first attempt at contributing to D4D.

bstarling commented 7 years ago

Sounds good @subbuvenk94. @carolph3232 if you find time during the weekend hackathon feel free to drop into chat and tag team.

carol-tonight commented 7 years ago

@subbuvenk94 I've got a pretty good start on this, but it's not perfect. I'll submit a PR so you can see what I've done and we can collaborate.

update: here's the PR https://github.com/Data4Democracy/assemble/pull/55

subbuvenk commented 7 years ago

@carolph3232 Nice work there! I think you have it covered all by yourself. I didn't see this earlier, my bad. Thanks for the offer to collaborate 👍