Closed: bstarling closed this 7 years ago
Starting this
Issue is still open for anyone looking to get started.
If nobody is assigned this task, I would love to try my hand at it. This will be my first attempt at contributing to D4D.
Sounds good @subbuvenk94. @carolph3232 if you find time during the weekend hackathon feel free to drop into chat and tag team.
@subbuvenk94 I've got a pretty good start on this, but it's not perfect. I'll submit a PR so you can see what I've done and we can collaborate.
update: here's the PR https://github.com/Data4Democracy/assemble/pull/55
@carolph3232 Nice work there! I think you have it covered all by yourself. I didn't see this earlier, my bad. Thanks for the offer to collaborate 👍
Problem:
The comment (`com`) field is pretty rough. It includes HTML markup and other random garbage. Ex:
Additional info:
import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/chan_example.csv', parse_dates=['created_at'])
Post cleaning the above should generate something along the lines of the below (use your own judgement after playing with the data):
Warning: this work requires you to deal with highly explicit and offensive content from the `pol` 4chan board.
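
For anyone picking this up, here is a minimal sketch of the kind of cleaning described above, using only the Python standard library. It is not the approach taken in the linked PR. The function name `clean_comment` and the regexes are illustrative assumptions, and the sample string in the comments is made up, not taken from the dataset. It assumes the markup in `com` is simple enough to strip with regexes; a real HTML parser may be more robust.

```python
import html
import re

# Assumption: tags in `com` are simple enough for regex stripping.
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)   # line-break tags
TAG_RE = re.compile(r"<[^>]+>")                   # any other tag
WS_RE = re.compile(r"\s+")                        # runs of whitespace


def clean_comment(raw):
    """Return a plain-text version of one `com` value (hypothetical helper)."""
    # Missing comments load as NaN (a float) in pandas; treat them as empty.
    if not isinstance(raw, str):
        return ""
    text = BR_RE.sub(" ", raw)      # keep a word boundary where <br> was
    text = TAG_RE.sub("", text)     # drop remaining markup
    text = html.unescape(text)      # &gt; -> >, &#039; -> ', etc.
    return WS_RE.sub(" ", text).strip()


# Made-up sample resembling board markup, not a row from the dataset.
sample = '<a href="#p1" class="quotelink">&gt;&gt;1</a><br>Hello&#039;s <span class="quote">&gt;world</span>'
cleaned = clean_comment(sample)
```

With the DataFrame loaded as above, this could be applied with something like `df['com'].apply(clean_comment)`. Which other "garbage" to drop (quote links, URLs, and so on) is a judgement call to make after looking at the data.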