Open gati opened 7 years ago
I'll be working on this today, slack handle @hipplec.
Is this one of the datasets on the s3 bucket or one of the private data.world sets?
If anyone is interested in progress throughout the day I'll make sure to push my commits to my fork often.
Awesome, thanks @C-Hipple! We haven't been sure what's valuable/interesting in the 4chan data, so it'll be really helpful to have a sense of what's in there.
I opened a PR against the assemble repo here with two notebooks: one addresses this issue by exploring and cleaning the sample 4chan dataset in the s3 bucket, and the other aggregates the bidaily JSON scrapes in the bucket into a dataframe to make analysis simpler for others.
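For anyone who wants to try the aggregation step without opening the notebook, here is a rough sketch of the idea, not the code from the PR. It assumes the scrapes have already been downloaded to a local folder and that each file is either a bare list of posts or an object with a `posts` key; both the layout and the `load_scrapes` name are guesses for illustration.

```python
import json
from pathlib import Path


def load_scrapes(directory):
    """Flatten every *.json scrape in `directory` into one list of post dicts."""
    rows = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumed layout: {"posts": [...]} or a bare list of posts.
        posts = data.get("posts", []) if isinstance(data, dict) else data
        for post in posts:
            row = dict(post)
            # Remember which scrape file the post came from.
            row["source_file"] = path.name
            rows.append(row)
    return rows
```

The returned list of dicts can be handed straight to `pandas.DataFrame(rows)` if you want a dataframe. Posts that show up in more than one scrape are not deduplicated here.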
The D4D community has acquired a few million recent 4chan posts. This issue is to explore that data, using techniques such as topic modeling, network analysis, and social media analytics (who posts most often, what times of day are popular, whose posts get the most replies, average comment thread lengths, etc.).
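As a starting point for the simpler analytics listed above, here is a minimal sketch using only the standard library. It assumes each post is a dict with `no` (post id), `resto` (id of the thread it replies to, 0 for a thread opener), and `time` (Unix timestamp); those field names are an assumption about the dataset, so adjust them to whatever the real schema is. The `summarize` helper is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def summarize(posts):
    """Return basic activity stats for a list of post dicts."""
    # Posts per hour of day (UTC).
    by_hour = Counter(
        datetime.fromtimestamp(p["time"], tz=timezone.utc).hour for p in posts
    )
    # Number of replies each thread received, keyed by thread id.
    replies = Counter(p["resto"] for p in posts if p.get("resto"))
    # Thread openers present in this sample (resto is 0 or missing).
    threads = [p["no"] for p in posts if not p.get("resto")]
    # Average replies per thread, counting only threads whose opener is in the sample.
    avg_replies = sum(replies[t] for t in threads) / len(threads) if threads else 0.0
    return {
        "posts_by_hour": dict(by_hour),
        "busiest_hour": by_hour.most_common(1)[0][0] if by_hour else None,
        "most_replied_thread": replies.most_common(1)[0][0] if replies else None,
        "avg_replies_per_thread": avg_replies,
    }
```

Topic modeling and network analysis need more than this, but the same flat list of posts should work as input.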
Fair warning: This content will likely be a little gross, because it's 4chan.