How to construct corpus from raw Reddit dataset?

CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.

MIT License

I don't have a code example for you, but at a high level, the construction process was something like this:

1. Pick X popular subreddits.
2. Sample Y conversations from each subreddit corpus within a given time period, where each conversation has at least N utterances.
3. Take these conversations / utterances and put them in a single corpus.

reddit-corpus-small did this with X=100, Y=100 and N=100. One peculiarity is that it defines conversations as starting from a top-level comment, whereas in the subreddit corpora themselves, the conversation starts from the Reddit post.
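
For anyone who wants a starting point, the three steps above could be sketched roughly as follows. This is only an illustration in plain Python, not the script that was actually used to build reddit-corpus-small. The function name `sample_conversations`, its parameters, and the nested-dict input layout are all hypothetical. It assumes the input has already been restricted to the time period of interest, and it uses total utterance count as a stand-in for "popular". It does not handle the top-level-comment peculiarity mentioned above.

```python
import random
from collections import Counter


def sample_conversations(subreddit_convos, num_subreddits, convos_per_subreddit,
                         min_utterances, seed=0):
    """Sample conversations across subreddits and merge them into one collection.

    subreddit_convos: hypothetical layout, a dict mapping
        subreddit name -> {conversation id -> list of utterances},
        assumed to be pre-filtered to the desired time period.
    Returns a dict mapping (subreddit, conversation id) -> list of utterances.
    """
    rng = random.Random(seed)

    # Step 1: pick the X most popular subreddits.
    # Assumption: popularity is approximated by total utterance count.
    sizes = Counter({
        name: sum(len(utts) for utts in convos.values())
        for name, convos in subreddit_convos.items()
    })
    chosen = [name for name, _ in sizes.most_common(num_subreddits)]

    combined = {}
    for name in chosen:
        convos = subreddit_convos[name]
        # Step 2: keep conversations with at least N utterances,
        # then sample up to Y of them (fewer if not enough qualify).
        eligible = sorted(cid for cid, utts in convos.items()
                          if len(utts) >= min_utterances)
        picked = rng.sample(eligible, min(convos_per_subreddit, len(eligible)))
        # Step 3: put the sampled conversations into a single collection.
        for cid in picked:
            combined[(name, cid)] = convos[cid]
    return combined
```

The resulting utterances could then be wrapped in ConvoKit `Utterance` objects and passed to `Corpus(utterances=...)` to get an actual corpus; how you map raw Reddit fields onto utterance fields depends on your dump format.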

How to construct corpus from raw Reddit dataset? #182