CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

How to construct corpus from raw Reddit dataset? #182

Closed TaoRuan-Campus closed 1 year ago

TaoRuan-Campus commented 1 year ago

from convokit import Corpus, download
corpus = Corpus(filename=download("reddit-corpus-small"))

Could you please give an example of how to construct a Reddit corpus such as "reddit-corpus-small"?

calebchiam commented 1 year ago

I don't have a code example for you, but at a high level, the construction process was something like:

  1. Pick X popular subreddits
  2. Sample Y conversations from each subreddit corpus within a given time period, where each conversation has at least N utterances
  3. Then take these conversations / utterances and put them in a single corpus.
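The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that built the released corpus: utterances are plain dicts rather than ConvoKit objects, the field names (`id`, `conversation_id`, `subreddit`, `timestamp`) and the function `sample_conversations` are hypothetical, and "popular" is approximated here by utterance count.

```python
import random
from collections import defaultdict


def sample_conversations(utterances, num_subreddits, convos_per_subreddit,
                         min_utterances, start=None, end=None, seed=0):
    """Sample conversations per subreddit from a flat list of utterance dicts.

    Each dict is assumed to have the keys 'id', 'conversation_id',
    'subreddit' and 'timestamp' (hypothetical field names).
    """
    rng = random.Random(seed)

    # Group utterances by subreddit, then by conversation.
    by_sub = defaultdict(lambda: defaultdict(list))
    for utt in utterances:
        by_sub[utt["subreddit"]][utt["conversation_id"]].append(utt)

    # Step 1: pick the X most active subreddits (assumption: activity is
    # measured by total utterance count; ties are broken by name).
    def activity(sub):
        return sum(len(utts) for utts in by_sub[sub].values())

    top_subs = sorted(by_sub, key=lambda s: (-activity(s), s))[:num_subreddits]

    def in_period(utts):
        # A conversation counts as in the period if it starts inside it.
        first = min(u["timestamp"] for u in utts)
        return (start is None or first >= start) and (end is None or first < end)

    sampled = []
    for sub in top_subs:
        # Step 2: keep conversations with at least N utterances in the
        # time period, then sample up to Y of them.
        eligible = sorted(
            cid for cid, utts in by_sub[sub].items()
            if len(utts) >= min_utterances and in_period(utts)
        )
        chosen = rng.sample(eligible, min(convos_per_subreddit, len(eligible)))
        # Step 3: pool the sampled conversations into one collection.
        for cid in chosen:
            sampled.extend(by_sub[sub][cid])
    return sampled
```

The returned list could then be converted into ConvoKit `Utterance` objects and passed to `Corpus(utterances=...)`; that conversion is left out here because it depends on how the raw Reddit data is stored.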

reddit-corpus-small did this with X=100, Y=100 and N=100. One peculiarity is that it defines conversations as starting from a top-level comment, whereas in the subreddit corpora themselves, the conversation starts from the Reddit post.
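To illustrate the difference in conversation roots, here is a minimal sketch of re-rooting a post-level thread at its top-level comments. It again uses plain dicts with hypothetical field names (`id`, `reply_to`, `conversation_id`), assumes the post is the only utterance whose `reply_to` is `None`, and assumes the new roots should get `reply_to = None`; it is not taken from the ConvoKit codebase.

```python
def reroot_at_top_level_comments(utterances):
    """Split post-rooted threads into conversations rooted at top-level comments.

    Each utterance is a dict with 'id', 'reply_to' and 'conversation_id'.
    The post itself (reply_to is None) is dropped from the output.
    """
    parent = {u["id"]: u["reply_to"] for u in utterances}

    def thread_root(utt_id):
        # Walk up the reply chain until the parent is the post, i.e. until
        # the parent's own reply_to is None (or the parent is missing).
        while parent.get(parent[utt_id]) is not None:
            utt_id = parent[utt_id]
        return utt_id

    rerooted = []
    for utt in utterances:
        if utt["reply_to"] is None:
            continue  # skip the Reddit post
        root = thread_root(utt["id"])
        new_utt = dict(utt, conversation_id=root)
        if root == utt["id"]:
            # The top-level comment becomes the start of its conversation.
            new_utt["reply_to"] = None
        rerooted.append(new_utt)
    return rerooted
```

With this scheme, a post that has two top-level comments yields two separate conversations, each identified by the id of its top-level comment.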