CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Extract descriptions from Reddit posts #144

Closed: vr25 closed this issue 2 years ago

vr25 commented 2 years ago

Hi,

Thank you for making the Reddit (subreddits) datasets available. These contain titles and replies (comments), but I couldn't find the descriptions of the titles. Is there any way to extract their descriptions as well and include them in the existing ConvoKit Reddit dataset?

Thanks!

calebchiam commented 2 years ago

The descriptions are already included in these datasets. Reddit corpora have two types of utterances: post utterances and comment utterances. The latter corresponds to user replies. The former corresponds to the text of the post itself.

vr25 commented 2 years ago

Thanks for your quick reply. I downloaded the personalfinance corpus from this link. It has the following files: [image]

However, I can't find any post utterances (the descriptions under the titles) in utterances.jsonl. The file is uploaded here.

calebchiam commented 2 years ago

Hmm, how did you determine that there are no post utterances?

[image]

I opened the utterances.jsonl you sent. All the post utterances are located at the start of the file. You can verify for yourself that these utterances contain the text of the Reddit post. (Though if you check the permalink URLs, most of the posts have probably been removed or deleted by now, given how old they are.)
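As a rough way to check this without scrolling through the file, the sketch below splits the lines of a utterances.jsonl file into post utterances and comment utterances. It assumes each line is a JSON object with an "id", a "text", and a parent pointer stored under "reply_to" (or "reply-to", depending on the corpus version), and that a post is an utterance whose parent pointer is null. The sample records are made up for illustration and are not taken from the personalfinance corpus.

```python
import json
from typing import Iterable, List, Tuple


def split_posts_and_comments(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Separate post utterances (no parent) from comment utterances."""
    posts, comments = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt = json.loads(line)
        # Assumption: the parent pointer is stored as "reply_to" or "reply-to".
        parent = utt.get("reply_to", utt.get("reply-to"))
        if parent is None:
            posts.append(utt)      # top-level post: replies to nothing
        else:
            comments.append(utt)   # comment: replies to a post or another comment
    return posts, comments


# Made-up sample lines standing in for the contents of utterances.jsonl.
sample_lines = [
    json.dumps({"id": "p1", "reply_to": None, "text": "Body of the post"}),
    json.dumps({"id": "c1", "reply_to": "p1", "text": "A reply to the post"}),
    json.dumps({"id": "c2", "reply_to": "c1", "text": "A reply to the reply"}),
]

posts, comments = split_posts_and_comments(sample_lines)
print(len(posts), "post(s),", len(comments), "comment(s)")
```

To run it on the real file, pass an open file handle instead of the sample list, for example `split_posts_and_comments(open("utterances.jsonl", encoding="utf-8"))`.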

vr25 commented 2 years ago

Ah, I see, that helps! Thanks a lot for the clarification. And how do I find nested comments? Would that be multiple "reply_to"?
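For reference, each utterance carries a single parent pointer rather than several, so nesting shows up as a chain: a nested comment is one whose parent is itself a comment. The sketch below rebuilds that tree from the parent pointers and computes each utterance's depth. The field names ("id", "reply_to" or "reply-to") are assumptions about the file layout, and the sample thread is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_reply_tree(utterances: List[dict]) -> Dict[Optional[str], List[str]]:
    """Map each parent id to the ids of the utterances that reply to it."""
    children = defaultdict(list)
    for utt in utterances:
        # Assumption: the parent pointer is stored as "reply_to" or "reply-to".
        parent = utt.get("reply_to", utt.get("reply-to"))
        children[parent].append(utt["id"])
    return children


def compute_depths(children: Dict[Optional[str], List[str]], root_id: str) -> Dict[str, int]:
    """Depth 0 is the post, 1 is a direct reply, 2 or more is a nested comment."""
    depth = {root_id: 0}
    stack = [root_id]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            depth[child] = depth[node] + 1
            stack.append(child)
    return depth


# Made-up thread: c2 is nested under c1, while c1 and c3 reply to the post.
thread = [
    {"id": "p1", "reply_to": None},
    {"id": "c1", "reply_to": "p1"},
    {"id": "c2", "reply_to": "c1"},
    {"id": "c3", "reply_to": "p1"},
]

children = build_reply_tree(thread)
depth_by_id = compute_depths(children, "p1")
nested = sorted(uid for uid, d in depth_by_id.items() if d >= 2)
print(depth_by_id, nested)
```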

vr25 commented 2 years ago

Nvm, I figured it out. So, I will close this issue. Thanks a lot for your help!

calebchiam commented 2 years ago

No problem. I highly recommend using our ConvoKit package to load the corpora. It provides an easy-to-use Conversation abstraction (in this case, each Conversation corresponds to a Reddit thread) that has the list of post + comments in the thread, with methods for visualization and traversal.
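A minimal sketch of that workflow might look like the following. It assumes the `convokit` package is installed and that the corpus is available to `download` under the name "subreddit-personalfinance"; the helper names `load_subreddit` and `describe_thread` are hypothetical and exist only for this illustration. The helper treats the utterance whose `reply_to` is `None` as the post and counts everything else as comments.

```python
def load_subreddit(name: str):
    """Download (if needed) and load a subreddit corpus with ConvoKit."""
    # Imported lazily so describe_thread can be used without ConvoKit installed.
    from convokit import Corpus, download

    # Assumption: subreddit corpora are named "subreddit-<name>".
    return Corpus(filename=download(f"subreddit-{name}"))


def describe_thread(convo) -> dict:
    """Summarize one Conversation: the post text and the number of comments."""
    post = None
    num_comments = 0
    for utt in convo.iter_utterances():
        if utt.reply_to is None:
            post = utt            # the post is the utterance with no parent
        else:
            num_comments += 1     # everything else is a comment
    return {
        "thread_id": convo.id,
        "post_text": post.text if post is not None else None,
        "num_comments": num_comments,
    }


# Example usage (downloads the corpus, so it is left commented out):
# corpus = load_subreddit("personalfinance")
# convo = next(corpus.iter_conversations())
# print(describe_thread(convo))
# convo.print_conversation_structure()  # indented view of the reply tree
```

Working through the Conversation object avoids parsing utterances.jsonl by hand, at the cost of loading the whole corpus into memory.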

vr25 commented 2 years ago

Alright @calebchiam! Thank you again for your suggestion!