vr25 closed this issue 2 years ago
The descriptions are already included in these datasets. Reddit corpora have two types of utterances: post utterances and comment utterances. The former contain the text of the post itself; the latter are user replies.
Hmm, how did you determine that there are no post utterances?
I opened the utterances.jsonl you sent. All the post utterances are located at the start of the file. You can verify for yourself that these utterances contain the text of the Reddit post. (Though if you check the permalink URL, most of the posts have probably been removed or deleted since then, given how old they are.)
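For anyone who wants to check this without reading the file by eye, here is a minimal sketch using only the standard library. It assumes each line of utterances.jsonl is a JSON object with "id", "reply_to", and "text" fields, and that a post utterance is one whose "reply_to" is null. Both are assumptions about the file layout, and the sample lines below are made up for illustration.

```python
import json

# Made-up sample lines in the style of utterances.jsonl.
# The field names ("id", "reply_to", "text") are assumptions.
sample_lines = [
    '{"id": "p1", "reply_to": null, "text": "Post body (the description)."}',
    '{"id": "c1", "reply_to": "p1", "text": "A top-level comment."}',
    '{"id": "c2", "reply_to": "c1", "text": "A reply to that comment."}',
]

def split_posts_and_comments(lines):
    """Separate post utterances (no parent) from comment utterances."""
    posts, comments = [], []
    for line in lines:
        utt = json.loads(line)
        # Assumption: a post replies to nothing, so its reply_to is null.
        if utt.get("reply_to") is None:
            posts.append(utt)
        else:
            comments.append(utt)
    return posts, comments

posts, comments = split_posts_and_comments(sample_lines)
for post in posts:
    print(post["id"], "->", post["text"])
```

To run it on a real file, pass the open file object instead of `sample_lines`, since iterating over a file yields one line at a time.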
Ah, I see, that helps! Thanks a lot for the clarification. And how do I find nested comments? Would that be multiple "reply_to"?
Nvm, I figured it out. So, I will close this issue. Thanks a lot for your help!
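For readers with the same question about nested comments: a rough sketch of one way to recover the nesting, assuming each utterance carries a single "reply_to" that points at its parent's "id" (null for the post itself). The field names and the sample records are illustrative assumptions, not taken from the actual corpus.

```python
from collections import defaultdict

# Made-up utterances; "id" and "reply_to" are assumed field names.
utterances = [
    {"id": "p1", "reply_to": None},   # the post
    {"id": "c1", "reply_to": "p1"},   # top-level comment
    {"id": "c2", "reply_to": "c1"},   # nested reply to c1
    {"id": "c3", "reply_to": "p1"},   # another top-level comment
]

def build_children(utts):
    """Map each parent id to the ids of its direct replies."""
    children = defaultdict(list)
    for u in utts:
        if u["reply_to"] is not None:
            children[u["reply_to"]].append(u["id"])
    return children

def thread_lines(root_id, children, depth=0):
    """Return an indented outline of the thread, depth-first."""
    lines = ["    " * depth + root_id]
    for child in children.get(root_id, []):
        lines.extend(thread_lines(child, children, depth + 1))
    return lines

children = build_children(utterances)
outline = thread_lines("p1", children)
print("\n".join(outline))
```

A nested comment is then just a comment whose parent is another comment rather than the post, so following the "reply_to" chain upward gives its depth.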
No problem. I highly recommend using our ConvoKit package to load the corpora. It provides an easy-to-use Conversation abstraction (in this case, each Conversation corresponds to a Reddit thread) that has the list of post + comments in the thread, with methods for visualization and traversal.
Alright, @calebchiam! Thank you again for your suggestion!
Hi,
Thank you for making the Reddit (subreddits) datasets available. These contain titles and replies (comments), but I couldn't find the descriptions of the titles. Is there any way to extract their descriptions as well and include them in the existing ConvoKit Reddit dataset?
Thanks!