CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Usernames seem to all be the same coming out of pairwise_exchanges #20

Closed: SamanthaClarke1 closed this issue 5 years ago

SamanthaClarke1 commented 5 years ago

Alright, so, here's my code.

import os

import convokit

corpus = convokit.Corpus(filename=os.path.join(os.getcwd(), 'datasets', 'redditconvos.json'))
exchanges = corpus.pairwise_exchanges(user_names_only=True)
print('loaded', len(exchanges), 'exchanges')

for i in exchanges:
    exchanges[i] = [e for e in exchanges[i] if e.text != "[deleted]"]  # filter out [deleted]'s
    tlen = len(exchanges[i])
    if tlen > 1:
        # print every remaining utterance in this exchange as "user: text"
        for j in range(tlen):
            print(str(exchanges[i][j].user) + ": " + exchanges[i][j].text)

    print("><><><><><><><")

Now, what I'm kinda expecting is a conversation between different users, right? But what I keep seeing is the same user having a conversation with what appears to be... themselves?

Example:

User([('name', 'Borkz')]): Make paragraphs and maybe ill read that.
User([('name', 'Borkz')]): Decided against reading it.
User([('name', 'Borkz')]): And stop running everything you say through a thesaurus.
User([('name', 'Borkz')]): Nice.
User([('name', 'Borkz')]): &gt; &gt; but It should encapsulate all kinds of drug stories recreational and otherwise

I've tried this with both the Reddit convo corpus and the wiki corpus, but the names within one thread always seem to be the same.

calebchiam commented 5 years ago

Hi, thanks for raising this issue. This has been fixed as of the recent ConvoKit 2.0 release. We also now offer a reddit-corpus-small corpus, which users can more readily use in place of redditconvos.json.

Adapted from your code above, this will now produce what you expect:

import convokit

corpus = convokit.Corpus(filename=convokit.download("reddit-corpus-small"))
exchanges = corpus.pairwise_exchanges(user_names_only=True)
print("Loaded {} exchanges".format(len(exchanges)))

for i in exchanges:
    exchanges[i] = [e for e in exchanges[i] if e.text != "[deleted]"]   # filter out [deleted]'s
    tlen = len(exchanges[i])
    if tlen > 1:
        for j in range(tlen):
            print()
            print(str(exchanges[i][j].user) + ": " + exchanges[i][j].text)
            print()
        print("><><><><><><><")

Output:


User([('name', 'rheinl')]): “Hi Kpop girl can I send you a pic of g-dragon?”

“Uhhh ok” (what the hell? this guy is so weird)

“Here you go” *sends pic of g-dragon with no makeup*

“Uhh thanks I guess” (Jesus I hope he never talks to me again)

User([('name', 'rheinl')]): “Hi Kpop girl can I send you a pic of g-dragon”

“Woohoo I can’t wait! Please send it right away!”

“Here you go” *sends a pic of g-dragon with no make up*

“Ugh! No you spoilt my day! Hehe!”

><><><><><><><

User([('name', 'belmont_lay')]): I'm sorry if that's how your interactions with friends go 😥

User([('name', 'belmont_lay')]): I'm sorry if that's how your interactions with friends go 😥

But that's not how any of my whatsapps with my friends happened.

><><><><><><><

Please do let us know if you encounter any other issues. Thank you!
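
A note for anyone still puzzled that every line within one printed group shares a speaker, as in the output above. The sketch below is only an illustration and does not use ConvoKit itself. It assumes that pairwise_exchanges keys its result by directed (speaker, target) pairs, so each list holds only what one user said in reply to another. The Utt class and the directed_exchanges and two_way helpers are hypothetical names made up for this example. Under that assumption, a back-and-forth between two users has to be rebuilt by merging the two directions:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utt:
    """Hypothetical stand-in for an utterance; not ConvoKit's class."""
    id: str
    speaker: str
    text: str
    reply_to: Optional[str] = None


def directed_exchanges(utts):
    """Group utterances by (speaker, speaker being replied to)."""
    by_id = {u.id: u for u in utts}
    groups = defaultdict(list)
    for u in utts:
        parent = by_id.get(u.reply_to)
        if parent is not None:  # skip root posts and dangling replies
            groups[(u.speaker, parent.speaker)].append(u)
    return dict(groups)


def two_way(groups, a, b, order):
    """Merge both directions of a pair and restore the original order."""
    merged = groups.get((a, b), []) + groups.get((b, a), [])
    return sorted(merged, key=lambda u: order[u.id])


utts = [
    Utt("1", "alice", "Root post"),
    Utt("2", "bob", "First reply", reply_to="1"),
    Utt("3", "alice", "Answer to bob", reply_to="2"),
    Utt("4", "bob", "Second reply", reply_to="3"),
]

groups = directed_exchanges(utts)
order = {u.id: n for n, u in enumerate(utts)}
thread = two_way(groups, "alice", "bob", order)

for u in thread:
    print(u.speaker + ": " + u.text)
```

Here groups[("bob", "alice")] contains only bob's utterances, which matches the single-speaker groups seen above, while thread alternates between the two users.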