Usernames seem to all be the same coming out of pairwise_exchanges

CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.

MIT License

550 stars 125 forks

corpus = convokit.Corpus(filename=os.path.join(os.getcwd(), 'datasets', 'redditconvos.json'))
exchanges = corpus.pairwise_exchanges(user_names_only=True)
print('loaded ', len(exchanges) ,' exchanges')
for i in exchanges:
    exchanges[i] = [e for e in exchanges[i] if e.text != "[deleted]"] # filter out [deleted]'s
    tlen = len(exchanges[i])
    if(tlen > 1):
        for j in range(tlen):
            print(str(exchanges[i][j].user) + ": " + exchanges[i][j].text)
        print("><><><><><><><")

User([('name', 'Borkz')]): Make paragraphs and maybe ill read that.
User([('name', 'Borkz')]): Decided against reading it.
User([('name', 'Borkz')]): And stop running everything you say through a thesaurus.
User([('name', 'Borkz')]): Nice.
User([('name', 'Borkz')]): > > but It should encapsulate all kinds of drug stories recreational and otherwise

Hi, thanks for raising this issue. This has been fixed as of the recent ConvoKit 2.0 release. We also now provide a reddit-corpus-small corpus in place of redditconvos.json, which is easier for users to download and work with.

Adapted from your code above, this will now produce what you expect:

import convokit

corpus = convokit.Corpus(filename=convokit.download("reddit-corpus-small"))
exchanges = corpus.pairwise_exchanges(user_names_only=True)
print("Loaded {} exchanges".format(len(exchanges)))

for i in exchanges:
    exchanges[i] = [e for e in exchanges[i] if e.text != "[deleted]"]   # filter out [deleted]'s
    tlen = len(exchanges[i])
    if tlen > 1:
        for j in range(tlen):
            print()
            print(str(exchanges[i][j].user) + ": " + exchanges[i][j].text)
            print()
        print("><><><><><><><")

Output:


User([('name', 'rheinl')]): “Hi Kpop girl can I send you a pic of g-dragon?”

“Uhhh ok” (what the hell? this guy is so weird)

“Here you go” *sends pic of g-dragon with no makeup*

“Uhh thanks I guess” (Jesus I hope he never talks to me again)

User([('name', 'rheinl')]): “Hi Kpop girl can I send you a pic of g-dragon”

“Woohoo I can’t wait! Please send it right away!”

“Here you go” *sends a pic of g-dragon with no make up*

“Ugh! No you spoilt my day! Hehe!”

><><><><><><><

User([('name', 'belmont_lay')]): I'm sorry if that's how your interactions with friends go 😥

User([('name', 'belmont_lay')]): I'm sorry if that's how your interactions with friends go 😥

But that's not how any of my whatsapps with my friends happened.

><><><><><><><
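
If you want to check a result like this without reading through the printed text, one option is to look for exchanges in which only a single user name appears. The sketch below is not ConvoKit code: `User` and `Utterance` are hypothetical stand-ins that model only the `.user` and `.text` attributes used in this thread, the `.name` attribute and the tuple keys are assumptions, and `clean_exchanges` and `single_speaker_pairs` are illustrative helper names. It assumes the exchanges are a mapping from a key to a list of utterances, as in the loop above.

```python
from collections import namedtuple

# Hypothetical stand-ins for the user and utterance objects; only the
# attributes used in this thread (.user, .text) plus an assumed .name exist.
User = namedtuple("User", ["name"])
Utterance = namedtuple("Utterance", ["user", "text"])


def clean_exchanges(exchanges):
    """Drop "[deleted]" utterances and keep exchanges with more than one left."""
    cleaned = {}
    for pair, utterances in exchanges.items():
        kept = [u for u in utterances if u.text != "[deleted]"]
        if len(kept) > 1:
            cleaned[pair] = kept
    return cleaned


def single_speaker_pairs(exchanges):
    """Return the keys of exchanges whose utterances all share one user name."""
    return [
        pair
        for pair, utterances in exchanges.items()
        if len({u.user.name for u in utterances}) == 1
    ]


# Toy data shaped like the mapping above (key -> list of utterances).
toy = {
    ("alice", "bob"): [
        Utterance(User("alice"), "Make paragraphs."),
        Utterance(User("bob"), "[deleted]"),
        Utterance(User("bob"), "Fair point."),
    ],
    ("carol", "dave"): [
        Utterance(User("carol"), "Nice."),
        Utterance(User("carol"), "Very nice."),
    ],
    ("erin", "frank"): [
        Utterance(User("erin"), "[deleted]"),
        Utterance(User("frank"), "Hello?"),
    ],
}

cleaned = clean_exchanges(toy)
suspicious = single_speaker_pairs(cleaned)
print(sorted(cleaned))   # keys that survive the "[deleted]" filter
print(suspicious)        # keys where only one user name appears
```

Building a new dictionary rather than reassigning entries in place keeps the original exchanges available for comparison. An exchange flagged by `single_speaker_pairs` is not necessarily wrong, but it is a reasonable place to start looking.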

Please do let us know if you encounter any other issues. Thank you!

Usernames seem to all be the same coming out of pairwise_exchanges #20