CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Merge Two Corpus #70

Closed TaoRuan-Campus closed 4 years ago

TaoRuan-Campus commented 4 years ago

Is there going to be a standard method for merging two corpora when they share the same format?

calebchiam commented 4 years ago

You can use corpus.merge() or corpus.add_utterances(). The latter is more appropriate if you're mainly interested in adding utterances from one corpus into another (and less interested in other non-Utterance metadata). In addition, if you know the utterances in your two corpora are disjoint, you can pass with_checks=False to add_utterances() to make it run faster.

TaoRuan-Campus commented 4 years ago

Thank you for your reply! I am trying your method, but another problem has come up. As we know, different Jupyter notebooks can share data (see https://www.thetopsites.net/article/50952105.shtml). However, when I store a Corpus in one notebook, I am not able to read it in another notebook. Do you happen to know how to solve this problem?

TaoRuan-Campus commented 4 years ago

@calebchiam Thank you again. In the end I decided to use the dump function to share the data. However, when I call add_utterances(), it tells me 'Corpus' object has no attribute 'speaker'. I did set a generic speaker with speaker = Speaker(id="speaker"). Could you please advise on how to solve this?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-823-ec637f12161e> in <module>
      1 # FightinCorpus_Emergency_Reddit
----> 2 FightinCorpus_Emergency_Twitter.add_utterances([FightinCorpus_Emergency_Reddit],with_checks=False)

~/Library/Python/3.7/lib/python/site-packages/convokit/model/corpus.py in add_utterances(self, utterances, warnings, with_checks)
    855             return self.merge(helper_corpus, warnings=warnings)
    856         else:
--> 857             new_speakers = {u.speaker.id: u.speaker for u in utterances}
    858             new_utterances = {u.id: u for u in utterances}
    859             for speaker in new_speakers.values():

~/Library/Python/3.7/lib/python/site-packages/convokit/model/corpus.py in <dictcomp>(.0)
    855             return self.merge(helper_corpus, warnings=warnings)
    856         else:
--> 857             new_speakers = {u.speaker.id: u.speaker for u in utterances}
    858             new_utterances = {u.id: u for u in utterances}
    859             for speaker in new_speakers.values():

AttributeError: 'Corpus' object has no attribute 'speaker'

calebchiam commented 4 years ago

Hi @TaoRuan-Campus, add_utterances() takes in a list of utterances, not a Corpus. You can try passing in list(FightinCorpus_Emergency_Reddit.iter_utterances()). Read our documentation for more details: https://convokit.cornell.edu/documentation/corpus.html
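
To make the cause of the error concrete, here is a minimal sketch that reproduces it without ConvoKit installed. The Speaker, Utterance, and Corpus classes below are hypothetical stand-ins written for illustration (they are not ConvoKit's real classes), and collect_speakers is an assumed helper that only mirrors the shape of the comprehension shown on line 857 of the traceback.

```python
from dataclasses import dataclass, field


# Stand-in classes for illustration only; these are NOT ConvoKit's classes.
@dataclass
class Speaker:
    id: str


@dataclass
class Utterance:
    id: str
    speaker: Speaker
    text: str


@dataclass
class Corpus:
    utterances: list = field(default_factory=list)

    def iter_utterances(self):
        return iter(self.utterances)


def collect_speakers(utterances):
    # Same shape as the dict comprehension on line 857 of the traceback.
    return {u.speaker.id: u.speaker for u in utterances}


generic = Speaker(id="speaker")
reddit = Corpus([Utterance("r1", generic, "hello"), Utterance("r2", generic, "world")])

# Passing [reddit] makes `u` a Corpus, which has no `.speaker` attribute.
try:
    collect_speakers([reddit])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Unpacking the corpus into its utterances gives the comprehension what it expects.
speakers = collect_speakers(list(reddit.iter_utterances()))
```

The failure comes from the type of the list's elements, not from how the speaker was defined, which is why a single generic speaker is enough once actual utterances are passed in.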

TaoRuan-Campus commented 4 years ago

Thanks @calebchiam, I figured it out. The speaker problem still exists, though: to make it work, I seem to have to construct the corpus_speakers dictionary to provide the speaker information instead of using generic_speaker. I might be wrong, but it seems that speaker information is required by add_utterances().

calebchiam commented 4 years ago

That doesn't sound right, actually. You should be able to construct two separate corpora and merge them as is, as long as Speaker information is already present for all of their Utterances. (A generic_speaker should be sufficient.) I'm guessing that something went wrong in the corpus construction process (you might want to inspect your utterances), but otherwise great if it works!

TaoRuan-Campus commented 4 years ago

@calebchiam Thank you, I will check later on. Another question: in the documentation for the Fightin' Words method (https://convokit.cornell.edu/documentation/tutorial.html), it seems that the raw text data are fed into the algorithm. Is that the standard way to compare two corpora? Is preprocessing (removing stop words, stemming, etc.) necessary for it?

calebchiam commented 4 years ago

To compare two corpora, you'd typically merge them and distinguish one corpus's utterances from the other's using a field in the utterance metadata, e.g. utterance.meta['corpora'] = 'corpora1'. Then you can set the classes during the FightingWords fit step.
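
The tagging step can be sketched as follows. This uses plain stand-in objects with a meta dict rather than real ConvoKit utterances, and the names make_utt, is_class1, and is_class2 are hypothetical, chosen for illustration only.

```python
from types import SimpleNamespace


# Stand-in utterances: any object with a `meta` dict works for this sketch.
def make_utt(utt_id, text):
    return SimpleNamespace(id=utt_id, text=text, meta={})


twitter_utts = [make_utt("t1", "power is out"), make_utt("t2", "stay safe")]
reddit_utts = [make_utt("r1", "any updates on the outage?")]

# Tag each utterance with its origin before the two sets are combined.
for utt in twitter_utts:
    utt.meta["corpora"] = "corpora1"
for utt in reddit_utts:
    utt.meta["corpora"] = "corpora2"

merged = twitter_utts + reddit_utts


# Predicates that pick out each class from the merged set.
def is_class1(utt):
    return utt.meta["corpora"] == "corpora1"


def is_class2(utt):
    return utt.meta["corpora"] == "corpora2"


class1_ids = [u.id for u in merged if is_class1(u)]
class2_ids = [u.id for u in merged if is_class2(u)]
```

With a real merged Corpus, predicates of this kind are what you would hand to the FightingWords fit step; check the documentation for the exact parameter names.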

Preprocessing is up to you, though the FightingWords algorithm is designed so that stopwords are less likely to show up among the most salient terms (you may read the paper for more details).
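
As a rough illustration of why stopwords tend to drop out, here is a sketch of the z-scored log-odds ratio with a Dirichlet prior described in the Fightin' Words paper. It is not ConvoKit's implementation: the function name log_odds_z_scores, the uniform prior of 0.1, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, prior=0.1):
    """Z-scored log-odds ratio per word, using a uniform Dirichlet prior (assumed)."""
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    alpha_0 = prior * len(vocab)  # total prior mass over the vocabulary
    scores = {}
    for word in vocab:
        y_a = counts_a.get(word, 0) + prior  # smoothed count in corpus A
        y_b = counts_b.get(word, 0) + prior  # smoothed count in corpus B
        log_odds_a = math.log(y_a / (n_a + alpha_0 - y_a))
        log_odds_b = math.log(y_b / (n_b + alpha_0 - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of the difference
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores


# Toy data: "the" and "and" occur equally often on both sides.
corpus_a = Counter("the storm hit the coast and the power failed".split())
corpus_b = Counter("the game ended and the crowd left the field".split())
scores = log_odds_z_scores(corpus_a, corpus_b)
```

A word used at a similar rate in both classes gets a score near zero however frequent it is, so common function words rarely rank among the top terms even without a stopword list.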

TaoRuan-Campus commented 4 years ago

Thanks!