Closed TaoRuan-Campus closed 4 years ago
You can use corpus.merge() or corpus.add_utterances(). The latter is more appropriate if you're mainly interested in adding utterances from one corpus into another (and less interested in other non-Utterance metadata). In addition, if you know the utterances in your two corpora are disjoint, you can pass with_checks=False to add_utterances() to have it run faster.
Thank you for your reply! I am trying to use your method, but another problem has come up. As we know, different Jupyter notebooks can share data (see https://www.thetopsites.net/article/50952105.shtml). However, when I store a Corpus in one notebook, I am not able to read it in another notebook. Do you happen to know how to solve this problem?
@calebchiam Thank you again. In the end I decided to use the dump function to share the data. However, when I call add_utterances(), it tells me that the 'Corpus' object has no attribute 'speaker'. I did set the speaker to a generic_speaker with speaker = Speaker(id="speaker"). Could you please provide some advice on how to solve this?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-823-ec637f12161e> in <module>
1 # FightinCorpus_Emergency_Reddit
----> 2 FightinCorpus_Emergency_Twitter.add_utterances([FightinCorpus_Emergency_Reddit],with_checks=False)
~/Library/Python/3.7/lib/python/site-packages/convokit/model/corpus.py in add_utterances(self, utterances, warnings, with_checks)
855 return self.merge(helper_corpus, warnings=warnings)
856 else:
--> 857 new_speakers = {u.speaker.id: u.speaker for u in utterances}
858 new_utterances = {u.id: u for u in utterances}
859 for speaker in new_speakers.values():
~/Library/Python/3.7/lib/python/site-packages/convokit/model/corpus.py in <dictcomp>(.0)
855 return self.merge(helper_corpus, warnings=warnings)
856 else:
--> 857 new_speakers = {u.speaker.id: u.speaker for u in utterances}
858 new_utterances = {u.id: u for u in utterances}
859 for speaker in new_speakers.values():
AttributeError: 'Corpus' object has no attribute 'speaker'
Hi @TaoRuan-Campus, add_utterances() takes in a list of utterances, not a Corpus. You can try passing in list(FightinCorpus_Emergency_Reddit.iter_utterances()). Read our documentation for more details: https://convokit.cornell.edu/documentation/corpus.html
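To make the failure mode in the traceback concrete, here is a minimal sketch that uses stand-in classes rather than ConvoKit itself. Speaker, Utterance, Corpus, and collect_speakers below are hypothetical simplifications written only for this illustration; the one thing taken from the traceback is the shape of the dict comprehension on line 857.

```python
# Stand-in classes for illustration only; these are NOT ConvoKit's real classes.
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Speaker:
    id: str


@dataclass
class Utterance:
    id: str
    speaker: Speaker
    text: str


@dataclass
class Corpus:
    utterances: List[Utterance] = field(default_factory=list)

    def iter_utterances(self) -> Iterator[Utterance]:
        return iter(self.utterances)


def collect_speakers(utterances: Iterable[Utterance]) -> Dict[str, Speaker]:
    # Same shape as the comprehension shown in the traceback:
    # every item in the iterable is expected to have a .speaker attribute.
    return {u.speaker.id: u.speaker for u in utterances}


generic_speaker = Speaker(id="speaker")
reddit = Corpus([
    Utterance("r1", generic_speaker, "hello"),
    Utterance("r2", generic_speaker, "world"),
])

# Wrong: a list that holds the corpus itself, so each "utterance" is a Corpus.
try:
    collect_speakers([reddit])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Right: a list of the corpus's utterances.
speakers = collect_speakers(list(reddit.iter_utterances()))
```

In this sketch the first call fails because the loop variable is the Corpus object, which has no speaker attribute, while the second call sees real utterances and a single generic speaker is enough.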
Thanks @calebchiam
I figured it out. Actually, the speaker problem still exists, and it seems I have to construct the corpus_speakers dictionary to provide the speaker information, instead of using generic_speaker, to make it work. I might not be correct, but it seems that speaker information is required by add_utterances().
That doesn't sound right, actually. You should be able to construct two separate corpora and merge them as is, as long as Speaker information is already present for all their Utterances. (A generic_speaker should be sufficient.) I'm guessing that something might have gone wrong in the corpus construction process (you might want to inspect your utterances), but otherwise, great if it works!
@calebchiam Thank you, I will check later on. Another question: in the documentation for the Fightin' Words method (https://convokit.cornell.edu/documentation/tutorial.html), it seems that the raw text data are fed into the algorithm. Is this the standard way to compare two corpora? Is preprocessing (removing stop words, stemming, etc.) necessary for it?
To compare two corpora, you'd typically merge them and distinguish one corpus's utterances from the other's using a field in the utterance metadata, e.g. utterance.meta['corpora'] = 'corpora1'. Then you can set the classes during the FightingWords fit step.
Preprocessing is up to you, though the FightingWords algorithm is designed so that stopwords are less likely to show up among the most salient terms (you may read the paper for more details).
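As a rough illustration of why stopwords tend to drop out, here is a minimal sketch of the z-scored log-odds ratio with a Dirichlet prior that the Fightin' Words paper describes. It is not ConvoKit's implementation: the function name log_odds_z_scores, the whitespace tokenizer, the flat prior of 0.1, and the toy utterances (plain dicts with a 'corpora' metadata field) are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def log_odds_z_scores(class1_texts: List[str],
                      class2_texts: List[str],
                      prior: float = 0.1) -> Dict[str, float]:
    """Z-scored log-odds ratio per word, with a flat Dirichlet prior.

    Positive scores lean toward class 1, negative scores toward class 2.
    Assumes the combined vocabulary has more than one word.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    counts1 = Counter(w for t in class1_texts for w in t.lower().split())
    counts2 = Counter(w for t in class2_texts for w in t.lower().split())
    vocab = set(counts1) | set(counts2)
    n1 = sum(counts1.values())
    n2 = sum(counts2.values())
    a0 = prior * len(vocab)  # total prior mass over the vocabulary

    scores = {}
    for w in vocab:
        y1 = counts1[w] + prior
        y2 = counts2[w] + prior
        # Difference of the smoothed log-odds of the word in each class.
        delta = math.log(y1 / (n1 + a0 - y1)) - math.log(y2 / (n2 + a0 - y2))
        # Approximate variance of that difference.
        variance = 1.0 / y1 + 1.0 / y2
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy "merged corpus": each utterance is tagged with its source in meta.
utterances = [
    {"text": "the flood is rising near the bridge", "meta": {"corpora": "twitter"}},
    {"text": "the flood closed the road", "meta": {"corpora": "twitter"}},
    {"text": "the shelter is open near the school", "meta": {"corpora": "reddit"}},
    {"text": "the shelter has the supplies", "meta": {"corpora": "reddit"}},
]

# Split the merged data into two classes using the metadata field.
twitter_texts = [u["text"] for u in utterances if u["meta"]["corpora"] == "twitter"]
reddit_texts = [u["text"] for u in utterances if u["meta"]["corpora"] == "reddit"]

z = log_odds_z_scores(twitter_texts, reddit_texts)
top_twitter = max(z, key=z.get)
top_reddit = min(z, key=z.get)
```

In this toy data, "the" occurs equally often on both sides, so its score is zero, while "flood" and "shelter" get the largest positive and negative scores. A word that both classes use at the same rate scores near zero regardless of how frequent it is, which is the intuition behind leaving stopwords in.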
Thanks!
I am wondering whether there is going to be a standard method for merging two corpora if their format is the same?