CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Compare politeness histograms across different corpora #60

Closed sophieball closed 4 years ago

sophieball commented 4 years ago

I have two different corpora, one of which has longer utterances than the other. I want to compare the politeness histograms shown in this example, but since they show counts per utterance, the corpus with longer utterances will have higher bars. What's the best way to mitigate this problem?

cristiandnm commented 4 years ago

As a general note, proper length normalization is a broader NLP problem that is harder than it might sound (I am not aware of a perfect solution).

Depending on your use case, here are a couple of imperfect solutions you might consider if the average sentence length is comparable between the two datasets (i.e., the difference in utterance length comes from the fact that in one dataset some utterances have more sentences, not from the fact that the sentences are longer):

a) Subsample one sentence per utterance and use those sentences to extract politeness features (and generate the plots). You can do this easily in ConvoKit this way: create an utterance metadata field that contains one (random) sentence of each utterance (call that "random_sentence"). Run the TextParser transformer with input_field="random_sentence". Run the Politeness transformer as before.

b) Macroaverage counts of politeness strategies at the sentence level. This would involve writing a custom summarize function, but should not be that hard.
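
To make options (a) and (b) concrete, here is a minimal sketch in plain Python. It does not use ConvoKit at all: the sentence splitter is a naive regex, and has_marker is a hypothetical stand-in for whatever politeness-strategy detector you use. The function names are my own, and the per-sentence average is only one reading of what (b) describes.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pick_random_sentence(text, rng):
    """Option (a): keep a single randomly chosen sentence from an utterance."""
    sentences = split_sentences(text)
    return rng.choice(sentences) if sentences else ""


def sentence_level_rate(utterances, has_marker):
    """Option (b), one reading: fraction of sentences in which the marker occurs.

    has_marker is a hypothetical callable (sentence -> bool) standing in for
    a real politeness-strategy detector.
    """
    sentences = [s for u in utterances for s in split_sentences(u)]
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if has_marker(s)) / len(sentences)
```

In a ConvoKit workflow, the output of pick_random_sentence would be what you store in the "random_sentence" metadata field before parsing. Because both approaches count per sentence rather than per utterance, utterances with many sentences no longer inflate the bars.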

If the difference in utterance lengths comes from sentence length, not from the number of sentences within each utterance, then you might end up comparing apples and oranges. One (again, imperfect) solution is to match each utterance in one corpus with an utterance in the other corpus that has a similar length, and compare only the matched utterances. In ConvoKit you could do this by merging the corpora and then using the Pairer transformer.
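
The matching idea can be sketched independently of ConvoKit as follows. This is a greedy illustration in plain Python, not the Pairer transformer itself; measuring length in whitespace-separated words and the tolerance parameter are both assumptions made for the example.

```python
def word_count(text):
    """Length in whitespace-separated tokens (an assumption; other measures work too)."""
    return len(text.split())


def match_by_length(corpus_a, corpus_b, tolerance=2):
    """Greedily pair each utterance in corpus_a with an unused utterance in
    corpus_b whose word count differs by at most `tolerance`.

    Utterances without a close enough partner are left unmatched.
    """
    unused = list(corpus_b)
    pairs = []
    for a in corpus_a:
        if not unused:
            break
        # Closest remaining candidate by absolute difference in length.
        best = min(unused, key=lambda b: abs(word_count(b) - word_count(a)))
        if abs(word_count(best) - word_count(a)) <= tolerance:
            pairs.append((a, best))
            unused.remove(best)
    return pairs
```

Only the matched pairs would then go into the comparison, which discards data and depends on the order in which utterances are visited, so the result is still imperfect in the sense described above.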

Of course, you will have to make sure to keep in mind the imperfections of the solution you choose when you interpret your results. Good luck.

PS: Consider contributing the corpora you are working with to ConvoKit (since you already formatted them).