I have two different corpora, one of which has longer utterances than the other. If I want to compare the politeness histograms shown in this example, since they show counts per utterance, the corpus with longer utterances will have higher bars. What's the best way to mitigate this problem?
As a general note, proper length normalization is a broader NLP problem that is harder than it might sound (I am not aware of a perfect solution).
Depending on your use case, here are a couple of imperfect solutions you might consider if the average sentence length is comparable between the two datasets (i.e., the difference in utterance length comes from the fact that in one dataset some utterances have more sentences, not from the sentences being longer):
a) Subsample one sentence per utterance and use those to extract politeness features (and generate the plots). You can do this easily in ConvoKit as follows: create an utterance metadata field that contains one (random) sentence of each utterance (call it "random_sentence"), run the TextParser transformer with input_field="random_sentence", then run the politeness transformer as before (see the first sketch after this list).
b) Macroaverage counts of politeness strategies at the sentence level. This would involve writing a custom summarize function, but it should not be that hard (see the second sketch below).
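Here is a minimal sketch of (a); the period-based sentence split is naive (a real sentence tokenizer such as nltk.sent_tokenize would be better), and the field names just follow the suggestion above:

```python
import random
from convokit import Corpus, TextParser, PolitenessStrategies

# corpus = Corpus(filename=...)  # load your corpus as usual

# 1. Store one randomly chosen sentence per utterance in a metadata field.
for utt in corpus.iter_utterances():
    # Naive split on periods; swap in a real sentence tokenizer
    # for anything beyond a quick check.
    sentences = [s.strip() for s in utt.text.split(".") if s.strip()]
    utt.meta["random_sentence"] = random.choice(sentences) if sentences else utt.text

# 2. Parse only the sampled sentence.
parser = TextParser(input_field="random_sentence", output_field="parsed")
corpus = parser.transform(corpus)

# 3. Extract politeness strategies from the parsed sentence as before.
ps = PolitenessStrategies()
corpus = ps.transform(corpus)
```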
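And a rough sketch of the custom summarize function for (b), assuming the politeness transformer has populated a "politeness_strategies" metadata field on each utterance (its default output) and that the TextParser output in "parsed" contains one entry per sentence:

```python
from collections import defaultdict

def summarize_per_sentence(corpus, strategy_field="politeness_strategies",
                           parse_field="parsed"):
    """Macroaverage politeness strategy counts at the sentence level:
    each utterance's counts are divided by its number of sentences,
    then averaged across utterances."""
    totals = defaultdict(float)
    n_utts = 0
    for utt in corpus.iter_utterances():
        strategies = utt.meta.get(strategy_field)
        parsed = utt.meta.get(parse_field)
        if not strategies or not parsed:
            continue  # skip utterances missing either field
        n_sentences = max(len(parsed), 1)  # one parse entry per sentence
        for strategy, count in strategies.items():
            totals[strategy] += count / n_sentences
        n_utts += 1
    return {s: total / n_utts for s, total in totals.items()} if n_utts else {}
```

You can then plot these per-strategy averages for each corpus in place of the count-per-utterance histograms.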
If the difference in utterance lengths comes from sentence length, not from the number of sentences within each utterance, then you risk comparing apples and oranges. One (again, imperfect) solution is to match each utterance in one corpus with an utterance of similar length in the other corpus. In ConvoKit you could do this by merging the corpora and then using the Pairer transformer, sketched below.
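A sketch of this approach; the "source" metadata field and the 10-token length buckets are illustrative choices of mine, not part of the library:

```python
from convokit import Pairer

# Tag each utterance with its source corpus before merging, so the two
# can still be told apart afterwards.
for utt in corpus_a.iter_utterances():
    utt.meta["source"] = "corpus_a"
for utt in corpus_b.iter_utterances():
    utt.meta["source"] = "corpus_b"
merged = corpus_a.merge(corpus_b)

# Pair utterances that fall into the same (coarse) length bucket;
# the 10-token bucket width is arbitrary and worth tuning.
def length_bucket(utt):
    return str(len(utt.text.split()) // 10)

pairer = Pairer(obj_type="utterance",
                pairing_func=length_bucket,
                pos_label_func=lambda utt: utt.meta["source"] == "corpus_a",
                neg_label_func=lambda utt: utt.meta["source"] == "corpus_b",
                pair_mode="maximize")
merged = pairer.transform(merged)
# Paired utterances now carry pair id / label metadata that you can use
# to restrict the politeness comparison to length-matched pairs.
```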
Of course, you will have to keep the imperfections of whichever solution you choose in mind when you interpret your results. Good luck.
PS: Consider contributing the corpora you are working with to ConvoKit (since you already formatted them).