DeveloperLiberationFront / AffectAnalysisToolEvaluation

SEmotion_18 paper on evaluating the reliability of sentiment and politeness analysis tools

Complaints against the Politeness tool #11

Closed nasifimtiazohi closed 6 years ago

nasifimtiazohi commented 6 years ago

R1

3 - In Section 3.4 the authors state that "It was trained and tested on different corpora and hence has the claim to be domain independent. It returns a politeness score between 0 to 1 for each texts with 1 being the most polite." This is not correct, since the tool returns a polite/impolite label along with a confidence level for that label, ranging from 0 to 1.

5 - The politeness tool is trained to classify "question and answer" texts containing exactly two sentences, not general text. Providing text not in this form may yield biased results. The tool also provides a utility to check whether a text is in the "question and answer" form, which can be used to filter out comments not in that form (a rough filtering sketch appears below).

6 - The politeness tool was trained on 10k requests rated by about 400 carefully selected raters on Amazon Mechanical Turk, and each request was annotated by 5 raters. In this study the authors rated 598 comments (not requests), and each comment was rated by 2 raters. To me, this leads to unreliable results, even though I do agree with the need for a reliable evaluation of the politeness tool in SE.
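As a rough illustration of the filtering suggested in point 5, here is a minimal Python sketch; `looks_like_request` is a hypothetical stand-in for the tool's own question-and-answer-form utility, not its actual API:

```python
import re

def looks_like_request(text: str) -> bool:
    """Hypothetical stand-in for the tool's Q&A-form check:
    treat a 'request' as exactly two sentences, at least one
    of them a question. Not the tool's actual utility."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences) == 2 and "?" in text

comments = [
    "Why does this test fail? Please take a look.",  # request-like
    "LGTM.",                                         # not a request
]
requests_only = [c for c in comments if looks_like_request(c)]
```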

nasifimtiazohi commented 6 years ago

3 - I remember having some confusion in this regard. However, the tool definitely returns only a score (not a label) between 0 and 1, held in a variable named prob[polite]. There is also prob[impolite], which is simply 1 - prob[polite]. The original paper also states:

For new requests, we use class probability estimates obtained by fitting a logistic regression model to the output of the SVM (Witten and Frank, 2005) as predicted politeness scores (with values between 0 and 1; henceforth politeness, by abuse of language).

But they don't mention a threshold, which is why we had to choose one ourselves.
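For context, the calibration the quoted passage describes (fitting a logistic model to SVM outputs) is standard Platt scaling. A minimal sketch of that general technique with scikit-learn, on synthetic data; this is not the Stanford politeness tool's actual training code:

```python
# Platt scaling: fit a logistic (sigmoid) model to an SVM's
# outputs to obtain class probabilities in [0, 1].
# Sketches the general technique only, on synthetic data;
# not the politeness tool's actual code or corpus.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=500, random_state=0)

svm = LinearSVC()  # raw SVM: decision values, no probabilities
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X, y)

# Analogous to prob[polite]: one score in [0, 1] per input;
# prob[impolite] would simply be 1 minus this.
scores = calibrated.predict_proba(X)[:, 1]
```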

The confusion arises because on their website they give a label for each text with a confidence rating. Some other paper may also describe it that way; I don't clearly remember.

However, Jongeling's paper, which also uses this politeness tool, states:

Given a textual fragment the Stanford politeness API returns a politeness score ranging between 0 (impolite) and 1 (polite) with 0.5 representing the “ideal neutrality”. To discretize the score into polite, neutral and impolite we apply the Stanford politeness API to the seven datasets above. It turns out that the politeness scores of the majority of comments are low: the median score is 0.314, the mean score is 0.361 and the third quartile (Q3) is 0.389. We use the latter value to determine the neutrality range. We say therefore that the comments scoring between 0.389 and 0.611 = 1 − 0.389 are neutral; comments scoring lower than 0.389 are impolite and comments scoring higher than 0.611 are polite.

Note: Jongeling's paper does not manually rate politeness.
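In code, the discretization quoted above comes down to two thresholds derived from the third quartile; a small sketch:

```python
# Jongeling et al.'s discretization, as quoted above:
# Q3 of the observed scores (0.389) defines the neutrality band.
Q3 = 0.389

def discretize(score: float) -> str:
    """Map a politeness score in [0, 1] to a label."""
    if score < Q3:
        return "impolite"
    if score > 1 - Q3:  # 1 - 0.389 = 0.611
        return "polite"
    return "neutral"

assert discretize(0.314) == "impolite"  # their median score
assert discretize(0.500) == "neutral"
assert discretize(0.700) == "polite"
```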

So I'm fairly confident about our usage of the tool in the paper. The original paper is, I would say, a bit confusing in this regard.

nasifimtiazohi commented 6 years ago

5 - This is an obvious threat. However, as the original paper shows, the tool recognizes general patterns of politeness in written text, and it is already in some use in SE research. So the tool is not completely irrelevant.

6 - I don't see why our evaluation setup needs to be aligned with how the tool was trained. Having only two coders is definitely a threat, though.
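For the two-coder threat, agreement is typically quantified with something like Cohen's kappa; a minimal sketch with made-up ratings (not our study's data):

```python
# Cohen's kappa for two coders; the labels below are invented
# examples, not ratings from our study.
from sklearn.metrics import cohen_kappa_score

coder_a = ["polite", "neutral", "impolite", "polite", "neutral"]
coder_b = ["polite", "neutral", "neutral", "polite", "impolite"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement
```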

nasifimtiazohi commented 6 years ago

writing -

https://github.com/DeveloperLiberationFront/AffectAnalysisToolEvaluation/issues