One person coded the whole dataset, while the other two split it between them. Why this design decision? Doesn't it create a risk of "human bias"?
You said that human rating is not reliable. Why, then, do you use those ratings afterwards to evaluate the tools' performance?
If you are only measuring the Cohen's kappa agreement rate, why do you say that humans are not "consistent"? [See the kappa sketch after this list.]
What distinguishes this work from previous sentiment analysis tool evaluation papers?
What future work do you have in mind? [A reviewer had told us to add future work, something we had not cared much about.]
The politeness tool was developed on short texts of a specific kind (change requests made on StackExchange and Wikipedia). Why do you think it is justified to use the tool in the SE domain?
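As a brief aside on the Cohen's kappa question above: kappa measures chance-corrected agreement between two raters on the same items, not whether a single rater is internally consistent. The snippet below is a minimal sketch of how such an agreement rate is typically computed using scikit-learn's cohen_kappa_score; the rater labels are hypothetical and only serve to show the computation.

```python
# Minimal sketch of Cohen's kappa (chance-corrected agreement between two raters).
# The labels below are hypothetical examples, not data from our study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentiment labels assigned by two raters to the same ten texts
rater_a = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "pos", "neg", "neu"]
rater_b = ["pos", "neg", "neu", "neu", "neg", "neu", "pos", "neg", "neg", "neu"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance-level
```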