Kappa score
A statistical measure of inter-rater reliability for categorical variables.
Used for assessing the agreement between raters labeling data for a classification model (inter-tagger agreement, IAA).
Definition
P0: the relative observed agreement among raters
Pe: the hypothetical probability of chance agreement
N: number of observations
k: number of categories
Nki: the number of times the rater i predicted category k
(In binary classification, these counts can be read directly off the 2×2 confusion matrix between the two raters; the resulting formula is given below.)
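For reference, the standard two-rater (Cohen's) kappa written in the notation above, where N_k1 and N_k2 are the counts for rater 1 and rater 2:

```latex
% Cohen's kappa for two raters, in the notation defined above.
\kappa = \frac{P_0 - P_e}{1 - P_e},
\qquad
P_e = \frac{1}{N^2} \sum_{k} N_{k1}\, N_{k2}
```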
Assumptions
It has been explicitly stated that kappa should only be used to measure agreement among human annotators, not between a human and a machine (e.g., a model's predictions). A possible explanation is that it assumes the different raters have similar precision and sensitivity; otherwise the label distributions become skewed. See the more detailed explanations on StackOverflow and in this report.
Cohen's kappa assumes all items are annotated by the same two raters.
Fleiss's kappa only assumes all items are annotated the same number of times (see the sketch after this list).
A dominant tag or sparse tags may skew the results.
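For the multi-rater case, a minimal Fleiss's kappa sketch using statsmodels' inter_rater module; the three-annotator setup and the toy labels are made up for illustration:

```python
# Minimal Fleiss's kappa sketch with statsmodels (toy, made-up annotations).
# Rows = items, columns = 3 annotators; labels: 1 = positive, 0 = negative.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

annotations = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# aggregate_raters turns (items x raters) labels into (items x categories) counts,
# which is the table format fleiss_kappa expects.
table, categories = aggregate_raters(annotations)
print(f"Fleiss's kappa: {fleiss_kappa(table):.2f}")   # ~0.44 for these toy labels
```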
Results interpretation
An acceptable kappa value varies with the context. A good approach is to look at how kappa scores are reported for similar annotation tasks; for example, in POS tagging tasks the score is usually high (>0.8). A commonly used interpretation scale:
≤ 0: no agreement
0.01–0.20: none to slight
0.21–0.40: fair
0.41–0.60: moderate (0.4 is the _acceptable_ cutoff)
0.61–0.80: substantial (0.7 is the _good_ cutoff)
0.81–1.00: almost perfect agreement
How to handle a poor score?
Is it really a poor score? The more categories there are, the more likely the score is to be low.
What is the average score in other similar annotation tasks?
A not-so-good Fleiss's kappa is common, because it usually involves crowdsourced annotators, whose annotation quality can't be guaranteed.
If indeed a poor score, could improving the annotation guideline help?
Poor initial kappa is common.
Pros and Cons
Pros
Compared with simple percentage agreement:
Kappa corrects for chance agreement and is therefore more robust. An extreme example: if two psychiatrists each randomly label 10% of patients as "diseased" and 90% as "healthy", their expected raw agreement is 0.1 × 0.1 + 0.9 × 0.9 = 82%, yet this high agreement means nothing because it is pure chance (a quick simulation is sketched after this list).
(Relatively) Good at multi-class and imbalanced dataset problems.
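To see the chance-agreement correction in action, here is a quick simulation of the two-psychiatrist example above, sketched with scikit-learn's cohen_kappa_score (the rater setup is hypothetical):

```python
# Simulate the chance-agreement example: two raters label patients independently
# at random, 10% "diseased" / 90% "healthy". Raw agreement is high, kappa is ~0.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n = 100_000
rater_1 = rng.choice(["diseased", "healthy"], size=n, p=[0.1, 0.9])
rater_2 = rng.choice(["diseased", "healthy"], size=n, p=[0.1, 0.9])

raw_agreement = np.mean(rater_1 == rater_2)    # ≈ 0.82
kappa = cohen_kappa_score(rater_1, rater_2)    # ≈ 0.0
print(f"raw agreement ≈ {raw_agreement:.2f}, kappa ≈ {kappa:.3f}")
```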
Cons
Easy to get a high score in a balanced dataset.
The kappa paradox: symmetrically or asymmetrically unbalanced label distributions can produce a low kappa despite high observed agreement (see the sketch below).
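A small worked example of the paradox (made-up counts): the two raters agree on 90% of items, yet kappa comes out negative because both assign "yes" to about 95% of items.

```python
# Kappa-paradox sketch: high observed agreement, low (here negative) kappa,
# because both raters' label distributions are heavily skewed toward "yes".
from sklearn.metrics import cohen_kappa_score

# 90 items where both raters said "yes", 5 where only rater A did,
# 5 where only rater B did, and none where both said "no".
rater_a = ["yes"] * 90 + ["yes"] * 5 + ["no"] * 5
rater_b = ["yes"] * 90 + ["no"] * 5 + ["yes"] * 5

p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)   # 0.90
print(f"observed agreement = {p_o:.2f}")
print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")          # ≈ -0.053
```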
Implementation
Packages: caret (R), Weka (Java), scikit-learn (Python)
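A minimal usage sketch with scikit-learn's cohen_kappa_score; the toy POS labels are invented:

```python
# Cohen's kappa between two annotators' POS tags (toy, made-up labels).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_b = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "VERB"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")   # 0.70 for these toy labels
```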
Useful reading
Natural Language Annotation for Machine Learning, Chapter 6