reckart opened this issue 3 years ago
@chmeyer are you still watching this repo? Is this a bug or a feature?
@reckart sort of, where time permits :)
I would not recommend this fix. Although Krippendorff's measure internally uses disagreement modeling (i.e., the observed disagreement D_O and the expected disagreement D_E), it is still defined as an agreement measure. This is achieved by the "1 - " term in the result calculation "1 - (D_O / D_E)". That's why the method is called "calculateAgreement" rather than "calculateDisagreement".
Example: Imagine we see an observed disagreement of 0.5 (roughly half of the annotations are "wrong") and, given the annotations, we would expect a disagreement of 0.57 (this happens, for example, in a 2-rater, 4-item study with the items AA, AB, BA, BB). This means the raters did only slightly better than chance, so we get alpha = 1 - (0.5 / 0.57) = 1 - 0.88 = 0.12. If the raters produce an observed disagreement of only 0.25, then they do clearly better than chance and we obtain alpha = 1 - (0.25 / 0.57) = 1 - 0.44 = 0.56. To reach acceptable agreement levels, the raters would need to produce even less observed disagreement (or the expected disagreement would need to rise).
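For illustration, here is a minimal, self-contained sketch of that computation for two raters and nominal labels. It is not the DKPro Statistics implementation, just the formula alpha = 1 - (D_O / D_E) from above spelled out:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch (not the actual library code) of Krippendorff's alpha
// for 2 raters and nominal labels.
public class AlphaSketch {

    // D_O: fraction of items on which the two raters assign different labels.
    static double observedDisagreement(String[][] items) {
        double disagreeing = 0;
        for (String[] item : items) {
            if (!item[0].equals(item[1])) {
                disagreeing++;
            }
        }
        return disagreeing / items.length;
    }

    // D_E: probability of drawing two differently labelled values from all
    // annotations (without replacement), i.e. the disagreement expected by chance.
    static double expectedDisagreement(String[][] items) {
        Map<String, Integer> labelCounts = new HashMap<>();
        int n = 0;
        for (String[] item : items) {
            for (String label : item) {
                labelCounts.merge(label, 1, Integer::sum);
                n++;
            }
        }
        double sameLabelPairs = 0;
        for (int count : labelCounts.values()) {
            sameLabelPairs += (double) count * count;
        }
        return ((double) n * n - sameLabelPairs) / ((double) n * (n - 1));
    }

    public static void main(String[] args) {
        // 2 raters, 4 items: AA, AB, BA, BB
        String[][] items = { { "A", "A" }, { "A", "B" }, { "B", "A" }, { "B", "B" } };
        double dO = observedDisagreement(items); // 0.500
        double dE = expectedDisagreement(items); // 32/56 = 0.571
        // prints D_O=0.500 D_E=0.571 alpha=0.125
        System.out.printf("D_O=%.3f D_E=%.3f alpha=%.3f%n", dO, dE, 1.0 - dO / dE);
    }
}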
Coming back to your question, if there is no observed disagreement and no expected disagreement, this would mean that we have an empty study and thus nothing to judge. Returning an agreement of alpha = 1 would be misleading IMHO. Returning alpha = 0, as it is now, is also debatable, as there is no clear definition for this situation. Thus, NaN would be an option, but for practicality reasons (e.g., writing numbers into a database, computing averages, etc.), we chose 0 in the first place, and that is probably fine to keep.
What do you think? Best wishes!
In my case, I found that if I have two annotators who both annotate the same unit with the same label, then the expected and observed disagreement are both 0 and in the current code this causes the agreement to be reported as 0 - but it is full agreement and thus should be reported as 1.
Coming back to your question, if there is no observed disagreement and no expected disagreement, this would mean that we have an empty study and thus nothing to judge.
So the study is not necessarily empty if expected/observed disagreement are both 0.
Maybe?
if (D_O == 0.0 && D_E == 0.0) {
    return study.isEmpty() ? 0.0 : 1.0;
}
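Purely as a sketch of where such a guard could live - I am guessing at the surrounding method and helper names here, so treat calculateObservedDisagreement, calculateExpectedDisagreement, and study.isEmpty as placeholders rather than the actual API:

public double calculateAgreement() {
    // Placeholder helpers - the real names in KrippendorffAlphaAgreement may differ;
    // this only illustrates where the degenerate case would be intercepted.
    double observedDisagreement = calculateObservedDisagreement();
    double expectedDisagreement = calculateExpectedDisagreement();

    if (observedDisagreement == 0.0 && expectedDisagreement == 0.0) {
        // No disagreement observed and none expected: either there is no data
        // at all (report 0), or the raters agreed everywhere (report 1).
        return study.isEmpty() ? 0.0 : 1.0;
    }

    return 1.0 - (observedDisagreement / expectedDisagreement);
}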
Well, D_O can be 0.0 if the raters agree on all items. But D_E does not drop to 0 in a proper study. It can be 0 if there is only a single label, but then there is nothing to agree on, i.e. no question. In my opinion, this would not be a real annotation study. But if we want to support this use case, then, yes, the study.isEmpty solution would be a way to do it.
I assume you mean by "real" study that there is a significant number of annotations :) In INCEpTION/WebAnno, we calculate pairwise agreement between annotators. It is not uncommon to have cases where either no items have been annotated at all, or where both annotators have used only a single label (e.g. annotating the same units with the same label).
We can and probably should handle the first case (no items) directly in our code, telling the users that there was no data to compare. However, the second case would, I think, be better handled here.
Thanks for the feedback!
Few labels are not the actual problem: the simplest case AA BB (2 raters agreeing on 2 items) returns alpha = 1. Cases with 4 items also work well and give a good notion of agreement that captures the uncertainty, e.g., AA AB AA BB yields alpha = 0.53. But if there is nothing to decide, i.e. there is only a single label, then we could have 1000 items annotated with A by both raters without being able to tell the agreement, as there is no expectation model. I am OK with setting such cases to 1, but they should still be taken with a grain of salt.
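To make the degenerate case concrete, here is what happens numerically if we feed 1000 all-A items through the same formula (reusing the two helper methods from the sketch above):

// 1000 items, both raters always choose the single label "A".
String[][] singleLabel = new String[1000][];
java.util.Arrays.fill(singleLabel, new String[] { "A", "A" });

double dO = observedDisagreement(singleLabel); // 0.0 - nobody ever disagrees
double dE = expectedDisagreement(singleLabel); // 0.0 - no second label, so no chance model
System.out.println(1.0 - dO / dE);             // NaN: 0.0 / 0.0 is undefined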
The KrippendorffAlphaAgreement is a disagreement measure. If there is full agreement, then the expected and the observed disagreement are both calculated as 0.0. However, a disagreement of 0 in this case does not yield an agreement of 1.0 but instead an agreement of 0.0... seems wrong?
Suggested fix: