FAQ "Why are the “essentials” just Boolean. Isn’t it too simplistic?"

acmsigsoft / EmpiricalStandards

Tools and standards for conducting and evaluating research in software engineering

Creative Commons Zero v1.0 Universal


"Yes/no questions will lead to the highest inter-rater reliability"

I might be confused, but I don't think that's true. Imagine the "true" ratings are all around 0.5 on a 0-to-1 scale. Binary raters slightly biased towards 1 might rate everything 1, and binary raters slightly biased towards 0 might rate everything 0, resulting in low interrater reliability.

On the other hand, a slight upward or downward bias will barely affect continuous ratings.
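To make the thought experiment concrete, here is a minimal sketch. The item scores, the bias size of 0.1, and the 0.5 cut-off are all made-up assumptions for illustration, not data from any study.

```python
# Hypothetical illustration: two raters with small, opposite biases
# rate the same items on a binary and on a continuous scale.

# Assumed "true" quality of five items, all close to 0.5 on a 0-to-1 scale.
true_scores = [0.45, 0.48, 0.50, 0.52, 0.55]
BIAS = 0.1  # assumed size of each rater's bias


def continuous_rating(score, bias):
    """Continuous rating: the biased score, clamped to [0, 1]."""
    return min(1.0, max(0.0, score + bias))


def binary_rating(score, bias, threshold=0.5):
    """Binary rating: 1 if the biased score reaches the threshold, else 0."""
    return 1 if score + bias >= threshold else 0


# A lenient rater (biased upward) and a strict rater (biased downward).
lenient_cont = [continuous_rating(s, +BIAS) for s in true_scores]
strict_cont = [continuous_rating(s, -BIAS) for s in true_scores]
lenient_bin = [binary_rating(s, +BIAS) for s in true_scores]
strict_bin = [binary_rating(s, -BIAS) for s in true_scores]

# Share of items on which the two binary raters give the same answer.
binary_agreement = sum(a == b for a, b in zip(lenient_bin, strict_bin)) / len(true_scores)

# Average distance between the two continuous ratings of the same item.
mean_abs_gap = sum(abs(a - b) for a, b in zip(lenient_cont, strict_cont)) / len(true_scores)

print("binary agreement:", binary_agreement)
print("mean continuous gap:", round(mean_abs_gap, 3))
```

With these assumed numbers the two binary raters disagree on every item, while their continuous ratings differ by about 0.2 on each item. This only shows that the scenario is possible; it says nothing about how often real reviews look like this.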

If I'm right, this should at the very least be qualified to state when binary ratings actually lead to greater interrater reliability.

It is also not clear to me that interrater reliability is a valuable goal. In fact, a side effect of getting diverse perspectives from reviewers is lower interrater reliability.

It may be possible, but Kitchenham did a study of reviewing checklists with 5-point scales and got much lower reliability than we have achieved with binary scales. Binary scales are recommended in the case survey literature to improve reliability. See the Bullock and Tubbs paper referenced in the case survey standard.

Interrater reliability is a valuable goal because the lower the reliability, the more acceptance is determined by reviewer selection rather than the quality of the paper. If reliability is zero, the content of the paper is irrelevant to the decision outcome. Peer review becomes a lottery. There is no domain of expert judgment, or measurement for that matter, where low reliability is considered valuable. Only in peer review do people make this argument, and it just doesn't hold up.
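As a rough sketch of what "zero reliability" means for accept/reject decisions, the snippet below computes Cohen's kappa, one common chance-corrected agreement measure. The choice of kappa and the reviewer decisions are assumptions for illustration; the thread does not say which reliability statistic was used.

```python
# Hypothetical illustration: Cohen's kappa for two reviewers' accept (1) /
# reject (0) decisions on the same papers. All decisions are made up.


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    n = len(ratings_a)
    # Observed share of items where both raters gave the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if the raters decided independently,
    # each keeping their own overall rate per category.
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one identical category
    return (observed - expected) / (1 - expected)


# Reviewers who mostly agree: 6 of 8 decisions match.
mostly_agree = cohens_kappa([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 1])

# Reviewers whose agreement is no better than chance: kappa is zero, so the
# outcome for a paper depends on which reviewer it happens to get.
lottery = cohens_kappa([1, 0, 1, 0], [1, 1, 0, 0])

print("mostly agree:", mostly_agree)
print("lottery:", lottery)
```

In the second case each reviewer accepts half of the papers, yet they agree on only half of them, which is exactly what independent coin flips would produce.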

FAQ "Why are the “essentials” just Boolean. Isn’t it too simplistic?" #101