acmsigsoft / EmpiricalStandards

Tools and standards for conducting and evaluating research in software engineering
https://acmsigsoft.github.io/EmpiricalStandards/
Creative Commons Zero v1.0 Universal

FAQ "Why are the “essentials” just Boolean. Isn’t it too simplistic?" #101

Closed: guerzh closed this issue 1 year ago

guerzh commented 1 year ago

"Yes/no questions will lead to the highest inter-rater reliability"

I might be confused, but I don't think that's true. Imagine the "true" ratings are all around 0.5 on a 0-to-1 scale. Binary raters slightly biased towards 1 might rate everything 1, and binary raters slightly biased towards 0 might rate everything 0, resulting in low interrater reliability.

On the other hand, a slight upward or downward bias will shift continuous ratings only slightly.
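A minimal sketch of this scenario, using made-up numbers rather than real review data: five items whose hypothetical "true" scores sit near 0.5, rated by one slightly lenient and one slightly strict rater. The helper names, the bias of 0.05, and the 0.5 cut-off are all assumptions chosen for illustration.

```python
def rate_continuous(true_scores, bias):
    """Continuous rating: the true score shifted by the rater's bias, clamped to [0, 1]."""
    return [min(1.0, max(0.0, t + bias)) for t in true_scores]


def rate_binary(true_scores, bias, threshold=0.5):
    """Binary rating: 1 if the biased score reaches the threshold, else 0."""
    return [1 if t + bias >= threshold else 0 for t in true_scores]


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the identical rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_abs_gap(a, b):
    """Average absolute difference between two raters' scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical "true" scores clustered around 0.5 (illustrative only).
true_scores = [0.46, 0.48, 0.50, 0.52, 0.54]
lenient_bias, strict_bias = 0.05, -0.05  # assumed small rater biases

binary_lenient = rate_binary(true_scores, lenient_bias)
binary_strict = rate_binary(true_scores, strict_bias)
print("binary agreement:", percent_agreement(binary_lenient, binary_strict))

continuous_lenient = rate_continuous(true_scores, lenient_bias)
continuous_strict = rate_continuous(true_scores, strict_bias)
print("continuous mean gap:", mean_abs_gap(continuous_lenient, continuous_strict))
```

With these inputs the two binary raters agree on none of the five items, while their continuous scores differ by about 0.10 on average. This only shows that the scenario is possible when true scores cluster at the cut-off; it says nothing about how often that happens in practice.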

If I'm right, this should at the very least be qualified to state when binary ratings lead to greater interrater reliability.

It is also not clear to me that interrater reliability is a valuable goal. In fact, a side effect of getting diverse perspectives from reviewers is lower interrater reliability.

drpaulralph commented 1 year ago

That may be possible, but Kitchenham did a study of reviewing checklists with 5-point scales and got much lower reliability than we have achieved with binary scales. Binary scales are recommended in the case survey literature to improve reliability. See the Bullock and Tubbs paper referenced in the case survey standard.

Interrater reliability is a valuable goal because the lower the reliability, the more acceptance is determined by reviewer selection rather than the quality of the paper. If reliability is zero, the content of the paper is irrelevant to the decision outcome. Peer review becomes a lottery. There is no domain of expert judgment, or measurement for that matter, where low reliability is considered valuable. Only in peer review do people make this argument, and it just doesn't hold up.
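To make "reliability is zero" concrete, here is a rough sketch using Cohen's kappa, one common chance-corrected agreement statistic for two raters. The choice of kappa is an assumption (this thread does not say which statistic was used), and the accept/reject decisions below are invented for illustration.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    # Observed agreement: share of items rated identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's own label frequencies.
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical accept (1) / reject (0) decisions on eight papers.
reviewer_a = [1, 1, 0, 0, 1, 0, 1, 0]
reviewer_b = [1, 0, 1, 0, 1, 0, 0, 1]

print("kappa, a vs b:", cohen_kappa(reviewer_a, reviewer_b))
print("kappa, a vs a:", cohen_kappa(reviewer_a, reviewer_a))
```

Here the two reviewers agree on half of the papers, which is exactly what two coin flips would produce, so kappa is 0. Under that condition, which reviewer a paper happens to draw decides the outcome, which is the lottery described above.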