If it helps, back in 2010-11 someone at CF told me "We measure individual worker quality quite simply by their accuracy on gold units. When aggregating, we simply weight each worker by their accuracy and take the answer with the highest weighted majority vote. What works well is that we do not let people continue on our tasks if their gold accuracy is below 70% (or another specified threshold)"
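For reference, here's a minimal sketch of the weighting scheme described in that quote: weight each worker's answer by their gold accuracy, drop anyone below the 70% cutoff, and take the weighted majority. The function and variable names are illustrative only, not CrowdFlower's actual implementation.

```python
from collections import defaultdict

def aggregate_responses(responses, gold_accuracy, min_accuracy=0.70):
    """Trust-weighted majority vote over a single unit.

    responses     : list of (worker_id, answer) pairs for one unit
    gold_accuracy : dict mapping worker_id -> accuracy on gold units
    min_accuracy  : workers below this threshold are excluded entirely
    """
    weights = defaultdict(float)
    for worker_id, answer in responses:
        acc = gold_accuracy.get(worker_id, 0.0)
        if acc < min_accuracy:
            continue  # worker would have been removed from the job
        weights[answer] += acc

    if not weights:
        return None, 0.0

    winner = max(weights, key=weights.get)
    confidence = weights[winner] / sum(weights.values())
    return winner, confidence

# Example: three workers, one of whom falls below the 70% cutoff
answers = [("w1", "guitar"), ("w2", "guitar"), ("w3", "violin")]
accuracy = {"w1": 0.9, "w2": 0.8, "w3": 0.6}
print(aggregate_responses(answers, accuracy))  # -> ('guitar', 1.0)
```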
As for aggregating or not, regardless of what you do, please share raw annotations.
hm, okay, thanks! I wonder if that algorithm has shifted any in the last several years.
re: "raw" annotations, @julian-urbano do you have thoughts on the proposed columns?
Last time I used CF (maybe 2013?) they provided separate files for workers' info and for their answers to the units, like this. I'd find it very useful to have all that if possible, but if not, what you proposed looks fine.
cool, the only other thing potentially worth folding in is `channel`, perhaps in some kind of anonymized form? otherwise, let's start with what I've proposed and we can revisit it later if need be.
@bmcfee do you have any opinions before I do this?
@ejhumphrey not specifically; all of the above sounds good to me. As you say, it would be good to get some explicit confirmation from CF about where that rating comes from/whether the process has changed since 2013.
To whatever extent we can anonymize / protect the annotators' information, we should do so. I suspect that's not too big of a deal here, but it's worth considering before moving forward on releasing granular annotation data.
agreed, I was (implicitly) planning a non-deterministic one-way mapping for worker IDs, could be worth doing for channel as well. The remaining info named above isn't personal.
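A minimal sketch of one way to do that non-deterministic one-way mapping: hash each ID with a throwaway random salt that is never stored, so the mapping is stable within a release but can't be reversed or regenerated afterward. The helper name and salt length are just illustrative.

```python
import hashlib
import secrets

# One random salt per release; it is never written to disk, so the
# mapping cannot be reproduced or inverted after the fact.
_SALT = secrets.token_bytes(16)

def anonymize(raw_id, n_hex=12):
    """Map a raw worker (or channel) ID to a stable-but-opaque token."""
    digest = hashlib.sha256(_SALT + str(raw_id).encode("utf-8"))
    return digest.hexdigest()[:n_hex]

# The same ID maps consistently within one run, but differently across runs:
print(anonymize("12345678"))
print(anonymize("12345678"))  # same token as the line above
```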
this will be addressed by #23
fixed as of #23
I was just working through some dataset integrity checks, and noticed that the confidence / relevance score reported from CrowdFlower isn't only a function of `num_responses`. Looking at the raw data, it could be a mix of trusted / untrusted judgments, e.g. when an annotator dropped below a confidence level and was removed from the job, or if confidence is weighted by `trust` (whether this is intra- or inter-task, I'm not sure). This raises at least two questions:
`[sample_key, annotator_id, trust, instrument, response]`
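For reference, a minimal sketch of how the reported confidence could be recomputed from a long-format export with those columns, assuming it is trust-weighted agreement with the winning response per sample. The pandas usage and the sample values are illustrative only, not a claim about CrowdFlower's actual formula.

```python
import pandas as pd

def trust_weighted_confidence(df):
    """Per-sample confidence as trust-weighted agreement with the winning response.

    Expects a long-format frame with the columns proposed above:
    [sample_key, annotator_id, trust, instrument, response].
    """
    def per_sample(group):
        weights = group.groupby("response")["trust"].sum()
        winner = weights.idxmax()
        return pd.Series({
            "response": winner,
            "confidence": weights[winner] / weights.sum(),
            "num_responses": len(group),
        })

    return df.groupby("sample_key").apply(per_sample)

# Example rows (values are made up):
df = pd.DataFrame([
    ("a", "w1", 0.9, "guitar", "yes"),
    ("a", "w2", 0.8, "guitar", "yes"),
    ("a", "w3", 0.6, "guitar", "no"),
], columns=["sample_key", "annotator_id", "trust", "instrument", "response"])
print(trust_weighted_confidence(df))
```

Comparing this against the confidence column in the CF export would at least tell us whether untrusted judgments are being folded into the reported score.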