If it helps, back in 2010-11 someone at CF told me "We measure individual worker quality quite simply by their accuracy on gold units. When aggregating, we simply weight each worker by their accuracy and take the answer with the highest weighted majority vote. What works well is that we do not let people continue on our tasks if their gold accuracy is below 70% (or another specified threshold)"
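For reference, here's a minimal sketch of the weighting scheme described in that quote: weight each worker's answer by their gold accuracy, drop anyone below the 70% cutoff, and take the weighted majority. The function and variable names are illustrative only, not CrowdFlower's actual implementation.

```python
from collections import defaultdict

def aggregate_responses(responses, gold_accuracy, min_accuracy=0.70):
    """Trust-weighted majority vote over a single unit.

    responses     : list of (worker_id, answer) pairs for one unit
    gold_accuracy : dict mapping worker_id -> accuracy on gold units
    min_accuracy  : workers below this threshold are excluded entirely
    """
    weights = defaultdict(float)
    for worker_id, answer in responses:
        acc = gold_accuracy.get(worker_id, 0.0)
        if acc < min_accuracy:
            continue  # worker would have been removed from the job
        weights[answer] += acc

    if not weights:
        return None, 0.0

    winner = max(weights, key=weights.get)
    confidence = weights[winner] / sum(weights.values())
    return winner, confidence

# Example: three workers, one of whom falls below the 70% cutoff
answers = [("w1", "guitar"), ("w2", "guitar"), ("w3", "violin")]
accuracy = {"w1": 0.9, "w2": 0.8, "w3": 0.6}
print(aggregate_responses(answers, accuracy))  # -> ('guitar', 1.0)
```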
As for aggregating or not, regardless of what you do, please share raw annotations.
hm, okay, thanks! I wonder if that algorithm has shifted any in the last several years.
re: "raw" annotations, @julian-urbano do you have thoughts on the proposed columns?
Last time I used CF (maybe 2013?) they provided separate files for workers' info and for their answers to the units, like this. I'd find it very useful to have all that if possible, but if not, what you proposed looks fine.
cool, the only other thing potentially worth folding in is `channel`, perhaps in some kind of anonymized form? otherwise, let's start with what I've proposed and we can revisit it later if need be.
@bmcfee do you have any opinions before I do this?
@ejhumphrey not specifically; all of the above sounds good to me. As you say, it would be good to get some explicit confirmation from CF about where that rating comes from/whether the process has changed since 2013.
To whatever extent we can anonymize / protect the annotators' information, we should do so. I suspect that's not too big of a deal here, but it's worth considering before moving forward on releasing granular annotation data.
agreed, I was (implicitly) planning a non-deterministic one-way mapping for worker IDs, could be worth doing for channel as well. The remaining info named above isn't personal.
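A minimal sketch of one way to do that non-deterministic one-way mapping: hash each ID with a throwaway random salt that is never stored, so the mapping is stable within a release but can't be reversed or regenerated afterward. The helper name and salt length are just illustrative.

```python
import hashlib
import secrets

# One random salt per release; it is never written to disk, so the
# mapping cannot be reproduced or inverted after the fact.
_SALT = secrets.token_bytes(16)

def anonymize(raw_id, n_hex=12):
    """Map a raw worker (or channel) ID to a stable-but-opaque token."""
    digest = hashlib.sha256(_SALT + str(raw_id).encode("utf-8"))
    return digest.hexdigest()[:n_hex]

# The same ID maps consistently within one run, but differently across runs:
print(anonymize("12345678"))
print(anonymize("12345678"))  # same token as the line above
```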
this will be addressed by #23
fixed as of #23
I was just working through some dataset integrity checks, and noticed that the confidence / relevance score reported from CrowdFlower isn't only a function of `num_responses`. Looking at the raw data, it could be a mix of trusted / untrusted judgments, e.g. when an annotator dropped below a confidence level and was removed from the job, or if confidence is weighted by `trust` (whether this is intra- or inter-task, I'm not sure). This raises at least two questions:
`[sample_key, annotator_id, trust, instrument, response]`
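For reference, a minimal sketch of how the reported confidence could be recomputed from a long-format export with those columns, assuming it is trust-weighted agreement with the winning response per sample. The pandas usage and the sample values are illustrative only, not a claim about CrowdFlower's actual formula.

```python
import pandas as pd

def trust_weighted_confidence(df):
    """Per-sample confidence as trust-weighted agreement with the winning response.

    Expects a long-format frame with the columns proposed above:
    [sample_key, annotator_id, trust, instrument, response].
    """
    def per_sample(group):
        weights = group.groupby("response")["trust"].sum()
        winner = weights.idxmax()
        return pd.Series({
            "response": winner,
            "confidence": weights[winner] / weights.sum(),
            "num_responses": len(group),
        })

    return df.groupby("sample_key").apply(per_sample)

# Example rows (values are made up):
df = pd.DataFrame([
    ("a", "w1", 0.9, "guitar", "yes"),
    ("a", "w2", 0.8, "guitar", "yes"),
    ("a", "w3", 0.6, "guitar", "no"),
], columns=["sample_key", "annotator_id", "trust", "instrument", "response"])
print(trust_weighted_confidence(df))
```

Comparing this against the confidence column in the CF export would at least tell us whether untrusted judgments are being folded into the reported score.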