crowdresearch / collective

The Stanford crowd research collective
http://crowdresearch.stanford.edu

Perform quality comparisons to AMT #14

Open mbernst opened 6 years ago

mbernst commented 6 years ago

AI researchers want to know whether our quality is a step up from Mechanical Turk before they spend time using our platform. Since AI researchers currently supply a substantial proportion of the work posted on the platform, I think we should answer this question definitively and use it to recruit them as requesters.

My proposal

I would like to run a series of comparison experiments where we launch the same task on AMT and Daemo and compare the results against known ground truth. Based on a suggestion from @dmorina, I would start with the effort-responsive task that @shirishgoyal created. I would then try some others (e.g., transcription, labeling) that are common benchmarks.
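
As a rough illustration of what the AMT half of one comparison run could look like, here is a minimal sketch using boto3's MTurk client. It assumes AWS credentials are already configured and that the task is served as an ExternalQuestion at a hypothetical TASK_URL; the reward, batch size, and durations are placeholders, and the Daemo side is omitted since its launch flow isn't covered in this thread.

```python
# Minimal sketch: launch the benchmark task on AMT via boto3.
# TASK_URL, reward, and batch size are hypothetical placeholders.
import boto3

TASK_URL = "https://example.com/effort-task"  # hypothetical task endpoint

question_xml = f"""
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>{TASK_URL}</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

mturk = boto3.client("mturk", region_name="us-east-1")  # MTurk's API lives in us-east-1
hit = mturk.create_hit(
    Title="Effort-responsive benchmark task",
    Description="Complete the task as accurately as you can.",
    Reward="0.50",                       # placeholder pay, in USD
    MaxAssignments=50,                   # placeholder number of workers
    LifetimeInSeconds=24 * 60 * 60,
    AssignmentDurationInSeconds=20 * 60,
    Question=question_xml,
)
print("Launched HIT:", hit["HIT"]["HITId"])
```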

If this goes well, I will suggest changes to the homepage that emphasize the quality difference.


Use comments to share your response or use emoji 👍 to show your support. To officially join in, add yourself as an assignee to the proposal. To break consensus, comment using this template. To find out more about this process, read the how-to.

qwertyone commented 6 years ago

I do not see how to self-assign. How is this done? This would be useful for others to know in the future.

markwhiting commented 6 years ago

@qwertyone you should now have an invite to be able to do that.

qwertyone commented 6 years ago

To join the issues, the interested party must mention their intent to be involved and then an administrator invites them. This is how the process works now.


markwhiting commented 6 years ago

@qwertyone we are now able to assign ourselves.

When I was initially testing things out, that's how it had to work, but with some modifications (i.e., using this /collective repository) we are now able to let community members add themselves as assignees. I will update that help document.

mbernst commented 6 years ago

I'd like to make sure this proposal aligns with the requirements, so I'm adding a few more details:

Slack username of the person or people who created the proposal

@Michael Bernstein

What problem is being solved? If feasible, specific evidence or examples (e.g., screen shots, data) in support of the understanding of the problem.

For the last few months we have been discussing our plan to get more requesters using the system. I have been speaking with AI/ML/data folks around Stanford, and one question that keeps coming up is how our quality compares to AMT. This is also fundamental to supporting our claim that Boomerang helps in the deployed platform.

If we can definitively answer the question of quality on benchmark tasks, I think it will be far easier to get requesters onto the platform.

Short-term (and, if relevant, long-term) implications of executing this proposal

It is possible that the answer to whether our quality is higher will be one we are not happy to hear. :) If so, we will need to think hard about how to improve. But I think that, as researchers, we need to be open to the truth no matter how it comes out.

qwertyone commented 6 years ago

@mbernst How will the different aspects of quality be identified, then weighted for direct comparison? We could create an index (speed of payment x = index.... ya da ya da) or a modified house of quality here.
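
As a purely hypothetical illustration of that index idea, a weighted combination of normalized quality-related measurements might look like the sketch below; the metric names, weights, and numbers are made up for illustration and are not anything agreed on in this thread.

```python
# Hypothetical quality index: weighted sum of normalized metrics in [0, 1].
# Metric names, weights, and the example numbers are illustrative only.
WEIGHTS = {"accuracy": 0.6, "speed_of_payment": 0.2, "completion_rate": 0.2}

def quality_index(metrics: dict) -> float:
    """Combine per-platform metrics into one score using the weights above."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

amt_metrics = {"accuracy": 0.82, "speed_of_payment": 0.70, "completion_rate": 0.95}
daemo_metrics = {"accuracy": 0.88, "speed_of_payment": 0.90, "completion_rate": 0.93}

print("AMT index:  ", round(quality_index(amt_metrics), 3))
print("Daemo index:", round(quality_index(daemo_metrics), 3))
```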

mbernst commented 6 years ago

Summary

We have agreed that we will run quality comparisons on benchmark tasks against AMT. If it goes well, we will propose changes to the homepage to emphasize this.

(N.B. another AI researcher asked me today to quantify the quality difference, so this seems a worthwhile exercise :) )


As per discussion today, the goal moving forward will be to have this summary prior to the weekly meeting, so everyone knows what has been agreed upon.

mbernst commented 6 years ago

> @mbernst How will the different aspects of quality be identified, then weighted for direct comparison? We could create an index (speed of payment x = index.... ya da ya da) or a modified house of quality here.

The norm in this area is to measure quality as % correct on tasks with known ground truth. We did this for the guilds paper, and the plan is to start there with this as well.
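
A minimal sketch of that scoring, with hypothetical task IDs, labels, and responses standing in for real result exports from AMT and Daemo:

```python
# Minimal sketch: % correct against known ground truth, per platform.
# Task IDs, labels, and responses below are hypothetical placeholders.
def percent_correct(responses: dict, ground_truth: dict) -> float:
    """Percentage of ground-truth tasks whose response matches the known label."""
    scored = [responses.get(task_id) == label for task_id, label in ground_truth.items()]
    return 100 * sum(scored) / len(scored) if scored else 0.0

ground_truth = {"t1": "cat", "t2": "dog", "t3": "cat"}
amt_responses = {"t1": "cat", "t2": "cat", "t3": "cat"}
daemo_responses = {"t1": "cat", "t2": "dog", "t3": "cat"}

print(f"AMT:   {percent_correct(amt_responses, ground_truth):.1f}% correct")
print(f"Daemo: {percent_correct(daemo_responses, ground_truth):.1f}% correct")
```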