ProjectSidewalk / SidewalkWebpage

Project Sidewalk web page
http://projectsidewalk.org
MIT License

Simple Real-Time Quality Control Mechanisms to Filter Out Bad Quality Work And/Or Workers #1082

Open maddalihanumateja opened 6 years ago

maddalihanumateja commented 6 years ago

This applies to cases where the total label count for a mission turns out to be zero. Ask for a manual confirmation on whether the user really didn't see any issues during the mission. We could also mark the street edges for re-auditing by another Turker and exclude the street edges associated with that mission from the landing page stats.
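A minimal sketch of what that confirmation check could look like at mission completion, assuming hypothetical names like `labelCount`, `flagStreetEdgesForReaudit`, and `excludeMissionFromLandingPageStats` (none of these are from the actual codebase):

```typescript
// Hypothetical check run when a user tries to submit a completed mission.
interface MissionResult {
  missionId: number;
  streetEdgeIds: number[];
  labelCount: number;
}

// Assumed helpers; names are illustrative, not from the Project Sidewalk codebase.
declare function flagStreetEdgesForReaudit(streetEdgeIds: number[]): void;
declare function excludeMissionFromLandingPageStats(missionId: number): void;

function onMissionComplete(mission: MissionResult): boolean {
  if (mission.labelCount === 0) {
    // Ask the user to confirm that they really saw no accessibility issues.
    const confirmed = window.confirm(
      "You placed no labels during this mission. Did you really see no " +
      "accessibility problems (missing curb ramps, obstacles, etc.)?"
    );
    if (!confirmed) {
      return false; // Let the user go back and add labels.
    }
    // Even if confirmed, queue the streets for a second audit and keep the
    // mission out of the public landing-page counts until it is verified.
    flagStreetEdgesForReaudit(mission.streetEdgeIds);
    excludeMissionFromLandingPageStats(mission.missionId);
  }
  return true; // Proceed with normal mission submission.
}
```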

jonfroehlich commented 6 years ago

Thanks @maddalihanumateja for creating this issue. For context, on Sept 13 we started to notice that one or more users were completing audits with very few (or zero) labels. This should almost never be the case. I don't know the exact figure, but I think an average audit yields something like ~50 labels.

Yesterday, we had 82 audits but only 164 labels (that's 2 labels/audit on average) and today we have 116 audits with only 18 labels (0.15 labels per audit).

[screenshot: recent daily audit and label counts]
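Those ratios are themselves the most obvious first signal; a rough sketch of the kind of threshold check this suggests, where the cutoff value and function name are placeholders rather than anything from the codebase:

```typescript
// Flag a day's (or a user's) activity as suspicious when the labels-per-audit
// ratio falls far below what we normally see (tens of labels per audit).
const MIN_LABELS_PER_AUDIT = 5; // Illustrative cutoff, not a tuned value.

function isSuspiciousActivity(auditCount: number, labelCount: number): boolean {
  if (auditCount === 0) return false;
  return labelCount / auditCount < MIN_LABELS_PER_AUDIT;
}

// With the numbers quoted above:
console.log(isSuspiciousActivity(82, 164)); // 164 / 82 = 2.0   -> true
console.log(isSuspiciousActivity(116, 18)); // 18 / 116 ≈ 0.16  -> true
```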

So it looks like we are seeing some vandalism on the site, which means we need simple checks (which we can make more sophisticated in the future) to:

Also, to deal with the existing bad data now in our database, we need to discuss what to do:

misaugstad commented 6 years ago

This is also discussed in #615

misaugstad commented 6 years ago

I'm sure that we will start with an MVP, but for the future... I was just thinking that, as an input to some algorithm we write that tries to estimate the quality of an audit, we could use something like the distance of labels (of certain label types) from intersections.

The most obvious case: if we see curb ramp and missing curb ramp labels that are far from intersections, the audit might be marked as low quality. Obviously this should be verified using the GT that we now have, but I see it as another signal we could factor into a more sophisticated algorithm for getting high-quality audits.
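A rough sketch of that signal, assuming we already have label positions and intersection coordinates available; the label-type strings, the 25 m cutoff, and the function names are illustrative and would need to be validated against GT:

```typescript
// Heuristic: curb ramp / missing curb ramp labels normally sit near
// intersections, so labels of those types that are far from every known
// intersection may indicate a low-quality audit.
interface Point { lat: number; lng: number; }
interface Label { labelType: string; position: Point; }

// Haversine distance in meters between two lat/lng points.
function distanceMeters(a: Point, b: Point): number {
  const R = 6371000;
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Fraction of curb-ramp-type labels that are "far" from every intersection.
// The 25 m threshold is a placeholder that would need tuning against GT data.
function fractionFarFromIntersections(
  labels: Label[],
  intersections: Point[],
  maxDistMeters = 25
): number {
  const ramps = labels.filter(
    (l) => l.labelType === "CurbRamp" || l.labelType === "NoCurbRamp"
  );
  if (ramps.length === 0) return 0;
  const far = ramps.filter((l) =>
    intersections.every((i) => distanceMeters(l.position, i) > maxDistMeters)
  );
  return far.length / ramps.length;
}
```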

misaugstad commented 6 years ago

So I see two similar, though slightly different problems here (at least in practice). There is filtering out work we suspect to be of low quality based on some algorithm/metrics, and then there is filtering out work that is deliberately done in bad faith. And I would like to be able to manually add users to the latter list.

For workers that we determine to be auditing in bad faith (either by manual review, or algorithmically if we use an extreme threshold), we should just filter their labels/audits out of just about everything. We don't want to include their labels in the label count on our landing page, we don't want to return their labels via the API, and we don't want to include their labels when creating visualizations of our data.

Workers whose work we suspect may be low quality based on some algorithm/metrics are a different story. We probably want to include their data in most of those places. When doing analysis or visualization, we may want to apply less weight to their labels, but we don't want to completely disregard them at all times.
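A minimal sketch of how that two-tier treatment could look in analysis code, with made-up names and a placeholder weight; bad-faith users contribute nothing, while suspected-low-quality users are kept but down-weighted:

```typescript
// Illustrative label record; field names are hypothetical, not the real schema.
interface LabelRecord {
  userId: string;
  labelId: number;
}

function labelWeight(
  label: LabelRecord,
  badFaithUserIds: Set<string>,
  lowQualityUserIds: Set<string>
): number {
  if (badFaithUserIds.has(label.userId)) return 0;     // Exclude entirely.
  if (lowQualityUserIds.has(label.userId)) return 0.5; // Down-weight (placeholder value).
  return 1;                                             // Full weight otherwise.
}

// A weighted count could then replace raw label counts in analyses.
function weightedLabelCount(
  labels: LabelRecord[],
  badFaith: Set<string>,
  lowQuality: Set<string>
): number {
  return labels.reduce((sum, l) => sum + labelWeight(l, badFaith, lowQuality), 0);
}
```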

Another difference I see is that a user may be placed on the list of "possibly low quality work", and may then be pulled off the list later on, depending on how our model changes, or on how their labeling behavior changes as they continue. But for the list of workers auditing in bad faith, I don't really expect users to be removed from that list much at all.

I'm wondering if it would be a good idea to create some table in the database (or a column in the user table) that maintains a list of users we have verified as auditing in bad faith. I am building up such a list as I review the HITs that we post to mturk, and I would like to do something with it, especially considering that our model is very simple right now and would not filter out all such bad-faith users.
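A minimal sketch of that idea, assuming a hypothetical `auditedInBadFaith` flag on the user record (not the actual schema); the same filter would then back the landing-page counts, the API responses, and the visualizations:

```typescript
// Hypothetical user record with a manually maintained bad-faith flag.
interface SidewalkUser {
  userId: string;
  username: string;
  auditedInBadFaith: boolean; // Set by manual HIT review, or by an extreme threshold.
}

interface LabelRow {
  labelId: number;
  userId: string;
}

// Single shared filter used by the landing-page stats, the public API,
// and visualization exports, so bad-faith labels are excluded everywhere.
function excludeBadFaithLabels(
  labels: LabelRow[],
  users: Map<string, SidewalkUser>
): LabelRow[] {
  return labels.filter((l) => !(users.get(l.userId)?.auditedInBadFaith ?? false));
}
```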