ProjectSidewalk / sidewalk-data-analysis

Holds all offline data analysis scripts for Project Sidewalk required for our forthcoming paper submission

Compute percent agreement for production dataset #6

Open misaugstad opened 6 years ago

misaugstad commented 6 years ago

Here is the algorithm that @manaswis and I came up with, as written by @manaswis on Slack:

For agreement/consensus, we want to be able to say that an object was marked as a problem by X users out of Y users who audited this street. So in our results, we can say "Amongst streets audited by multiple users, X% of labels have Y% agreement" and then we break it down by label type.
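As a rough illustration of the "X% of labels have Y% agreement" summary, here is a minimal sketch. It assumes each cluster has already been given a label type and a percent agreement; the function name `share_meeting_threshold` and the label type strings in the usage note are hypothetical, not taken from the analysis scripts.

```python
from collections import defaultdict

def share_meeting_threshold(clusters, threshold):
    """Return, per label type, the percentage of clusters whose
    agreement is at or above `threshold`.

    clusters: iterable of (label_type, percent_agreement) pairs
              (assumed input shape, for illustration only).
    """
    totals = defaultdict(int)  # clusters seen per label type
    hits = defaultdict(int)    # clusters meeting the threshold per label type
    for label_type, agreement in clusters:
        totals[label_type] += 1
        if agreement >= threshold:
            hits[label_type] += 1
    return {t: 100.0 * hits[t] / totals[t] for t in totals}
```

For example, `share_meeting_threshold([("CurbRamp", 100.0), ("CurbRamp", 50.0), ("Obstacle", 75.0)], 75.0)` gives 50% for the first type and 100% for the second.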

So to do this, for a street, we first have to get both the number of users who said there was a problem (for a cluster) and the number of users who said there was no problem. We decided to do this in the following way: for a label cluster (after doing single- and multi-user clustering), the number of users who have a label in the cluster is the number who said there was a problem. To get the number who said it wasn't a problem, we find which street is closest to that cluster, and then we count the number of users who placed any label on that street but did not have a label in that cluster. Based on these numbers, we calculate the percent agreement for a problem cluster.
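The counting described above could be sketched roughly as follows. This assumes the nearest street for each cluster has already been computed, and it takes "percent agreement" to mean the share of users on the street who marked the problem; both are assumptions, and the name `percent_agreement` and the input shapes are hypothetical.

```python
def percent_agreement(clusters, street_labelers):
    """Compute percent agreement for each problem cluster.

    clusters: dict of cluster_id -> (nearest_street_id, set of user ids
              with a label in the cluster). Assumed shape.
    street_labelers: dict of street_id -> set of user ids who placed
              any label on that street. Assumed shape.
    """
    result = {}
    for cluster_id, (street_id, yes_users) in clusters.items():
        # Users who labeled the street but have no label in this cluster
        # are counted as saying "no problem here".
        no_users = street_labelers.get(street_id, set()) - yes_users
        total = len(yes_users) + len(no_users)
        if total > 0:
            result[cluster_id] = 100.0 * len(yes_users) / total
    return result
```

With two users in a cluster and four users who labeled the nearest street (including those two), the cluster comes out at 50% agreement.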

I do have an idea, though, for how we can simplify the algorithm while still feeling confident in the results. Instead of counting the number of users who placed a label on that street, I think we can just take the number of users who audited that street, but only because we are looking at the set of "good" users (i.e., those with a high labeling frequency). Looking at users who had placed a label on that particular street was mostly meant to find users who actually audited the street; I think this is covered when we are looking at our set of "good" users.
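A minimal sketch of the simplified variant, for a single cluster. It assumes we already have the set of users with a label in the cluster, the set of users who audited the nearest street, and the set of "good" users; the function name and arguments are hypothetical.

```python
def percent_agreement_simplified(cluster_users, street_auditors, good_users):
    """Percent agreement for one cluster using auditors instead of labelers.

    cluster_users: set of user ids with a label in the cluster.
    street_auditors: set of user ids who audited the nearest street.
    good_users: set of user ids considered "good" (high labeling frequency).
    Returns None when no good user is involved.
    """
    yes_users = cluster_users & good_users
    auditors = street_auditors & good_users
    # Union guards against a cluster user missing from the auditor set.
    denominator = auditors | yes_users
    if not denominator:
        return None
    return 100.0 * len(yes_users) / len(denominator)
```

The only change from the original approach is the denominator: every good user who audited the street counts, whether or not they placed a label on it.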

@manaswis how do you feel about that modification and its justification?

misaugstad commented 6 years ago

Just want to add the rationale behind the user agreement analysis here: high agreement is meant as a proxy for quality of data (since we know that higher agreement means higher precision). We can also use the characteristics of the streets with high/low agreement to characterize the difficulty of certain types of routes.