Open misaugstad opened 4 years ago
This should probably be generalized a little: compute a "labels per meter" stat for each street, and weight streets with fewer labels per meter higher.
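A rough sketch of what a "labels per meter" weight could look like, just to make the idea concrete. Everything here is an assumption for illustration: the `Street` fields, the `density_weight` name, and the `reference_density` parameter are hypothetical and not taken from the actual codebase.

```python
from dataclasses import dataclass


@dataclass
class Street:
    # Hypothetical fields; the real street table may look different.
    street_id: int
    length_m: float
    label_count: int


def labels_per_meter(street: Street) -> float:
    """Label density of a street, in labels per meter."""
    if street.length_m <= 0:
        raise ValueError("street length must be positive")
    return street.label_count / street.length_m


def density_weight(street: Street, reference_density: float) -> float:
    """Weight in [0, 1] that grows as label density falls.

    Returns 1.0 for a street with no labels and 0.0 for a street at or
    above reference_density (an assumed tuning parameter).
    """
    if reference_density <= 0:
        raise ValueError("reference_density must be positive")
    return max(0.0, 1.0 - labels_per_meter(street) / reference_density)
```

A linear falloff is only one option; any function that decreases as density rises would fit the description above.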
Just came here to say that you are essentially replying to yourself nearly one year later! :D
Love these long-era self-communications.
There are a few edge cases to take into account here.
We can probably deal with the first point by just excluding streets that are very short.
We'll have to think about a formula that doesn't cause an issue with point 2. Maybe the whole goal of this issue is to deal with streets that have almost no labels, and so we just increase priority for streets with fewer than 4 (that num is arbitrary) labels, excluding very short streets.
And I think that streets should only have their priority increased if only one "high quality" user has audited the street. Once a second high quality user has audited it, we don't have to increase priority anymore. OR we can increase priority less drastically. I was imagining that if one high quality user has audited the street but there are very few labels, we set priority to 1, treating it as if it hasn't been audited before. Once a second set of eyes agrees that there shouldn't be labels on the street, we don't need to keep showing it to people.
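Putting the pieces from the last few comments together, the rule might look something like the sketch below. The function name, its parameters, and the `MIN_LENGTH_M` cutoff are all hypothetical; the "fewer than 4 labels" threshold is the arbitrary number mentioned above, and this only covers the "set priority to 1" variant, not the "increase less drastically" one.

```python
# Assumed cutoff for "very short" streets; the thread does not give a value.
MIN_LENGTH_M = 20.0
# "Fewer than 4 labels" comes from the discussion and is explicitly arbitrary.
FEW_LABELS = 4


def adjusted_priority(
    base_priority: float,
    length_m: float,
    label_count: int,
    high_quality_audits: int,
) -> float:
    """Return the audit priority after the low-label adjustment.

    base_priority is whatever the existing algorithm produced.
    """
    # Very short streets are excluded: few labels there is not suspicious.
    if length_m < MIN_LENGTH_M:
        return base_priority
    # Exactly one high quality audit and almost no labels: treat the
    # street as if it had never been audited.
    if high_quality_audits == 1 and label_count < FEW_LABELS:
        return 1.0
    # Zero audits, or a second high quality user already looked at the
    # street: leave the existing priority alone.
    return base_priority
```

For example, `adjusted_priority(0.3, 150.0, 1, 1)` would return `1.0`, while the same street after a second high quality audit would keep its `0.3`.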
One thing we worry about when thinking about the quality of our data is the prevalence of false negatives (the number of missed labels); false positives are not as big of a deal, because human and/or computer validation can rule them out easily enough. As such, I think we should slightly increase the audit priority for streets that have no labels associated with them.
What we do right now is score users based on their accuracy (from validations) and labeling frequency. Then we determine audit priority for streets entirely from how many times they were audited (and whether or not they were audited by users with a high accuracy). I think this is really good and gets us most of the way there. However, high quality auditors can still make mistakes. It's also possible that GSV could route them in a weird way that made it hard for them to label or confused them. Maybe an auditor who isn't actually that good slipped through our system and was marked as "high quality" even though they were skipping some streets. Maybe there is a generally good auditor, but they don't remember to place "No Sidewalk" labels, so there are no labels for some streets.
Anyway, I think there are a lot of reasons for a street to show up with no labels, and false negatives are something we really want to avoid. Which makes me think that we should slightly adjust the algorithm to add some more weight to streets with no labels (assuming this doesn't take TOO much time).