ProjectSidewalk / SidewalkWebpage

Project Sidewalk web page
http://projectsidewalk.org
MIT License

Verification Strategies -- Thoughts and Planning for the Future. #1199

Open jonfroehlich opened 6 years ago

jonfroehlich commented 6 years ago

Verification is crucial to crowdsourcing. Our current PS workflow doesn't support even the most minimal verification (beyond, say, techniques related to majority vote). Ideally, we would have a more sophisticated architecture where verifications are baked in--for example, after completing some labeling missions, users are presented with a mini-game for rapid verification, or, if a user visits on a touchscreen, they immediately go into a verification workflow.

Below, I want to highlight some key issues related to verification--which is relevant to the entire PS team but particularly whoever takes verification as their key area next. I think this is a really rich and interesting problem space and I'm guessing there's a publication in here somewhere.

We need to stay abreast of the research literature in this area--particularly verification work published at CSCW, HCOMP, and in citizen science venues. We should attempt to adopt state-of-the-art methods and remix them for our context.

One approach will likely include manually verifying labels (in one capacity or another) or, in some cases, perhaps clusters (though clusters have the problem of not mapping easily back to a GSV pano). Verification is a multi-faceted problem, including: (1) how do you choose and prioritize which labels to verify? (2) what sort of interface do you use for verification? (3) can we use machine learning/computer vision to aid this process (e.g., by providing some confidence measure that a label is correct or incorrect)? (4) how do you use the verification data to improve data quality overall? Whatever approach we build should likely be adaptive--as new verifications occur, this should, in real time, impact our prioritization algorithms.

For (1), I could imagine multiple strategies, but the overall goal should likely be selecting labels where the information gain from verification is highest (though we have to define what this means). For example, it could mean that we want enough verifications per user such that we can accurately predict that user's labeling quality--this is a worker-centric view. Another example might be that we have some model of how labels should look on a street--and we prioritize labels that do not fit this model. (These are just a few ideas--many other verification workflows should be brainstormed and discussed.)
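To make the worker-centric view concrete, here is a minimal sketch in Python; `WorkerStats`, `lbl.user_id`, and the rest are hypothetical names for illustration, not existing PS code. The idea is to prioritize labels from workers whose accuracy estimate is still most uncertain, so each verification carries high information gain:

```python
from dataclasses import dataclass
import math

@dataclass
class WorkerStats:
    agrees: int      # verifications that agreed with this worker's labels
    disagrees: int   # verifications that disagreed

def accuracy_uncertainty(stats: WorkerStats) -> float:
    """Std. dev. of a Beta(agrees + 1, disagrees + 1) posterior over the worker's
    labeling accuracy; higher means we still know little about this worker."""
    a, b = stats.agrees + 1, stats.disagrees + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def pick_labels_to_verify(labels, worker_stats, k=10):
    """Rank unverified labels by how uncertain we are about their author's quality."""
    return sorted(labels,
                  key=lambda lbl: accuracy_uncertainty(worker_stats[lbl.user_id]),
                  reverse=True)[:k]
```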

For (2), I've asked @adash12 to work on fast-verification interfaces on touchscreens--this is because I want to capture the work of the ~20% of users that visit via touchscreens, but also because I think touchscreens might actually be a nice, fast way of doing rapid verifications. One simple interface would be to show a cropped area where a problem has been labeled and ask "Is this a Curb Ramp?" with a big Yes | No at the bottom of the interface. (We could also try Yes | No | Don't Know, or even capture confidence in the judgment.) What other verification UIs should we try?
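Whichever UI variant we try, the record captured per judgment could stay the same. A rough sketch of what a single rapid-validation response might hold--field names here are placeholders, not the actual PS schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ValidationResult(Enum):
    AGREE = "agree"        # "Yes, this is a curb ramp"
    DISAGREE = "disagree"  # "No, it is not"
    UNSURE = "unsure"      # "Don't know"

@dataclass
class ValidationResponse:
    label_id: int
    validator_id: str
    result: ValidationResult
    confidence: Optional[float] = None   # e.g., 0-1, only if the UI captures it
    time_spent_ms: Optional[int] = None  # handy for speed/accuracy trade-off analysis
```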

For (3), there are lots of ways to do this--(one is mentioned in the "For (1)" paragraph above)--but another is to try to train a CNN (or some other ML model) using the ground truth data that we have (+ probably researcher data so that we have more GT samples) that then attempts to evaluate each label for accuracy. We would then try to verify those labels that our computer vision model thinks are wrong or has low confidence in.
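A minimal sketch of that queuing step, assuming we have some trained model that scores each label--`model`, `crop_fn`, and the label attributes below are stand-ins, not existing PS components. Labels the model is least confident are correct go to the front of the manual-verification queue:

```python
def queue_by_cv_confidence(labels, model, crop_fn):
    """Order labels for manual verification, least CV-confident first."""
    scored = []
    for lbl in labels:
        crop = crop_fn(lbl)                        # cropped GSV pano region around the label
        p_correct = model.prob_label_correct(crop)  # model's belief (0-1) the label is right
        scored.append((p_correct, lbl))
    scored.sort(key=lambda pair: pair[0])          # lowest confidence first
    return [lbl for _, lbl in scored]
```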

misaugstad commented 6 years ago

@jonfroehlich We have another Github issue as a discussion thread for verification strategies ( #535 ), do you mind if I c/p your text over there?

jonfroehlich commented 6 years ago

Hold off for now


jonfroehlich commented 5 years ago

In a now-closed issue, I discussed this more, including:

How do we choose what gets validated and when and by whom? I think this question is really interesting and may involve algorithms from optimization, reputation systems, etc. For example, our system should have ongoing inferences about a worker's quality, which is then strengthened or weakened by validations. I could also imagine using our CV subsystem--which Galen and Esther are currently working on--to help prioritize what gets validated.

  • How much validation do we need per label?
  • How do we use the validated data in routing other users and in scoring worker quality?
  • How can we use CV in the validation process? Could CV help us auto-validate some labels? Help prioritize what gets manually validated? ...

And also (link):

Dumping more thoughts on this...

How do we choose what gets validated and when and by whom? I think this question is really interesting and may involve algorithms from optimization, reputation systems, etc. For example, our system should have ongoing inferences about a worker's quality, which is then strengthened or weakened by validations. I could also imagine using our CV subsystem--which Galen and Esther are currently working on--to help prioritize what gets validated.
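One simple way to keep an ongoing inference about a worker's quality that each validation strengthens or weakens is a running Beta-Bernoulli estimate. A sketch under that assumption--illustrative only, not how PS currently scores workers:

```python
class WorkerQuality:
    """Running Beta-Bernoulli estimate of a worker's labeling accuracy."""

    def __init__(self, prior_agree: float = 2.0, prior_disagree: float = 1.0):
        # Mild prior that workers are more often right than wrong.
        self.agree = prior_agree
        self.disagree = prior_disagree

    def record_validation(self, agreed: bool) -> None:
        # Each validation nudges the estimate up (agree) or down (disagree).
        if agreed:
            self.agree += 1
        else:
            self.disagree += 1

    @property
    def estimate(self) -> float:
        """Posterior mean probability that this worker's labels are correct."""
        return self.agree / (self.agree + self.disagree)
```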

First, I suspect (and we could potentially investigate via an online experiment) that it is best to batch validations by label type. In other words, single validation missions are limited to validating just curb ramps or just surface problems. (Other batching strategies are also possible: e.g., batching labels in the order a user applied them so there is some spatial context, or batching them by neighborhood--but I think batching by label type is best.)
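A small sketch of that batching, assuming hypothetical `label_type` and `mission_size` fields: group unvalidated labels by type, then chunk each group into fixed-size validation missions.

```python
from itertools import groupby

def batch_by_label_type(labels, mission_size=10):
    """Group unvalidated labels by type, then chunk each group into missions."""
    labels = sorted(labels, key=lambda lbl: lbl.label_type)
    missions = []
    for _, group in groupby(labels, key=lambda lbl: lbl.label_type):
        group = list(group)
        missions.extend(group[i:i + mission_size]
                        for i in range(0, len(group), mission_size))
    return missions  # each mission contains only one label type (e.g., all curb ramps)
```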

Second, in terms of queuing labels for validation (i.e., prioritization): I think our first implementation should likely just rely on our CV algorithm's confidence output, where we, for example, validate ~4 low-confidence labels, 3 middle-confidence, and 3 high-confidence (I don't want to do all low confidence because I think it feels good to the user to validate some positives and negatives). Obviously, future prioritization algorithms should take into account ongoing inferences about worker quality (i.e., reputation), geographic area (e.g., so we have coverage but could also consider important POIs like hospitals and schools), and perhaps even temporal qualities about a worker (e.g., some early labels and some later labels to study things like learning effects and fatigue). Update (1/16): Actually, I think the initial label-queuing algorithm for validation should include something about worker reputation--even something simple that dynamically weights labels based on a worker's accuracy (e.g., workers with lower accuracy scores--that is, a higher percentage of 'Disagree' votes for their labels--should get priority, but this should be done with some randomness so new workers and workers with good reputations also get validations).
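Putting those two ideas together, here is one possible sketch of building a 10-label validation mission: draw ~4/3/3 labels from low/middle/high CV-confidence strata, and within each stratum weight selection toward labels from lower-reputation workers while keeping some randomness. `cv_confidence`, `worker_quality`, and the thresholds are all hypothetical knobs, not existing PS code.

```python
import random

def build_validation_mission(labels, cv_confidence, worker_quality,
                             n_low=4, n_mid=3, n_high=3):
    """Draw a mission from CV-confidence strata, weighted toward low-reputation workers."""
    low  = [l for l in labels if cv_confidence[l.label_id] < 0.4]
    mid  = [l for l in labels if 0.4 <= cv_confidence[l.label_id] < 0.7]
    high = [l for l in labels if cv_confidence[l.label_id] >= 0.7]

    def weighted_sample(pool, n):
        pool, chosen = list(pool), []
        for _ in range(min(n, len(pool))):
            # Lower worker accuracy => higher weight; the +0.1 floor keeps some
            # randomness so new and high-reputation workers still get validated.
            weights = [1.0 - worker_quality.get(l.user_id, 0.5) + 0.1 for l in pool]
            pick = random.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(pick))
        return chosen

    return (weighted_sample(low, n_low) + weighted_sample(mid, n_mid)
            + weighted_sample(high, n_high))
```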

Third, I think we should likely develop more than one validation interface (perhaps the one @aileenzeng is working on plus my original proposal--which we kind of implemented in Tohme, see the end of that paper). We can then A/B test what works best. This also reminds me of a Bernstein paper (I have to find it) where they had an interface for super-rapid validation that they knew had high error, but because it was so fast, it didn't matter (i.e., it's simply an optimization problem of speed, error, # of workers, and required accuracy).

Fourth, we have an issue where we are no longer updating our old DC server (legacy code) and there is no way to import the labels into the new PS architecture. Thus, there is no way to validate the labels from our DC deployment (and perhaps there never will be unless we make an additional tool to do so); however, I think that's ok because we can still train on those old labels and then use the trained model to classify new incoming labels on our new deployments.