I like this idea, especially showing them in between audits.
In issue #1076 @jonfroehlich said:
We have been discussing quality control methods. There are a ton of possibilities here, including:
- Verification interfaces
- Analyzing behavior of worker via interaction logs
- Performing statistical verification of labeling activity. That is, does a labeler's mission have a label distribution similar to prior routes in that (i) neighborhood or (ii) neighborhood land use type? This seems relatively easy. (Sketched below.)
We could explore some of these further through offline investigations.
For ideas, see Quinn and Bederson's overview of human computation (link). There are lots of research papers on quality control methods for online work that we should examine for ideas as well.
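To make the statistical-verification bullet above concrete, here is a rough sketch (not anything implemented in Project Sidewalk; the label-type names, counts, and threshold are made up for illustration): compare a mission's label-type counts against the neighborhood's historical distribution with a chi-square goodness-of-fit statistic and flag missions that diverge sharply.

```javascript
// Rough sketch: flag a mission whose label-type distribution differs sharply
// from the neighborhood's history. Counts are plain objects keyed by label type.
function chiSquareStatistic(missionCounts, neighborhoodCounts) {
  const labelTypes = Object.keys(neighborhoodCounts);
  const missionTotal = labelTypes.reduce((s, t) => s + (missionCounts[t] || 0), 0);
  const neighborhoodTotal = labelTypes.reduce((s, t) => s + neighborhoodCounts[t], 0);
  let chi2 = 0;
  for (const type of labelTypes) {
    // Expected count if the mission followed the neighborhood's proportions.
    const expected = missionTotal * (neighborhoodCounts[type] / neighborhoodTotal);
    if (expected === 0) continue;
    const observed = missionCounts[type] || 0;
    chi2 += Math.pow(observed - expected, 2) / expected;
  }
  return chi2;
}

// Example with made-up counts. The critical value 11.34 corresponds to
// roughly p < 0.01 with 3 degrees of freedom (4 label types - 1).
const suspicious = chiSquareStatistic(
  { CurbRamp: 2, NoCurbRamp: 20, Obstacle: 0, SurfaceProblem: 1 },
  { CurbRamp: 500, NoCurbRamp: 60, Obstacle: 90, SurfaceProblem: 150 }
) > 11.34;
```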
Another verification tool idea that we have all discussed before is a Tinder-style app/page where it shows you one image with a label on it at a time, and you do a quick up/down vote. We have talked about this as a great mini-tool that people can use while sitting in a waiting room, on the elevator, etc.
We are currently trying to get a mobile version of the tool working (see #282 ), but we have also talked about this as just replacing the current auditing tool for mobile users, if the full tool turns out to be too complex for a cell phone form factor.
Another thing that @jonfroehlich mentioned in a meeting we had was an admin tool where we subsample some number of crowd workers and manually review their work by walking through their routes or reviewing a subset of their labels (in particular, by someone who has read through the labeling codebook; see #961).
@jonfroehlich my notes on that meeting are incomplete, and I wasn't quite sure what you meant at the time. If what I just said above reminds you of what you were talking about, could you elaborate a bit? (this meeting was ~2 weeks ago, so I would not expect you to necessarily remember)
Did some preliminary mockups:
I personally like the designs that show fewer pictures at a time. I think they give users more context for what’s going on in the scene, and it might simplify any keyboard controls that we might add later.
I also want to have a feature that allows us to show common examples (picture + short description) of each label type. The labeling guide may be extensive, but I think having a quick reference users can check is important so they can be more confident that they're classifying labels correctly. We could also link to the Labeling Guide if they want more extensive documentation.
I think it would also be nice to have an 'unclear' button. As I was reviewing Obstacle in Path/Surface Problem labels, there were several that I felt unsure about. This might also help us figure out what types of scenarios are borderline and how to deal with them.
Love these mocks. Aileen and I reviewed them today in person and sketched out a rough draft of another design (which is an amalgamation of many of her great ideas). @aileenzeng, perhaps you can post my (awful) sketch when you get a chance.
It's definitely not awful - gets the point across nicely! 😁
Yes yes yes I love all of this! And this most recent mockup really does bring together a lot of my favorite parts of the earlier ones! :smile: I think this really brings together the pieces that I thought were necessary (large SV image, fairly big buttons for yes/no/unsure, documentation on what is/isn't a curb ramp, and a comment field)!
Here's a more detailed version of what the audit page could look like:
I was thinking that the user could scroll up/down through the documentation (since we can't display that much info at once), but it's a little difficult to tell from this mock. We could maybe add a button that jumps below the audit screen so they can see the entire what-is/is-not-a-curb-ramp section at once. Alternatively, it could also link to the labeling guide, although I was thinking that the info here would be a little more picture-dense.
Here's what the bottom screen portion might look like:
(for the bottom images, I'd get rid of the curb ramp labels for the real thing)
@aileenzeng and I talked about this in person. Just to quickly summarize:
I'm sure @misaugstad has other amazing advice and insights (as usual). Also, he loves frontend work!!!
Thanks for all the feedback! Here are two more ideas: (the arrows are a little funky on this one)
I really like that 2nd one!!
From @jonfroehlich's email:
I like the general direction of the interface so far, but I was thinking that when we ask someone to verify a problem and they say no, we should immediately ask a follow-up question to gather more data. So for example, "Is this a surface problem?" If the user says no, then we immediately ask, "Is this a sidewalk accessibility problem?"
So the key point here is that we can have multiple question answering stages based on the label type and what the user responds.
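To make the multi-stage flow concrete, here is a minimal sketch (the mapping structure is hypothetical, and the follow-up wording for anything other than the surface-problem example from the email is made up):

```javascript
// Hypothetical follow-up questions keyed by label type; shown only when the
// user answers "No" to the first question ("Is this a <label type>?").
const followUpQuestions = {
  SurfaceProblem: "Is this a sidewalk accessibility problem?",
  Obstacle:       "Is this a sidewalk accessibility problem?",
  CurbRamp:       "Is there a curb ramp anywhere in this image?",
  NoCurbRamp:     "Is this a street corner without a curb ramp?"
};

// Returns the next question to show, or null if the validation is finished.
function nextQuestion(labelType, answeredNo) {
  if (!answeredNo) return null;                 // "Yes" ends the flow.
  return followUpQuestions[labelType] || null;  // "No" may trigger a follow-up.
}

// Example: user says "No" to "Is this a surface problem?"
console.log(nextQuestion("SurfaceProblem", true));
// -> "Is this a sidewalk accessibility problem?"
```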
Here's my to-do list - I've tried to break up each task into a lot of small tasks. If there aren't a lot of indented checkboxes, that means that I haven't looked at the problem closely yet. I'll update this as I'm working.
**Mission infrastructure-related tasks**
- Backend
- Frontend
  - `validation.scala.html` file
  - `Main.js`

**Label tasks**
- Backend
  - Endpoint that sends a label to the frontend (`label_id` as a parameter; shouldn't need `user_id` and that sort of info just to load the label onto the panorama)
  - Route (`validate/:labelId`) that loads the validation interface w/ a given ID. (Unsure if we'll keep this for the long run, but I think this might be helpful for testing.)
- Frontend

**Logging data**
- `validation_task_interaction` table
- `Tracker.js` file that keeps track of all logged events
- `Form.js` file that can post data
  - `Form.js` works by sending a POST request every time you hit the agree/disagree buttons
  - `compileSubmissionData` function that organizes label data / any other data that we're sending back (rough sketch below)
  - `ValidationTaskController.scala` so that it can parse data and send it to the appropriate controllers for adding to tables

**Panorama tasks**
I still need to break this down a lot more, but here's the general idea for what I'm hoping to do.
- `gsv_data` table that marks panoramas as existing/not-existing + a timestamp for when we last updated that column

**Other**
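Circling back to the `Form.js` / `compileSubmissionData` items under Logging data above, here is a rough sketch of how those pieces could fit together (field names and the endpoint are placeholders, not the final Project Sidewalk API):

```javascript
// Sketch only: field names and the route are placeholders.
function compileSubmissionData(label, validationResult, missionId) {
  return {
    labelId: label.labelId,
    missionId: missionId,
    validationResult: validationResult,  // 1: agree, 2: disagree, 3: unclear
    startTimestamp: label.shownAt,       // when the label appeared on screen
    endTimestamp: Date.now()             // when the user hit a button
  };
}

// Form.js-style POST, fired every time the user hits agree/disagree/unclear.
function postValidation(data) {
  return fetch("/validation/submit", {   // placeholder route
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(data)
  }).then(response => response.json());
}
```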
This looks good @aileenzeng, thanks for putting this together. One thing that is missing is a notion of timeline--when do you think you could get some of this stuff done? The $1m question, I know. ;-)
I want to get the mission infrastructure/label/logging tasks done by Nov. 24th (latest), but will make a push to get it done by Nov. 18th. I'll take care of the panorama stuff next, but am still a little hesitant about giving an end date.
Ok, thanks.
I've been thinking more about this.
I also think--which we've discussed many times before--that validation is a perfect task for the mobile phone. I'm really hoping that we can utilize the ~20% of traffic that is mobile only and get them to do some fun validation work! :)
I agree with all the points. Some of these points came up in the meeting this week when we were going over Aileen's implementation.
> I also think--which we've discussed many times before--that validation is a perfect task for the mobile phone. I'm really hoping that we can utilize the ~20% of traffic that is mobile only and get them to do some fun validation work! :)
I strongly agree! I feel this is a good starting point for venturing into mobile interfaces, instead of creating a full-fledged labeling interface, which is a harder task than a validation interface. And there is a lot of scope for gamification. This could easily be a great undergrad project! We should advertise this project specifically more; it's easy to understand, approachable, and not very intimidating.
> I strongly agree! I feel this is a good starting point for venturing into mobile interfaces, instead of creating a full-fledged labeling interface, which is a harder task than a validation interface. And there is a lot of scope for gamification. This could easily be a great undergrad project! We should advertise this project specifically more; it's easy to understand, approachable, and not very intimidating.
Also agree. I was thinking @aileenzeng might take this on after she finishes the web-based validation stuff... :) But we could also discuss recruiting an additional student (I just think they'd have to have a fair bit of dev experience to be able to contribute to this...).
Dumping more thoughts on this...
How do we choose what gets validated, when, and by whom? I think this question is really interesting and may involve algorithms from optimization, reputation systems, etc. For example, our system should maintain ongoing inferences about a worker's quality, which are then strengthened or weakened by validations. I could also imagine using our CV subsystem--which Galen and Esther are currently working on--to help prioritize what gets validated.
First, I suspect (and we could potentially investigate via an online experiment) that it is best to batch validations by label type. In other words, single validation missions are limited to validating just curb ramps or just surface problems. (Other batch strategies are also possible: e.g., batching them in the order a user applied them so there is some spatial context or batching them by neighborhood--but I think batching by label type is best).
Second, in terms of queuing labels for validation (i.e., prioritization): I think our first implementation should likely just rely on our CV algorithm's confidence output where we, for example, validate ~4 low-confidence labels, 3 middle-confidence, and 3 high-confidence (I don't want to do all low confidence because I think it feels good to the user to validate some positives and negatives). Obviously, future prioritization algorithms should take into account ongoing inferences about worker quality (i.e., reputation), geographic area (e.g., so we have coverage but also could consider important POIs like hospitals and schools), and perhaps even temporal qualities about a worker (e.g., some early labels and some later labels to study things like learning effects and fatigue). Update (1/16): Actually, I think the initial label queuing algorithm for validation should include something about worker reputation--even something simple that dynamically weights labels based on a worker's accuracy (e.g., workers with lower accuracy scores--that is, a higher percentage of 'Disagree' votes for their labels--should get priority, but this should be done with some randomness so new workers and workers with good reputations also get validations).
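A rough sketch of the ~4 low / 3 middle / 3 high confidence mix with a simple reputation weight, as described above (the data shapes, thresholds, and weights are assumptions, not repo code):

```javascript
// Sketch of the queuing idea. Each candidate label is assumed to carry a CV
// confidence in [0, 1] and the labeler's current accuracy estimate in [0, 1];
// bucket thresholds and the randomness weight are placeholders.
function buildValidationMission(candidates, missionSize = 10) {
  const low  = candidates.filter(l => l.cvConfidence < 0.3);
  const mid  = candidates.filter(l => l.cvConfidence >= 0.3 && l.cvConfidence < 0.7);
  const high = candidates.filter(l => l.cvConfidence >= 0.7);

  // Weight labels toward lower-accuracy workers, with some randomness so new
  // workers and workers with good reputations still get validated.
  const pick = (pool, n) =>
    pool
      .map(l => ({ l, w: (1 - l.workerAccuracy) + 0.25 * Math.random() }))
      .sort((a, b) => b.w - a.w)
      .slice(0, n)
      .map(x => x.l);

  const mission = [...pick(low, 4), ...pick(mid, 3), ...pick(high, 3)];

  // Top up from the full candidate pool if any bucket was too small.
  for (const l of candidates) {
    if (mission.length >= missionSize) break;
    if (!mission.includes(l)) mission.push(l);
  }
  return mission.slice(0, missionSize);
}
```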
Third, I think we should likely develop more than one validation interface (perhaps the one @aileenzeng is working on plus my original proposal--which we kind of implemented in Tohme; see the end of the paper). We can then A/B test what works best. This also reminds me of a Bernstein paper (have to find it) where they had an interface for super rapid validation that they knew had high error, but because it was so fast, it didn't matter (i.e., it's simply an optimization problem of speed, error, # of workers, and required accuracy).
Fourth, we have an issue where we are no longer updating our old DC server (legacy code) and there is no way to import the labels into the new PS architecture. Thus, there is no way to validate the labels from our DC deployment (and perhaps there never will be unless we make an additional tool to do so); however, I think that's ok because we can still train on those old labels and then use the trained model to classify new incoming labels on our new deployments.
Hi all,
I think I'm getting very close to finishing implementing functionality for the validation interface (hooray)! The main mid/high-priority items left on my to-dos are:
- Checking if panos exist before sending label ids to the validation interface.
- Returning a list of labels to be validated rather than just one label at a time.
- Finishing up logging data (very fast - will be done last)

I was wondering what my next steps should be. Should I start drafting instructions for testing? Or should we worry about any UI polishing? For reference, the validation interface currently looks like this:
(I'm basing the design off this mockup)
I would focus on a full end-to-end MVP rather than polish. We don’t need any UI improvements until we actually start using this thing. So, resolving the three bullet points should take priority.
Also, how are you choosing which labels to get validated? That seems like a crucial step (but certainly one we can iterate on after MVP).
It's done!
> Also, how are you choosing which labels to get validated? That seems like a crucial step (but certainly one we can iterate on after MVP).
Oops - sorry for not getting back to you on this. Right now, labels are being selected randomly. We run a check on the backend to see if the pano exists or not (if it doesn't, then we select a new random label).
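A rough sketch of that retry loop (shown in JavaScript for illustration; in the codebase the check lives in the Scala backend, and both helper functions here are hypothetical, not actual Project Sidewalk functions):

```javascript
// Illustrative only: pickRandomLabel and panoExists stand in for the real
// backend queries.
async function selectLabelForValidation(maxAttempts = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const label = await pickRandomLabel();        // random row from the label table
    if (await panoExists(label.gsvPanoramaId)) {  // skip labels whose pano is gone
      return label;
    }
  }
  return null; // give up after too many expired panoramas
}
```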
What's done? You have a full end-to-end MVP working? If so, woohoo!
> Oops - sorry for not getting back to you on this. Right now, labels are being selected randomly. We run a check on the backend to see if the pano exists or not (if it doesn't, then we select a new random label).
Fine for now but see https://github.com/ProjectSidewalk/SidewalkWebpage/issues/535#issuecomment-447386308
Yep! :) I’m going to work on some testing instructions next, but won’t be at a computer for the next few days.
OK, great. Looking forward to it!
Here are some testing instructions (sorry for the delay!). I'm putting them in here and not in a PR because I'm not sure if it's ready for a PR yet:
New tables:
- `label_validation`: records the validation result for each label and timestamps for when the user started/finished validating a label. Also includes `user_id` and `mission_id` columns.
- `validation_options`: validation result options (1: agree, 2: disagree, 3: unclear)
- `validation_task_comment`: records feedback that is submitted by clicking the feedback button. This table is the same as `audit_task_comment`.
- `validation_task_interaction`: records user interactions on the `/validate` endpoint. Similar to the `audit_task_interaction` table.

Testing steps:
1. Check out `535-create-validation-interface`. You might want to use a small dump for testing purposes!
2. Clear out the `label` table.
3. Go to `http://0.0.0.0:9000/audit` and place one label onto the panorama.
4. Go to `http://0.0.0.0:9000/validate`. Click any of the agree/disagree/unclear buttons a few times and make sure that nothing looks weird (blank GSV panorama, label not loading, etc...)
5. Go to `http://0.0.0.0:9000/audit` and place one label onto the panorama in a different spot than the label in step 3. Then, delete the label. Check the `label` table to confirm that this label exists and has the `delete` column marked.
6. Go to `http://0.0.0.0:9000/validate`. Hit the agree/disagree/unclear buttons a few times. The only label that you should see on the screen is the label from step 3 (the label from step 5 should never appear on the screen).
7. Go to `http://0.0.0.0:9000/audit` and audit a few streets / complete a mission or two. Place labels at a variety of zoom levels as well as some incorrect ones if you feel like it :)
8. Go to `http://0.0.0.0:9000/validate`. When you refresh, you should have a larger variety of labels remaining. (I left in the `console.log` statements that show the different `labelId`s of the labels loaded onto the screen.)
9. Check the `label_validation` table to see that these labels were validated correctly. In the `validation_result` column, there should be 1 for agree, 2 for disagree, and 3 for unclear.
10. Check `validation_task_interaction` to make sure that these interactions were recorded correctly. There should be a `ValidationButtonClick_` event if you clicked a button or a `ValidationKeyboardShortcut_` event if you used keyboard shortcuts.
11. Skip a label, then check the `validation_task_interaction` table to make sure the `ModalSkip_Click` action is logged.
12. Submit some feedback with the feedback button; it should show up in the `validation_task_comment` table in the `comment` column. Also check the `validation_task_interaction` table. It should have `ModalComment_ClickOK` and `ModalComment_ClickFeedback`.
13. Type `svv.panorama.getProperty("panoId")` into the Chrome console to get the current panorama id. Copy the panorama id into the following query:

    ```sql
    SELECT * FROM gsv_data WHERE gsv_panorama_id LIKE '<panoId>';
    ```

    There should be a `last_viewed` column that shows the last time this panorama was viewed, and the `expired` column should be marked as false.

Here is a query for checking the `validation_task_interaction` table that filters out `LowLevelEvent_` and `POV_Changed` interactions:

```sql
SELECT *
FROM validation_task_interaction
WHERE validation_task_interaction.action
    NOT IN ('LowLevelEvent_mousemove',
            'LowLevelEvent_mouseover',
            'LowLevelEvent_mouseout',
            'LowLevelEvent_mouseup',
            'LowLevelEvent_mousedown',
            'LowLevelEvent_keydown',
            'LowLevelEvent_keyup',
            'LowLevelEvent_click',
            'POV_Changed')
ORDER BY TIMESTAMP DESC
```
@aileenzeng can you just add screenshots of what it looks like right now?
Yep! Updated the comment.
This looks really great @aileenzeng !!! It works really well, it looks like you put a lot of effort into it!!
> This looks really great @aileenzeng !!! It works really well, it looks like you put a lot of effort into it!!
Wo0t! Go @aileenzeng, go!
Hooray! Thanks for the support everyone.
@misaugstad Thanks for bringing up those two points - I'll start addressing those!
I also noticed that we're not directly logging information about what the user's screen looks like when they've validated a label in the `label_validation` table. Do we want to record information like heading, pitch, and zoom there (or do we want a new table that stores that information)?
@aileenzeng hmmm, that is tough. I was originally leaning towards keeping it as part of the `label_validation` table since that info is included in the `label` table.
But now I'm thinking that it isn't as relevant/important in validation. It might also be confused with the actual heading/pitch/zoom from the user who placed the label in the first place...
I'm honestly on the fence. @aileenzeng how about we go with whatever you think would be best (be that the best design, the easiest to implement, the easiest to work with in the future).
> I also noticed that we're not directly logging information about what the user's screen looks like when they've validated a label in the `label_validation` table. Do we want to record information like heading, pitch, and zoom there (or do we want a new table that stores that information)?
Yes, we will need to log this and should do so comprehensively (just like we do for the auditing interface)
We probably want to add a delay so that users can't rapid-fire hit a button without actually looking at the pano. Also so that one cannot accidentally hit the button twice and "validate" the second label without meaning to.
I've added a delay that makes the user wait 850ms between validations. It doesn't look like Google StreetView has any listeners that will let us know when the panorama has loaded yet (which would be the more ideal solution to this problem).
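For reference, a minimal version of that guard could look something like this (a sketch, not the actual implementation in the repo; `submitValidation` is a placeholder):

```javascript
// Ignore agree/disagree/unclear clicks that arrive less than MIN_DELAY_MS
// after the previous accepted validation.
const MIN_DELAY_MS = 850;
let lastValidationTime = 0;

function handleValidationClick(result) {
  const now = Date.now();
  if (now - lastValidationTime < MIN_DELAY_MS) {
    return;                 // too soon -- likely an accidental double click
  }
  lastValidationTime = now;
  submitValidation(result); // placeholder for the real submission call
}
```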
You can do some weird stuff with the keyboard shortcuts if you really try to break it. Like if I hold A, tap D, then hold D, then tap N, then hold N, I have now validated 3 labels and all 3 buttons look like they have been pressed 😆 I don't think this is super important though, no one should be doing that...
This should be fixed now!
> I also noticed that we're not directly logging information about what the user's screen looks like when they've validated a label in the `label_validation` table. Do we want to record information like heading, pitch, and zoom there (or do we want a new table that stores that information)?
>
> Yes, we will need to log this and should do so comprehensively (just like we do for the auditing interface).
Now the `label_validation` table also has the following columns:
- `canvas_x`: the x-coordinate for the upper left corner of the bounding box for the label
- `canvas_y`: the y-coordinate for the upper left corner of the bounding box for the label
- `heading`: user heading
- `pitch`: user pitch
- `zoom`: user zoom
- `canvas_height`: height of the GSV Panorama (always 410px)
- `canvas_width`: width of the GSV Panorama (always 720px)
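For anyone wiring this up later, one validation row roughly corresponds to an object like the following (illustrative only; property names mirror the columns above, and the values are made up):

```javascript
// Illustrative shape of one label_validation row as it might be assembled on
// the client before being posted.
const validationRow = {
  labelId: 1423,
  validationResult: 1, // 1: agree, 2: disagree, 3: unclear
  canvasX: 355,        // upper left corner of the label's bounding box
  canvasY: 190,
  heading: 182.5,      // user heading when the button was hit
  pitch: -12.3,        // user pitch
  zoom: 2.1,           // user zoom (1.1, 2.1, or 3.1)
  canvasHeight: 410,   // GSV panorama is always 410px tall
  canvasWidth: 720     // ...and 720px wide
};
```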
To test:
- Go to `http://0.0.0.0:9000/validate`. Validate some labels. You shouldn't see the tutorial labels ever! (You should only get that single label that you had placed on the screen.)
- In the `console.log` statements, you should see something like `TOP: ____ LEFT: ____`. Use the element selector in Chrome to select the label. Check that the values for `TOP` and `LEFT` are within +/- 0.5 of the `top` and `left` CSS attributes for the label.
- Check the new columns in the `label_validation` table.
- For the `zoom` columns in the `label_validation` or `validation_task_interaction` tables, the values should always be 1.1, 2.1, or 3.1.

@aileenzeng this all seems to be working as you describe!
@aileenzeng It looks like this line in your evolution file is missing a semicolon!
```sql
DROP TABLE validation_task_comment
```
Oops - thank you! Good catch!
We talked about this in our UIST'14 paper but I wonder if it would be worth exploring again.
Imagine having a review data tab on Project Sidewalk that shows a grid of cropped images with their labels. The interface would allow the user to select between showing different label types. You could then quickly verify or correct mislabeled items (maybe vote up or down on whether you agree)--kind of like this Picasa interface:
We could also show a subset of this interface in between labeling tasks to break up redundancy of auditing.
A few benefits: