ProjectSidewalk / sidewalk-cv-assets19

Repo for our ASSETS'19 paper applying ResNet to Project Sidewalk data

Qualitative assessment of model performance for paper #14

Closed jonfroehlich closed 5 years ago

jonfroehlich commented 5 years ago

In email, I wrote:

In the paper, for all label types, we will need some visual examples of where our model does super well and where it fails. I can add this as a GitHub request. But there are really three points here: (1) we need to build intimacy with our results and try to understand our model's successes and failures; (2) our readers will be interested in this qualitative analysis too; and (3) I guess the third point is whether you want to hack up a quick review tool that enables us to rapidly assess model performance (simply reviewing things in Google Drive is fine too!).

This will be similar to what we did in the Project Sidewalk paper; however, there we were assessing human labelers rather than automated labelers :) . From that paper:

To better understand labeling errors and to contextualize our quantitative findings, we conducted a qualitative analysis of labeling errors. We randomly selected 54 false positives and 54 false negatives for each label type, which resulted in 432 total error samples from 16 anonymous, 43 registered, and 80 paid workers. A single researcher inductively analyzed the data with an iteratively created codebook. We show the top three errors with examples in Figure 6.

[image: Figure 6 from the Project Sidewalk paper, showing the top three labeling errors with examples]

We might, say, review the ~50 highest and ~50 lowest confidence scores for each label type and assess the issues. (Other qualitative review procedures are also possible.)

galenweld commented 5 years ago

Sure thing. I assume you're talking about pre-crop performance, here?


jonfroehlich commented 5 years ago

Good question. I think both pre-crop and sliding window would be interesting to analyze. Not sure which would be more fruitful...


jonfroehlich commented 5 years ago

This could be a nice task for @aileenzeng as she did this for our CHI'19 paper (and did so in rigorous fashion). I think she could (and should) use a similar analysis method.

But first, we need to determine:

galenweld commented 5 years ago

I'll let @aileenzeng comment on how many to review, because she seems to be the expert.

It doesn't seem like true positives are as useful, so we can exclude them if we're tight on space, but it's easy to get them anyhow.

I certainly think we should do this for pre-crop as well as for the full-scene labeling, although it occurs to me that for the full-scene analysis we may want to do a less structured, more qualitative overall analysis of some labeled panos in their entirety, find some interesting repeated themes, and include examples with comments.

Either way, I'll tweak the model code right now so that it's easy to get as many examples of each type as we want.

jonfroehlich commented 5 years ago

Agree with everything you said Galen. I think the structured analysis (similar to CHI’19) is most relevant to pre-crops. The sliding window can be done more holistically (but with no less rigor).


aileenzeng commented 5 years ago

I think doing ~50 per label type would be doable. Would I be receiving a big folder with all the images, where all I'd have to do is review them? (If that's the case, I could probably do more, because I remember one of the more time-consuming parts for me was individually loading all the label IDs.)

I think I could dedicate ~1.5-2 hrs per day through Thursday, and then maybe 4 hours on Friday-Sunday. I'm not super sure how next week looks, quite yet.

galenweld commented 5 years ago

Awesome, Aileen, thanks so much!

Yes, I would just get you a bunch of folders of individual images to review. No unpleasant loading needed.

jonfroehlich commented 5 years ago

So, @galenweld and I spoke about this in person. If possible, could we do the following (for pre-crops only): randomly select 50 false positives and 50 false negatives for each label type (that will be 4 x 100 = 400 reviews).

@aileenzeng, how did you track this for CHI'19? Using a spreadsheet?

aileenzeng commented 5 years ago

That sounds good! I can write up a quick description of how I plan to label things later tonight.

Yep! Here's the link to it: https://docs.google.com/spreadsheets/d/1_WK8Uof-Pf8ofY0jL8g79qQgBktZfpwhzBM_VMpfcrU/edit?usp=sharing

After I finished labeling all the images, I used a separate spreadsheet to organize all the numbers: https://docs.google.com/spreadsheets/d/1hn00ntV7qQa16e51ETm-PAMrFb5p5Q3zp7_Fs62YmsI/edit?usp=sharing

galenweld commented 5 years ago

Alrighty gang, I wrote code for model training that saves the actual and predicted label for each image in the test set alongside the path to that image.

This makes it easy for us to compute anything we want in terms of error analysis, but for now I wrote code to compute the false positives and false negatives for each label type, and randomly sample as many of those as we want, then copy them to a folder for easy sharing with Google Drive, etc. All this is in commit 328dee3.
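Roughly, the sampling step looks like this (a simplified sketch, not the actual code in commit 328dee3; the CSV name, column names, and directory layout are assumptions):

```python
import csv
import random
import shutil
from pathlib import Path

PREDICTIONS_CSV = "test_predictions.csv"  # assumed columns: path, true_label, pred_label
OUT_DIR = Path("error_samples")
LABEL_TYPES = ["curb_ramp", "missing_ramp", "obstruction", "surface_problem", "null"]
NUM_SAMPLES = 50

with open(PREDICTIONS_CSV) as f:
    rows = list(csv.DictReader(f))

for label in LABEL_TYPES:
    # False positives: predicted as `label`, but the true label is something else.
    fps = [r for r in rows if r["pred_label"] == label and r["true_label"] != label]
    # False negatives: true label is `label`, but predicted as something else.
    fns = [r for r in rows if r["true_label"] == label and r["pred_label"] != label]

    for kind, errors in [("false_pos", fps), ("false_neg", fns)]:
        dest = OUT_DIR / kind / label
        dest.mkdir(parents=True, exist_ok=True)
        # Sample at most NUM_SAMPLES (some label types may have fewer errors than that).
        for r in random.sample(errors, min(NUM_SAMPLES, len(errors))):
            shutil.copy(r["path"], dest / Path(r["path"]).name)
```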

Now that this is in place, all that's needed before @aileenzeng (or anyone else) can start looking over the images is for us to decide which model's results we want to analyze. @jonfroehlich maybe we can discuss this at our meeting tomorrow/this afternoon (Tuesday)?

jonfroehlich commented 5 years ago

Nice job @galenweld!

I think we will want to select the "best performing" model for pre-crops to analyze. But perhaps this is not as straightforward as it sounds?

galenweld commented 5 years ago

I agree - just a question of the tradeoffs between improved performance on some classes with one model, vs improved performance on other classes with another model.


jonfroehlich commented 5 years ago

Cool. Perhaps you can look over our results and come up with some recommendations about what to do before we meet.

aileenzeng commented 5 years ago

@galenweld @jonfroehlich are these images ready for analysis yet? Or should I hold off for now?

galenweld commented 5 years ago

I'll have them for you by the end of the day.


aileenzeng commented 5 years ago

Ok, sounds good! :)

galenweld commented 5 years ago

Hi @aileenzeng, sorry to be slow on this. I've got your images here! They were slow to upload as separate files, so I zipped them together.

Inside, you should find two directories: false_pos and false_neg. Inside each of those directories are 5 directories: curb ramp, missing ramp, obstruction, surface problem, and null. Inside each of those are 100 randomly sampled images.

Assuming I didn't get things mixed up (which is entirely possible; it's late at night, so @jonfroehlich you should double-check me here), the false positives contain images that were erroneously identified as feature x, and the false negatives contain images of feature x that were erroneously identified as something else. So, for example:

false_pos/curb_ramp contains images of things that were mistakenly identified as curb ramps

false_neg/missing_ramp contains images of missing curb ramps that were misidentified as something else.

I know you already know this, I'm just writing it out to double-check my own understanding.

galenweld commented 5 years ago

I also haven't had time to pretty-ify it, but I wrote code to produce the confusion matrix. Tomorrow morning I will go in, color the cells proportionally to the normalized values for each row, and add it to the paper.
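Roughly, the matrix-building step looks like this (a simplified sketch; the example labels below are made up, whereas the real code runs over the saved test-set predictions):

```python
import numpy as np
import pandas as pd

labels = ["curb_ramp", "missing_ramp", "obstruction", "surface_problem", "null"]

# Made-up example predictions, just to illustrate the computation.
true_labels = ["curb_ramp", "null", "obstruction", "curb_ramp", "surface_problem"]
pred_labels = ["curb_ramp", "curb_ramp", "obstruction", "null", "surface_problem"]

# Rows = actual label, columns = predicted label.
cm = pd.DataFrame(0, index=labels, columns=labels)
for t, p in zip(true_labels, pred_labels):
    cm.loc[t, p] += 1

# Normalize each row so a cell shows the fraction of that class predicted as each label;
# these normalized values are what the cell coloring would be based on.
row_norm = cm.div(cm.sum(axis=1).replace(0, np.nan), axis=0)
print(row_norm.round(2))
```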

aileenzeng commented 5 years ago

Great - thanks! I'll try to do a quick review by tomorrow afternoon-ish (maybe 5-10 per label type/category). If you have time to look over my initial review, just to make sure things are going smoothly and that I'm getting the important stuff, that would be nice!

Here's the link to the Google Drive folder that I'll be using that has everything! https://drive.google.com/drive/folders/1ZgRTF28Ue7eQeXvKW6D7zzC_qVDaNJ8E?usp=sharing.

Here's the main spreadsheet (inside the folder): https://docs.google.com/spreadsheets/d/1fN4sfV-0yJDJDUxJMjAA9PP6DAv1h-T8kWYoKu9tQO4/edit?usp=sharing

jonfroehlich commented 5 years ago

I am mobile right now, but I just wanted to double-check that it's 400 images total? 100 per label type, with 50 false positives and 50 false negatives?


galenweld commented 5 years ago

I randomly sampled 100 of each type for a total of 800, but happy to change that to whatever we like.

jonfroehlich commented 5 years ago

I don't think there's any way she can qualitatively analyze 800 images. I also don't think she needs to. I'm mobile right now so I can't look at this thread on GitHub, but I believe I outlined an approach to follow previously.


galenweld commented 5 years ago

My apologies. Earlier, @aileenzeng had proposed 50 images for each of false positives and false negatives, but had said she could do more if she just needed to look at them, which is why I had exported more:

I think doing ~50/label type would be doable. Would I be receiving a big folder with all the images and all I would have to do is review them? (If this is the case, I could probably be able to do more, because I remember one of the more time consuming parts for me was individually loading all the label Ids).

I'll downsample this to 50 per label type, as you requested earlier, and re-upload. My apologies.

jonfroehlich commented 5 years ago

This is what I think we should do: https://github.com/galenweld/project_sidewalk_ml/issues/14#issuecomment-485576408. Copying here again for clarity and convenience. Happy to discuss more, but I want to understand why we would deviate from the study method I proposed:

If possible, could we do the following (for pre-crops only): randomly select 50 false positives and 50 false negatives for each label type (that will be 4 x 100 = 400 reviews).

galenweld commented 5 years ago

Certainly – and again, I'm sorry if I was unclear. I'm by no means proposing any deviation from the method you proposed; I simply uploaded more data than was necessary because I figured it was easier to have more than we need than to go back and get more later.

jonfroehlich commented 5 years ago

Got it.

galenweld commented 5 years ago

@aileenzeng, here's a link to a smaller set of images, with 50 each for false_pos and false_neg per label type.

aileenzeng commented 5 years ago

Sounds good! Thanks :)

jonfroehlich commented 5 years ago

I just started looking through some of these, and I'm a bit confused. For example, here are the false negative "curb ramp" examples--which, based on what "false negative curb ramp" means, should contain actual curb ramps that the CV model did not recognize.

[image: the false negative curb ramp examples]

While I haven't looked through the entire dataset of 50, it seems like many (most?) of these crops do not actually contain curb ramps, so why are they marked as false negatives?

[images: five example crops marked as false negative curb ramps that do not appear to contain curb ramps]

And then for the false positive examples of curb ramps (which should be crops that the algorithm thinks have curb ramps but do not), I am seeing lots of images of actual curb ramps:

[image: an example crop marked as a false positive curb ramp that appears to contain an actual curb ramp]

I'm also seeing crops that are completely black, which likely means that the pano is all black. We should be filtering out all black panos from training and testing and reporting on how many panos were filtered by this. We should also report on how many panos we couldn't actually get access to because they were never scraped, etc.

Interestingly, I'm also seeing examples that have 25-50% of the image filled with a curb ramp and are still being marked as a false positive.

So, some possibilities:

  • there is an error in @galenweld's code that categorizes things as false positives or false negatives and produces these datasets
  • there is something weird going on with our classification algorithms and/or the underlying dataset (e.g., due to noise, or due to how the panos are cropped with @tongning's code)

galenweld commented 5 years ago

Did I just mix up my false positives and false negatives?


jonfroehlich commented 5 years ago

I don't think so... also, what about the black images and the partially correct examples?


galenweld commented 5 years ago

Sorry for the brief response earlier:

For the false positives, I suspect this is primarily because there are lots of actual curb ramps in the panos that are not labeled as such by human labelers (an issue we've been trying to mitigate for a long time). The algorithm recognizes them, but gets "told" that it's wrong (even though it's really right) since there's no human-placed label for, say, a curb ramp at that location. Fundamentally, this is attributable to noise in the dataset.

As for the false negatives, I haven't dug into them in great detail, but I suspect they're also due to errors in the dataset - the dataset is really noisy!

As for black crops - yes, there are some in there; unfortunately, I have not filtered those out.

jonfroehlich commented 5 years ago

Thanks. You should filter black images out. That should be easy: black panos should have comparatively small file sizes relative to regular panos (and, similarly, their crops will be small too). Another way to detect and filter them is to look at the average pixel brightness of each image and drop the super dark ones.
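Something like this would work for the brightness check (a rough sketch; the directory layout and the threshold are assumptions and would need tuning):

```python
import numpy as np
from PIL import Image
from pathlib import Path

BRIGHTNESS_THRESHOLD = 10  # assumed cutoff on a 0-255 grayscale; tune against real crops


def is_black(image_path, threshold=BRIGHTNESS_THRESHOLD):
    """Return True if the image's mean pixel brightness falls below the threshold."""
    gray = np.asarray(Image.open(image_path).convert("L"), dtype=np.float32)
    return gray.mean() < threshold


# Count the near-black crops in a (hypothetical) crops/ directory so we can report the number.
all_crops = list(Path("crops").glob("**/*.jpg"))
black_crops = [p for p in all_crops if is_black(p)]
print(f"{len(black_crops)} of {len(all_crops)} crops are (near-)black")
```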

WRT noise. Could we do this qualitative analysis task on the CV ground truth dataset?

I'm also alarmed by how noisy the pre-crop dataset is. It's far noisier than I would have thought, which suggests we should have spent more time building a qualitative analysis tool, or at least carefully reviewing a subset of classified panos, when running the initial experiments. Given how bad these false positive and false negative examples are, I have a hard time understanding why our performance numbers are so good.

Given this and the needs of our research to better understand performance, I think Aileen will need to analyze 50 true positives of each label type as well. I’m very curious to see if this is just as bad as the false positive and false negative examples.


aileenzeng commented 5 years ago

Sounds good - I could handle doing an additional 50 true positives/label type!

jonfroehlich commented 5 years ago

And to be clear, when I was talking about filtering out black images, I was speaking about completely removing them from our training and test datasets.


galenweld commented 5 years ago

Understood. I will export true positives and get them to Aileen asap for analysis.

As for black images, we should have a conversation about how high a priority this is – I don't think it'll take long to filter them out, but if we need to re-run all of our models on the filtered datasets, that's a slower process.

jonfroehlich commented 5 years ago

Thanks re: true positives.

I'm just disappointed (and surprised) that we never filtered them out. They are so clearly erroneous, so clearly not deserving of being in our dataset, and so easy to filter out. :( While I imagine it's only a small fraction of our dataset, not filtering them just seems weird and doesn't meet my quality standards. But perhaps there is nothing we can do with less than a week until the deadline. At the very least, you should determine how many panos in training and test are black, and how many crops are as well.

galenweld commented 5 years ago

You're 100% right that it's not up to your quality standards, and I sincerely apologize. I'm not trying to make excuses for myself here, but the reason I hadn't removed them beforehand is that I hadn't noticed any black panos or crops when looking over the datasets until now, so it hadn't even occurred to me that this was a concern.

That being said, you're right. I'm very sorry to disappoint you.

galenweld commented 5 years ago

I also wanted to mention (and you noted this above) an idea that occurred to me last night: doing this analysis using crops from the ground-truth labels that Esther and I created, instead of the crowdsourced labels. I suspect that's a better proposition all around; at the end of the day, I think it's not only less noisy but also more informative.

galenweld commented 5 years ago

Hey gang, sorry to be slow on this. I have gone back and re-run our model on the centered crops from yesterday using the ground truth that Esther and I labeled, and sampled true positives, false positives, and false negatives from those results. I've gone through them and they look much, much better. There are lots of interesting things to find here; for example, right off the bat I noticed (unsurprisingly) that among our false negatives for curb ramps, the ones we miss far more frequently are those facing away from the camera, which are harder to detect! These will all be good things to write about in the discussion. @aileenzeng I think it's safe to say you can start looking through these now... here's the link!

Once you've had a chance to review, @aileenzeng let's sync up and work together on making figures, adding to the paper, and writing about it, etc. Let me know if you have any questions!

aileenzeng commented 5 years ago

Great - thanks! I just finished putting all the data in the spreadsheet and noticed that we aren't quite at 50 images per category. I was just curious whether this is okay, or whether it means we'll have to cut down on the number of images I should analyze?

Curb Ramp

Category          Progress   Total Images
True Positive                50
False Positive               50
False Negative               34

Missing Curb Ramp

Category          Progress   Total Images
True Positive                50
False Positive               50
False Negative               50

Obstacle in Path

Category          Progress   Total Images
True Positive                50
False Positive               50
False Negative               44

Surface Problem

Category          Progress   Total Images
True Positive                50
False Positive               17
False Negative               50

Null

Category          Progress   Total Images
True Positive                50
False Positive               19
False Negative               50

galenweld commented 5 years ago

Whoops, I'm so sorry! I had meant to mention this in the previous comment but forgot. I think this is fine: the reason we have fewer than fifty in some cases is that, between the good model performance and the smaller number of examples in the ground-truth labels, we don't have fifty incorrect examples for every label type. I defer to @jonfroehlich for the final say here, but as long as we note it in the paper, I think this analysis is still very valuable.

jonfroehlich commented 5 years ago

@aileenzeng, thanks for checking in about this. I agree with @galenweld. I think it's fine to move forward with this unbalanced dataset for analysis.

aileenzeng commented 5 years ago

Hi all,

I just finished an initial review (~10 images per category). It'd be really great if either of you has time to quickly look over my work to see if anything needs changing, or if there are codes that would be helpful for me to include! Here's the link to the sheet I'm using: https://docs.google.com/spreadsheets/d/1fN4sfV-0yJDJDUxJMjAA9PP6DAv1h-T8kWYoKu9tQO4/edit#gid=1533155948

The main thing I feel uncertain about is the null labels (for true positive, false positive, and false negative), since I'm not sure how to code them. They generally seem to be roads, and I'm not sure what other descriptors would be useful.

There are also cases where I'm unsure what codes to add (sometimes due to miscategorization?). As an example, here's an image from the missing curb ramp false negatives where I can't tell that a missing curb ramp is present. It doesn't happen super often, but I've come across it a few times: [image]

galenweld commented 5 years ago

Hey @aileenzeng I just spent some time looking over your work so far – this looks amazing, and also looks like it was a ton of work – so thank you so much for doing it! I think your approach is spot on, and this analysis is super useful. A couple of miscellaneous notes:

Missing curb ramps are an especially tricky class, because so much of whether a curb ramp is missing depends on context that's often (mostly?) not present within the crop around the tentative feature. For example, if the edge of a crosswalk abuts a curb, the end of the crosswalk will be visible in the crop of the curb, which lets us tell that a curb ramp is missing. However, if the crosswalk isn't marked with paint in the roadway, then just from a crop of the curb it isn't really possible to say whether a curb ramp is missing or not; that depends on what's across the street, and more, none of which is visible within the frame. This is what makes the problem so challenging for CV systems, and we'll definitely be discussing it in the paper. So if you're looking at false negatives and true positives for missing curb ramps, keep an eye out for any identifying features that may be 'give-aways' that this is indeed a missing curb ramp.

For crops that are low resolution, too tight or too wide, or not perfectly centered, I wouldn't worry too much. Finite resolution and imperfect depth data lead to imperfect and limited cropping; this is just another challenge of the CV approach.

For curb ramps in particular, I notice you've got a 'tricky angle' tag. One thing that I think would be useful is to qualitatively assess which angles seem to be the trickiest, so perhaps that's something you could keep track of?

Perhaps you and I can find some time in the next day or two to sit down together and work on actually getting this slotted into the paper – that would be great!

jonfroehlich commented 5 years ago

Wow, I agree that this is an amazing amount of work--thanks Aileen. I also agree that this is a fairly complicated analysis--seemingly more so than analyzing human labeler mistakes (would you agree?). The key, to me, just like in the CHI'19 paper is for us to be able to synthesize your qualitative analysis (the per image codes/tags) into higher-level themes (e.g., the top three most common false positive/false negative mistakes for curb ramps, for obstacles, for surface problems, etc.).

If you do find mistakes in the false positive/false negative categorization, please go over these with Galen. It could be that they made a mistake in their ground truth data production or, perhaps, something else is going on.


aileenzeng commented 5 years ago

@galenweld Sounds good! I will try to wrap up the analysis today and maybe we can meet up tomorrow?

galenweld commented 5 years ago

Sounds great. Are you free tomorrow at 4:30?


aileenzeng commented 5 years ago

Hmm, I have something that I need to leave for at 5:15, so I could meet up for ~45 min! (Earlier would probably be better if we need longer)

aileenzeng commented 5 years ago

Hi all,

Quick update: the first part of the analysis is done -- I finally compiled all the codes together and listed their % occurrences, so we have the most frequent codes per label/category (although numbers are still tentative because there are some potential false negative miscategorizations that we still need to talk about). There's a sheet called "analysis" in the google doc that has all the results.
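In code terms, that aggregation is roughly the following (a sketch only; the CSV name and column names are assumptions, since the real work was done directly in the spreadsheet):

```python
import pandas as pd

# Assumed export of the coding spreadsheet: one row per reviewed image,
# with columns label_type, category (TP/FP/FN), and code.
df = pd.read_csv("qualitative_codes.csv")

# Percentage of each code within every label type / category pair,
# i.e. the most frequent codes per label/category.
code_pcts = df.groupby(["label_type", "category"])["code"].value_counts(normalize=True)
print((code_pcts * 100).round(1).rename("percent"))
```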

The next thing to do after we review the miscategorizations / finalize the numbers is probably to find good images to go along with each of the codes. I'll probably start another document for that at some point.

I'm also kind of time-crunched by some other schoolwork at the moment, so I'm still figuring out ways to make things work! Trying to get this all done by tonight, hopefully.