Closed by galenweld 5 years ago
This is important, I think. We need to get the pano scraper working for Newberg (and Seattle). @tongning, can you help us with this? After that, we should be able to integrate the Newberg data rather seamlessly. At the very least, we could train on the DC dataset and test on the Newberg dataset. We could also try to integrate some portion of the Newberg dataset (~50%) into training as well and then test on the rest.
We now have 6,113 labels and 7,545 validations in Seattle and 4,912 labels and 695 validations in Newberg.
Given that these datasets might be a bit noisy (unclear), we could do one of the following (note: this is ordered by increasing expected quality, which is inversely proportional to data amount):
Here's an idea - based on the conversation @jonfroehlich and I had in person at DUB today....
What if, instead of doing two different tests with the Newberg and Seattle data (one using all labels and one using just the researcher labels), we used the researcher labels as the "test" set and the non-researcher labels as the "train" set? This way the test set is (hopefully) higher quality and can serve as a pseudo-ground-truth set, whereas the train set of non-researcher labels is noisier, similar to our DC data.
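To make the split concrete, here's a minimal sketch of what I have in mind; the CSV path and the `user_role` column are hypothetical stand-ins for whatever the actual label export looks like:

```python
import pandas as pd

# Hypothetical export; the real dump's filename and schema may differ.
labels = pd.read_csv("seattle_labels.csv")

# Researcher labels become the (hopefully higher-quality) test set,
# and everything else becomes the noisier training set.
test_labels = labels[labels["user_role"] == "Researcher"]
train_labels = labels[labels["user_role"] != "Researcher"]

print(f"{len(train_labels)} train labels, {len(test_labels)} test labels")
```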
Honestly, I really don't trust the non-researcher labels yet. Need more time to qualitatively explore them. :-/ If our validation interface results are any indication, I'd say we are looking at around ~70% accuracy in Seattle (not sure about Newberg).
The experiments I would most like to see:
On the Newberg and Seattle data, I've finished running crops (using just researcher labels for now), partitioned into an 80/20 test and val set for both cities. I uploaded these crops to the VM tonight and ran them through our model to see how they perform. I would like to discuss with you tomorrow our strategies for what experiments we would like to run (in terms of training new models using the data). Since we don't have that much time, I want to make sure we use the time we have as efficiently as possible; I have some ideas on this front as well that are probably easier to discuss in person.
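For reference, the 80/20 partition itself is just a shuffled file-level split; a minimal sketch (the crop directory layout here is a hypothetical stand-in for however the crops are actually stored):

```python
import random
from pathlib import Path

# Hypothetical directory; the real crops may be organized per label type, etc.
crop_paths = sorted(Path("crops/newberg").glob("*.jpg"))

random.seed(0)  # fixed seed so the partition is reproducible
random.shuffle(crop_paths)

cutoff = int(0.8 * len(crop_paths))
split_80, split_20 = crop_paths[:cutoff], crop_paths[cutoff:]
print(f"{len(split_80)} crops in the 80% split, {len(split_20)} in the 20% split")
```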
The good news is that we do quite well on pre-cropped prediction, where we take our DC model and apply it, unmodified, to the new city's labels. Here's our precision and recall for Newberg:
| Label        | Precision | Recall | Num  |
|--------------|-----------|--------|------|
| sfc_problem  | 41.05%    | 71.51% | 186  |
| null         | 98.17%    | 31.80% | 2195 |
| ramp         | 41.13%    | 95.62% | 754  |
| obstruction  | 29.06%    | 72.66% | 128  |
| missing_ramp | 44.75%    | 73.06% | 245  |
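(For anyone reproducing numbers like these: per-label precision/recall can be computed with scikit-learn. A minimal sketch, where `y_true`/`y_pred` would come from running the DC model over the new city's pre-crops; the short lists below are placeholders just so the snippet runs:)

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder predictions; in practice these come from the model's output
# on each pre-crop, paired with the crop's ground-truth label type.
y_true = ["ramp", "null", "ramp", "sfc_problem", "missing_ramp", "obstruction"]
y_pred = ["ramp", "ramp", "ramp", "sfc_problem", "null", "obstruction"]

classes = ["sfc_problem", "null", "ramp", "obstruction", "missing_ramp"]
precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0
)

for cls, p, r, n in zip(classes, precision, recall, support):
    print(f"{cls:>14}  p={p:6.2%}  r={r:6.2%}  num={n}")
```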
This is really encouraging, exciting, and a good result for the paper! It's interesting to note that our accuracy for all feature types is not woefully different from our accuracy in DC, except for null_crops. Now, null_crops are not as critical to the precrop/validation task as they are to full scene labeling, but it does pique my curiosity. On Seattle, we don't do quite as well:
| Label        | Accuracy |
|--------------|----------|
| sfc_problem  | 5.15%    |
| null         | 24.70%   |
| ramp         | 34.62%   |
| missing_ramp | 5.91%    |
| obstruction  | 57.47%   |
| overall      | 57.47%   |
However, I'm hesitant to dismiss the model based on this performance: when I look at the crops from the Seattle labels, they're absolutely atrocious, with lots of garbage labels everywhere. I need to dig further to see whether this is an issue with the labels, an issue with my cropping code, or something else. I'm somewhat perplexed, as the code I used is identical to the code used for the Newberg crops, which worked just fine.
Wow, this is a roller coaster ride--I'm amazed with the Newberg performance. To be clear, of the three experiments I mentioned in https://github.com/galenweld/project_sidewalk_ml/issues/18#issuecomment-486422706, you ran the first one (train on DC, test on Seattle and Newberg), right?
Also, to help me understand these numbers, it's really essential that you add the number of pre-crops for each label type (it should be an additional column in your table). I also want to know how many panos we are using for each city and whether we are using only researcher labels (or, if not, which labels we are using).
We need to do more digging into why Seattle is performing so poorly. Is it the quality of data as you suspect? If so, why? Is the cropper broken for Seattle?
Again, this comes down to building better tools to help us qualitatively analyze performance results--something we've been talking about since https://github.com/galenweld/project_sidewalk_ml/issues/6 (and which is even mentioned in one of our first GitHub issues: https://github.com/galenweld/project_sidewalk_ml/issues/2).
Your understanding of the experiment is correct. This is taking a model trained only on DC data and applying it directly to Newberg/Seattle data.
I'll add the number of pre-crops to the above table – sorry for not doing that earlier.
We are using just researcher labels for now. Using just these labels gets us 1334 panos for Newberg, and 457 for Seattle.
I agree about more digging for Seattle. It's hard to tell if the cropping tool is broken without having another way of visualizing the crowdsourced labels. @misaugstad or @jonfroehlich, is there some way of viewing the labels that have already been placed within a pano using the Project Sidewalk interface? How have you been doing this for previous papers?
Hmm, that's not very much Seattle data. How old is your data dump? I think we should have used the most recent dump possible....
To view labels, go to Admin -> Contributions. Look at the 'GSV' column on far right.
Thanks, I'll poke around with this interface.
My Seattle dump was given to me by Mikey the morning of Sunday Apr 21. (iirc the morning he started using Turkers).
Hmm, ok, that's pretty recent. I'm surprised there are so few researcher contributions--but I guess we've all been so busy (and I think I did more labeling in Newberg than Seattle, for example).
I want to update this issue on two points - firstly, Seattle performance.
@infrared0 and I spent an hour last night debugging the Seattle crops, and we can confirm that the issue is not noisy data; it's an issue with our cropper. The labels appear correct when viewed interactively in the Project Sidewalk admin panel, but when we attempt to create crops from them, we end up with crops from the wrong section of the panorama. We loaded things into a Jupyter notebook and were able to isolate and reproduce the problem, but we haven't been able to find a fix.
We've consulted with @misaugstad, and, I have to say, we're pretty stumped by this. The coordinate systems are (or rather, should be) the same, the labels were exported from the Project Sidewalk database using the exact same format as the Newberg labels, and the crops were made using the exact same code as the Newberg and DC crops, both of which worked great – we just can't think of any reason why they should be different for Seattle!
I'm going to keep poking around on this today, but I also don't think it's critical enough to prioritize over other aspects – like training models using the Newberg data, which is what I'm about to do. @jonfroehlich if you have different thoughts, please let me know.
The second update (and sorry for being slow on this) - I recomputed the table for the Newberg results presented above, including precision and recall as well as total number of true labels.
| Label        | Precision | Recall | Num  |
|--------------|-----------|--------|------|
| sfc_problem  | 41.05%    | 71.51% | 186  |
| null         | 98.17%    | 31.80% | 2195 |
| ramp         | 41.13%    | 95.62% | 754  |
| obstruction  | 29.06%    | 72.66% | 128  |
| missing_ramp | 44.75%    | 73.06% | 245  |
Thanks for the informative update. Kinks like this always happen :). Can you please get Anthony to investigate? I’m at a birthday party with the kids right now.
Aaaand.... we have some really cool results! I still need to run the combined Newberg-and-Seattle model, but so far I've run our DC model on the Newberg data, as well as trained two models on Newberg data: one initialized with ImageNet weights, just like we've been doing for the DC models, and one initialized with the weights from our DC model, which we're using to 'pretrain' the Newberg model.
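For concreteness, the two initializations look roughly like this in PyTorch; the architecture (a torchvision ResNet-18 stand-in), checkpoint path, and function name are all assumptions for illustration, not our actual training code:

```python
import torch
import torchvision.models as models

NUM_CLASSES = 5  # ramp, missing_ramp, obstruction, sfc_problem, null

def build_model(init="imagenet", dc_checkpoint="dc_model.pt"):
    # Initialization 1: ImageNet weights, as we've been doing for the DC models.
    model = models.resnet18(pretrained=(init == "imagenet"))
    model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)

    # Initialization 2: start from the DC model's weights ("pretraining" on DC),
    # assuming the checkpoint is a plain state dict with a matching 5-way head.
    if init == "dc":
        model.load_state_dict(torch.load(dc_checkpoint, map_location="cpu"))
    return model

# model = build_model(init="dc")  # then fine-tune on the Newberg crops as usual
```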
There's lots to discuss here, including how we want to present this and how we want to incorporate it into the narrative, but I put together a quick graph showing our precision and recall for the three models, above.
You can see that even a small amount of training makes an enormous difference, especially in our null-crop performance, which is interesting: it suggests that the "background imagery" in the streetscape, the stuff that isn't pedestrian infrastructure, is more city-specific than the actual infrastructure. Of course, this is still really important to know - we need to be able to distinguish the not-infrastructure from the infrastructure!
However, notice in the recall graph how Surface Problem recall drops when we train from scratch. This suggests that (presumably because it's a much bigger dataset) we have a better idea of what a Surface Problem is in the DC data than in Newberg. You can see that when we initialize with the DC weights, Surface Problem recall goes back up. Neat!
I spent a few hours looking into the Seattle crop issue and I think I found the culprit. The scraper assumes that the GSV images have dimensions 13312x6656, which seems to be true of DC and Newberg. But most Seattle panos have higher resolution with dimensions 16384x8192. This means there are parts of the panorama (on the right/bottom) that we didn't download.
I redownloaded some Seattle panos at the proper resolution and the crops seem much better. (However, it's difficult for me to confirm that they're actually "correct" since I don't have any label IDs that I can look up in the admin interface.)
In any case, I'm going to modify the scraper to auto-detect the proper resolution and re-run the scrapes for Seattle. I'm hoping the scrapes can finish within 24 hours; would that be enough turnaround @galenweld?
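To sketch the idea (not the scraper's actual code; the helper and its arguments are illustrative), the key change is to read each panorama's real dimensions rather than hard-coding 13312x6656 when mapping label positions to pixels:

```python
from PIL import Image

def label_to_pixels(frac_x, frac_y, pano_path):
    """Convert a label's fractional position in the panorama to pixel
    coordinates using the pano's actual size (e.g. 16384x8192 for most
    Seattle panos) instead of an assumed 13312x6656."""
    with Image.open(pano_path) as pano:
        width, height = pano.size  # auto-detected from the stitched image
    return int(frac_x * width), int(frac_y * height)
```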
wow this is some really really good sleuthing!!
Dang, good find Anthony! I checked the resolution of the panorama scrapes I had, but never thought to check if the scrapes were correct. Thanks so much, and yes, whenever you can get those to me that would be great!
Scraper is still going, seems to be on track to finish late tonight/very early tomorrow morning. It looks like there are a lot more Seattle panos now. Will try to upload as soon as it finishes.
We were able to do this. Yay!
This is mentioned briefly in #16, but I wanted to make sure it has its own discussion thread here, if needed.
It would be great to incorporate the Newberg data either for training or, most likely more usefully, as an additional topic to discuss in the paper: a proof of concept demonstrating that the system works in more cities than just the one from which its training examples were produced. This could also include a potential mention of the cities used in the Tohme dataset (LA, Baltimore, and Saskatoon) as well.