manaswisaha closed this issue 7 years ago
@tongning Could you update this issue with the current status since you worked on it? Then we will have @maddalihanumateja take over while you're on summer internship.
The scraping code I've been using, along with instructions, is available at https://github.com/tongning/sidewalk-panorama-tools. I haven't used it extensively since the API update, however, so it will be important to monitor how many of the downloads fail and see if it behaves any differently for newer imagery.
This is also related to https://github.com/ProjectSidewalk/SidewalkWebpage/issues/633
I've set up an AWS instance with the pano scraper scheduled for daily runs.
Thanks @maddalihanumateja. Have you contacted UMIACS about refreshing the credit card? If not, can you start a JIRA ticket (email staff@umiacs)?
Did you fix the depth scraper issue?
Re: 1. Don't we use the same account for MTurk? If so, we will likely want to put more money on there. Regardless, I think estimating for Oct or even Dec is best.
Do we have clear instructions posted about: (i) how to setup the scraper from scratch on AWS and (ii) how to access the current scraper on AWS and the data?
Yes, we are using the same account for MTurk. Anthony's repo has instructions for the setup; these aren't specific to AWS/DigitalOcean. I've added documentation to my list of things to do.
Thanks
Out of curiosity, could you summarize what the errors are? Would be interested to know whether they're errors I've encountered.
With some panoramas I'm getting an error from the PIL library function Image.open (IOError: cannot identify image file).
decode_depthmap terminates and doesn't generate a .depth.txt for old GSV imagery (when we print "Panorama %s is an old image and does not have the tiles for zoom level"). I'm assuming it's because decode_depthmap wasn't written to handle these. (I checked the number of these old images; in the one trial I ran, they were roughly 4% of the images.)
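The first error can be guarded against by checking whether PIL can actually decode the downloaded bytes before processing them. A minimal sketch, assuming Pillow is installed (the helper name is made up, not from the scraper repo):

```python
from io import BytesIO

from PIL import Image  # Pillow


def try_open_tile(data):
    """Return a PIL Image for the raw tile bytes, or None when the bytes
    are not a decodable image (e.g. an error page or an empty body),
    instead of letting IOError propagate out of Image.open."""
    try:
        return Image.open(BytesIO(data))
    except IOError:  # Pillow raises an IOError/OSError subclass here
        return None
```

Downloads that come back as None can then be logged and retried instead of crashing the whole run.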
Figured out one of the issues. It looks like we have 36 images with an empty string for the pano id ({"properties":{"gsv_panorama_id":""}}). This was causing the IOError when downloading panoramas.
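A simple pre-filter on the label records avoids feeding those empty ids to the downloader in the first place. A sketch, assuming records shaped like the snippet above (the function name is hypothetical):

```python
def valid_pano_ids(records):
    """Yield non-empty GSV panorama ids from label records, skipping
    the {"gsv_panorama_id": ""} entries that break the downloader."""
    for rec in records:
        pano_id = rec.get("properties", {}).get("gsv_panorama_id", "")
        if pano_id:
            yield pano_id


sample = [
    {"properties": {"gsv_panorama_id": ""}},  # would have caused the IOError
    {"properties": {"gsv_panorama_id": "YJhxDce6eAvmnDeMqmL8yg"}},
]
```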
Figured out the second issue. The problem wasn't exclusive to the old images; it just looked that way for the small sample I was testing with. The actual cause seems to be discarded pano ids: no depth map is generated because there isn't anything in the image. For example: http://maps.google.com/cbk?output=tile&zoom=3&x=7&y=1&cb_client=maps_sv&fover=2&onerr=3&renderer=spherical&v=4&panoid=YJhxDce6eAvmnDeMqmL8yg . I'm still trying to get a count of the number of discarded pano ids; it seems the termination statements weren't piped to my log file. I do have an estimate from the 656 panos that were on stdout: there were 56 blanks, so we can round it off to 10% for now.
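Once the scraper's stdout is actually captured to a file, counting the discarded panos is just a substring match against the termination message quoted earlier. A sketch (the marker text is abbreviated from that print statement):

```python
def count_discarded(log_lines, marker="does not have the tiles for zoom level"):
    """Count scraper log lines reporting a pano with no imagery at the
    requested zoom level, i.e. one termination message per discarded pano."""
    return sum(1 for line in log_lines if marker in line)
```

On the 656-pano stdout sample this would have reported the 56 blanks directly (about 8.5%, consistent with the ~10% estimate).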
For all practical purposes the pano scraper is now up and running on AWS. I need to update the estimate of how many pano ids were discarded by Google for which we have labels. Easiest way to do this: if I had a histogram of the file sizes of all the panos, the blanks would all be on the lower end.
This is a good proxy for counting blank images. The dropped pano count is still close to 10% of all panos we have labeled. I have a script running to count the actual number of terminations, but I doubt it will give a very different result (the final count I got was 1,933, assuming no new panoramas were added today).
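The file-size proxy can be scripted directly: blank tiles compress to almost nothing, so anything under a small threshold is almost certainly a blank. A sketch, assuming the panos are stored as .jpg files in one directory (the threshold is an assumption and should be tuned against the real histogram):

```python
import os

BLANK_THRESHOLD = 10 * 1024  # bytes; assumed cutoff, tune against the histogram


def count_blank_panos(pano_dir):
    """Count downloaded panorama files small enough to be blank.
    Blank/discarded panos cluster at the low end of the size histogram."""
    blanks = 0
    for name in os.listdir(pano_dir):
        if name.endswith(".jpg"):
            path = os.path.join(pano_dir, name)
            if os.path.getsize(path) < BLANK_THRESHOLD:
                blanks += 1
    return blanks
```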
Higher-priority task: documentation for setting up the pano scraper on a VM. (Is this even required given the instructions in tongning's repo? Reconsidering this.)
Low priority tasks:
Hi Teja.
Thanks for this. Are you saying that we are missing 1,933 panoramas? :( Out of how many total? Is there literally no way to retrieve these panoramas?
Re: low priority tasks. Please create new tickets for these and start exploring. We need a robust scraper so that we can use and share the data for future computer vision/machine learning projects.
1,933 out of roughly 23,000, so a little less than 10%. It doesn't seem like there's anything we can do to recover the original panos; we just need to keep scraping every day or on each labeling event.
When did the panos expire? Do we know? If not, when did we get labels for those panos (that should be a reasonable proxy)?
I haven't looked at the spatial distribution of these panoramas or the temporal distribution of the labels associated with the unavailable panoramas. I'll create issues for these, but they should be low priority for now.
Due to recent updates by Google, a lot of the pano ids in the database no longer have imagery or the associated metadata (including depth data). Hence, we need to systematically scrape the data periodically as we get new labels on new panoramas.
One solution is running the scraper every day at a specific time. Once we have this data, we can link the website to this repository whenever no imagery is available.
Todo: automatically find the imagery for a specific location for pano ids that have no data available.
Related issue: #537
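On the AWS instance the daily run is most naturally a cron entry invoking the scraper script, but the scheduling logic can also be sketched in Python so it's testable. A minimal sketch (the run hour is an arbitrary example, not from this thread):

```python
import datetime


def seconds_until_next_run(now, hour=3):
    """Seconds from `now` until the next occurrence of `hour`:00, i.e.
    how long a simple daily-scheduler loop should sleep before invoking
    the scraper again."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:  # today's run time already passed; aim for tomorrow
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()
```

A driver would loop forever: sleep for `seconds_until_next_run(datetime.datetime.now())`, run the download script, repeat.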