ProjectSidewalk / SidewalkWebpage

Project Sidewalk web page
http://projectsidewalk.org
MIT License

Periodically scraping panoramas with their metadata on the server #539

Closed: manaswisaha closed this issue 7 years ago

manaswisaha commented 7 years ago

Due to recent updates by Google, a lot of the pano ids in the database no longer have imagery or the associated metadata (including depth data). Hence, we need to systematically scrape the data periodically as we get labels on new panoramas.

One solution is to run a scraper every day at a specific time. Once we have this data, we can have the website fall back to this repository whenever no data is available from Google.

Todo: need to automatically find imagery for a specific location when its pano ids have no data available.
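
For what it's worth, a minimal sketch of how that availability check might look, assuming the public Street View Image Metadata endpoint; the endpoint usage, helper names, and API key are assumptions, not this project's scraper code:

```python
# Minimal sketch, not the project's scraper: check whether a pano id still
# resolves to imagery, or find the pano nearest a labeled location, using the
# public Street View Image Metadata endpoint. The API key is a placeholder.
import requests

METADATA_URL = "https://maps.googleapis.com/maps/api/streetview/metadata"


def pano_has_imagery(pano_id, api_key):
    """Return True if Google still reports imagery for this pano id."""
    resp = requests.get(METADATA_URL, params={"pano": pano_id, "key": api_key}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("status") == "OK"


def nearest_pano_for_location(lat, lng, api_key):
    """Return the pano id nearest to (lat, lng), or None if no imagery is available."""
    params = {"location": "{},{}".format(lat, lng), "key": api_key}
    resp = requests.get(METADATA_URL, params=params, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    return data.get("pano_id") if data.get("status") == "OK" else None
```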

Related issue: #537

jonfroehlich commented 7 years ago

@tongning Could you update this issue with the current status since you worked on it? Then we will have @maddalihanumateja take over while you're away on your summer internship.

tongning commented 7 years ago

The scraping code I've been using, along with instructions, is available at https://github.com/tongning/sidewalk-panorama-tools. I haven't used it extensively since the API update, however, so it will be important to monitor how many of the downloads fail and to see whether it behaves differently for newer imagery.

jonfroehlich commented 7 years ago

This is also related to https://github.com/ProjectSidewalk/SidewalkWebpage/issues/633

maddalihanumateja commented 7 years ago

I've set up an AWS instance with the pano scraper scheduled for daily runs.
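
For reference, a minimal sketch of the kind of crontab entry that schedules a daily run; the interpreter, script name, and paths are hypothetical placeholders, not the actual instance configuration:

```
# Hypothetical crontab entry: run the scraper every day at 03:00 and append
# output to a log file (interpreter, script, and log paths are placeholders).
0 3 * * * /usr/bin/python /home/ubuntu/sidewalk-panorama-tools/scrape_panos.py >> /home/ubuntu/pano_scraper.log 2>&1
```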

jonfroehlich commented 7 years ago

Thanks @maddalihanumateja. Have you contacted UMIACS about refreshing the credit card? If not, can you start a JIRA ticket (email staff@umiacs)?

Did you fix the depth scraper issue?

maddalihanumateja commented 7 years ago

  1. Will contact UMIACS today. Do I need to ask them for an amount that will keep the scraper running for a specific period of time, until September maybe?
  2. The depth scraper issue still needs to be looked at. We're getting the depth maps from Google, but the offline conversion process still isn't working. I'll hopefully be able to fix this today.
  3. There's another issue with the code that actually creates the panorama, which I'm more focused on. Stack Overflow suggested some fixes, which I tried, but the problem persists. Again, I'm looking into this.

jonfroehlich commented 7 years ago

Re: 1. Don't we use the same account for MTurk? If so, we will likely want to put more money on it. Regardless, I think estimating through Oct or even Dec is best.

Do we have clear instructions posted about: (i) how to setup the scraper from scratch on AWS and (ii) how to access the current scraper on AWS and the data?

maddalihanumateja commented 7 years ago

Yes, we are using the same account for MTurk. Anthony's repo has instructions for the setup; these aren't specific to AWS/DigitalOcean. I've added documentation to my list of things to do.

jonfroehlich commented 7 years ago

Thanks

tongning commented 7 years ago

Out of curiosity, could you summarize what the errors are? I'd be interested to know whether they're errors I've encountered.

maddalihanumateja commented 7 years ago
  1. With some panoramas I'm getting an error from the PIL library function Image.open (IOError: cannot identify image file). (Screenshot attached: pano_scraper_error.) See the sketch after this list.

  2. decode_depthmap terminates and doesn't generate a .depth.txt for old GSV imagery (when we print "Panorama %s is an old image and does not have the tiles for zoom level"). I'm assuming it's because decode_depthmap wasn't written to handle these. (I checked the number of these old images; in a single trial they made up roughly 4% of the images.)
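
A minimal sketch of the kind of guard that would surface error 1 without killing the whole run; this is a hypothetical helper, not the scraper's actual code:

```python
# Minimal sketch: wrap PIL's Image.open so an unreadable or empty download is
# logged and skipped instead of raising IOError mid-run.
import logging
from PIL import Image


def open_pano_safely(path):
    """Return a PIL Image, or None if the file at `path` is not a valid image."""
    try:
        return Image.open(path)
    except IOError as err:
        # "cannot identify image file" typically means the download was empty
        # or an error page rather than JPEG data.
        logging.warning("Skipping %s: %s", path, err)
        return None
```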

maddalihanumateja commented 7 years ago

Figured out one of the issues. It looks like we have 36 images with an empty string for the pano id ({"properties":{"gsv_panorama_id":""}}). This was causing the IOError when downloading panoramas.
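
A minimal sketch of a pre-download filter, assuming the label records have already been loaded as dicts shaped like the example above; the helper is hypothetical, not the scraper's code:

```python
# Minimal sketch: drop records whose gsv_panorama_id is empty before handing
# pano ids to the downloader.
def valid_pano_ids(label_records):
    return {
        rec["properties"]["gsv_panorama_id"]
        for rec in label_records
        if rec.get("properties", {}).get("gsv_panorama_id")
    }
```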

maddalihanumateja commented 7 years ago

Figured out the second issue. The problem wasn't exclusive to the old images; it just looked like that for the small sample I was testing with. The actual cause seems to be discarded pano ids: no depth map is generated because there isn't anything in the image. For example: http://maps.google.com/cbk?output=tile&zoom=3&x=7&y=1&cb_client=maps_sv&fover=2&onerr=3&renderer=spherical&v=4&panoid=YJhxDce6eAvmnDeMqmL8yg . I'm still trying to get a count of the number of discarded pano ids; it seems the termination statements weren't piped to my log file. I do have an estimate from the 656 panos that were on stdout: there were 56 blanks, so we can round it off to 10% for now.
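
A minimal sketch of one way to flag these discarded panos up front, under the assumption that a discarded pano's tiles come back as a single uniform colour; that assumption is mine and hasn't been verified against the actual imagery:

```python
# Minimal sketch: treat a downloaded tile as blank if every colour band has a
# single constant value (assumes blank tiles are uniform, which may not hold).
from PIL import Image


def tile_is_blank(tile_path):
    extrema = Image.open(tile_path).convert("RGB").getextrema()
    return all(lo == hi for lo, hi in extrema)
```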

maddalihanumateja commented 7 years ago

For all practical purposes the pano scraper is now up and running on AWS. I still need to update the estimate of how many pano ids for which we have labels were discarded by Google. The easiest way to do this: if I had a histogram of the sizes of all the pano files, the blanks would all be at the lower end.
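
A minimal sketch of that histogram, with the panorama directory and file extension as hypothetical placeholders for wherever the scraped imagery is stored:

```python
# Minimal sketch: histogram the on-disk size of every scraped panorama; blank
# panos should cluster at the low end of the distribution.
import os
import matplotlib.pyplot as plt

pano_dir = "/path/to/scraped_panos"  # placeholder for the actual pano directory
sizes = [
    os.path.getsize(os.path.join(pano_dir, name))
    for name in os.listdir(pano_dir)
    if name.endswith(".jpg")
]

plt.hist(sizes, bins=50)
plt.xlabel("File size (bytes)")
plt.ylabel("Number of panoramas")
plt.savefig("pano_size_hist.png")
```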

maddalihanumateja commented 7 years ago

(Attached: pano_size_hist — histogram of panorama file sizes.)

This is a good proxy for counting the blank images. The dropped pano count is still close to 10% of all panos we have labeled. I have a script running to count the actual number of terminations, but I doubt that will give an extremely different result (the final count I got was 1,933, assuming no new panoramas were added today).

Higher priority task: documentation for setting up the pano scraper on a VM. (Is this even required given the instructions in tongning's repo? Reconsidering this.)

Low priority tasks:

  1. Which locations are associated with these dropped panos?
  2. Is the dropped-pano issue related to the age of the pano? Is it only a problem if we try to download panos a month after a label was placed? (We could tune the frequency of pano downloading based on this, but it doesn't seem necessary.)

jonfroehlich commented 7 years ago

Hi Teja.

Thanks for this. Are you saying that we are missing 1,933 panoramas? :( Out of how many total? There is literally no way to retrieve these panoramas?

Re: low priority tasks. Please create new tickets for these and start exploring. We need a robust scraper so that we can use and share the data for future computer vision/machine learning projects.

maddalihanumateja commented 7 years ago

1,933 out of roughly 23,000, so a little less than 10%. It doesn't seem like there's anything we can do to recover the original panos; we just need to keep scraping every day or on each labeling event.

jonfroehlich commented 7 years ago

When did the panos expire? Do we know? If not, when did we get labels for those panos (that should be a reasonable proxy)?

maddalihanumateja commented 7 years ago

I haven't looked at the spatial distribution of these panoramas or the temporal distribution of the labels associated with the unavailable panoramas. I'll create issues for these, but they should be low priority for now.