UoA-eResearch / populartimes

Python + Selenium scraper to extract Google Maps places and popular times. Leaflet/plotly JS to display it. For I-Ting Chuang
https://uoa-eresearch.github.io/populartimes/
MIT License
14 stars 2 forks

Applying the code for different geolocations #1

Closed a-gregoriades closed 1 year ago

a-gregoriades commented 2 years ago

Hi Nick,

Thanks for the code. I managed to run it using your data in the data.geojson and locations.csv files.

However, since I would like to tailor the code to collect popular times data for my own geographic area, I am not sure how to generate a file such as your data.geojson, which I believe stores the URL links for the scraper.

thanks

neon-ninja commented 2 years ago

Hi @a-gregoriades,

The Python scripts that generate data.geojson are scrape.py and util.py. These scripts use Selenium (headless browser automation via chromedriver). Essentially, they load up Google Maps, search for strings like "place of interest in North Cape, Far North District, Northland Region", and then extract every listing on every page of the search results. As Google only lets you see about ~320 places per search, I found it necessary to break my country of interest (New Zealand) down into ~2K suburbs and run a search within each suburb. As I ran this single-threaded (so as not to risk triggering any temporary IP bans), it took a long time (~1 week). I extracted the names of each suburb from the Stats NZ SA2 dataset (https://datafinder.stats.govt.nz/layer/105161-statistical-area-2-2021-clipped-generalised/), and output them as locations.csv.
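For reference, the per-suburb search string could be composed like this (a minimal sketch; `build_query` is my name for it, not a function from the repo):

```python
def build_query(suburb: str) -> str:
    """Compose the Google Maps search string for one suburb."""
    return f"place of interest in {suburb}"

# One query per SA2 suburb name taken from locations.csv
print(build_query("North Cape, Far North District, Northland Region"))
```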

scrape.py reads in locations.csv and uses that as its list of locations to scrape. scrape.py also records in locations.csv the time each location was scraped, and already-scraped locations are skipped when scrape.py is re-run, allowing for resumability. Results are stored in data.geojson.
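That resumability logic can be sketched roughly like this (function names are mine; scrape.py may implement it differently):

```python
import csv
import datetime
import io

def pending_locations(csv_text: str) -> list:
    """Return location names whose scraped_at column is still empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["name"] for row in reader if not row["scraped_at"]]

def mark_scraped(csv_text: str, name: str) -> str:
    """Rewrite the CSV with a timestamp recorded for one scraped location."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["name"] == name:
            row["scraped_at"] = datetime.datetime.now().isoformat()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "scraped_at"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```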

Once these locations have been scraped, updating the popular times data can be done much more efficiently with update_populartimes_data.py.
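I haven't reproduced update_populartimes_data.py here, but conceptually a refresh only has to swap out the populartimes property on existing GeoJSON features rather than re-discover the places, something like (a sketch, not the script's actual code):

```python
def update_feature(geojson: dict, name: str, populartimes: list) -> dict:
    """Overwrite the populartimes property of a named feature in data.geojson,
    leaving the geometry and other properties intact."""
    for feature in geojson["features"]:
        if feature["properties"].get("name") == name:
            feature["properties"]["populartimes"] = populartimes
    return geojson
```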

So, to scrape a new set of locations, you need to replace the contents of locations.csv with a list of your locations of interest, delete data.geojson, and run scrape.py. For example, if you set the contents of locations.csv to:

name,scraped_at
"Paris, France",

and ran scrape.py, it should output something like:

Have 1 locations
  0%|          | 0/1 [00:00<?, ?it/s]

====== WebDriver manager ======
Current google-chrome version is 91.0.4472
Get LATEST driver version for 91.0.4472
Driver [/home/nyou045/.wdm/drivers/chromedriver/linux64/91.0.4472.101/chromedriver] found in cache
place of interest in Paris, France
  0%|          | 0/20 [00:00<?, ?it/s]
scrolling
scrolling
Clicking on Tourism France Louvre
Has popular times
  5%|▌         | 1/20 [00:02<00:49,  2.62s/it]
scrolling
scrolling
scrolling
Clicking on Paris Autrement
No popular times available
 10%|█         | 2/20 [00:04<00:44,  2.46s/it]
scrolling
scrolling
scrolling
Clicking on Louvre Pyramid
Has popular times

etc.
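To recap the setup steps in code (a sketch; the helper name is mine, and you can just as easily do this by hand):

```python
from pathlib import Path

def reset_for_new_region(repo_dir: Path, locations_csv: str) -> None:
    """Replace locations.csv and delete any stale data.geojson so that
    scrape.py starts a fresh scrape of the new region."""
    (repo_dir / "locations.csv").write_text(locations_csv)
    (repo_dir / "data.geojson").unlink(missing_ok=True)
```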

Hope that helps!