ShelterApp / AddResources

http://shelterapp.org/

Scrape US Libraries data (ILMS) #22

Closed prabhushrikant closed 3 years ago

prabhushrikant commented 3 years ago

Scrape US Library data available here: https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey

Use the latest data, from 2018, but the scraper should also check whether newer data (2019, 2020, or later years) is available. The data is distributed as a zip file, so download the zip, unzip it, and work on the CSV in the unzipped folder.
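A minimal sketch of the download-and-unzip step, using only the Python standard library. The helper names, the `latin-1` default encoding, and the example file names are assumptions for illustration, not details taken from the IMLS site. Picking the first CSV in the archive is a simplification; the real archive may contain several CSV files.

```python
import csv
import io
import re
import zipfile
from urllib.request import urlopen


def download_bytes(url: str) -> bytes:
    """Fetch the raw bytes of the zip archive at `url` (URL supplied by the caller)."""
    with urlopen(url) as response:
        return response.read()


def newest_year(labels, known_year=2018):
    """Return the largest 4-digit year (20xx) found in `labels`, or `known_year`
    if nothing newer appears. `labels` can be link texts or file names."""
    years = [int(y) for label in labels
             for y in re.findall(r"(?<!\d)(20\d{2})(?!\d)", label)]
    return max(years + [known_year])


def read_csv_rows_from_zip(zip_bytes: bytes, encoding: str = "latin-1"):
    """Return the rows of the first .csv member of a zip archive as dicts.

    The encoding is an assumption; adjust it to match the actual file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.DictReader(text))
```

Reading the archive from memory avoids leaving unzipped folders behind between scheduled runs. `newest_year` only compares years found in text the caller has already collected; how those labels are obtained from the page is left open here.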

We can schedule the scraper to look for newer data once a month (?) (please confirm with the team).

We need the following fields from the table: LIBNAME (name), address1, city, STABR (state), zip, and phone. serviceSummary can default to “Computers, Internet, Books, Charging Stations, Restrooms”.
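The field mapping could look roughly like this. Only `LIBNAME` and `STABR` are column names given in this issue; the other CSV column names (`ADDRESS`, `CITY`, `ZIP`, `PHONE`) are assumptions and should be checked against the actual file header.

```python
DEFAULT_SERVICE_SUMMARY = "Computers, Internet, Books, Charging Stations, Restrooms"

# Target field -> CSV column. LIBNAME and STABR come from the issue text;
# the remaining column names are assumed and may differ in the real file.
COLUMN_MAP = {
    "name": "LIBNAME",
    "address1": "ADDRESS",
    "city": "CITY",
    "state": "STABR",
    "zip": "ZIP",
    "phone": "PHONE",
}


def to_service_record(row: dict) -> dict:
    """Convert one CSV row (as a dict) into the record shape described above."""
    record = {field: (row.get(column) or "").strip()
              for field, column in COLUMN_MAP.items()}
    record["serviceSummary"] = DEFAULT_SERVICE_SUMMARY
    return record
```

Missing columns become empty strings rather than raising, so a header mismatch shows up as blank fields that are easy to spot in the output.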

The scraper should copy the data into the tmpILMS collection. It should also compare the data with the existing service and tmp* collections for duplicates (using fuzzy search) and copy any identified duplicates into the tmpILMSDuplicates collection.
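One possible shape for the duplicate check, sketched with the standard library's `difflib.SequenceMatcher` as the fuzzy matcher. The function names, the compared fields, and the 0.9 threshold are assumptions, not the team's chosen method. The sketch works on plain lists of dicts and leaves the MongoDB reads and writes to the caller.

```python
import re
from difflib import SequenceMatcher

# Fields joined into the comparison key; this choice is an assumption.
KEY_FIELDS = ("name", "address1", "city", "state", "zip")


def normalize(record: dict) -> str:
    """Build a lowercase, punctuation-free comparison key for a record."""
    text = " ".join(str(record.get(field, "")) for field in KEY_FIELDS)
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """True when the two comparison keys are at least `threshold` similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def split_duplicates(new_records, existing_records, threshold: float = 0.9):
    """Split `new_records` into (unique, duplicates) relative to `existing_records`."""
    unique, duplicates = [], []
    for record in new_records:
        if any(is_duplicate(record, old, threshold) for old in existing_records):
            duplicates.append(record)
        else:
            unique.append(record)
    return unique, duplicates
```

With this split, the `unique` list would go to tmpILMS and the `duplicates` list to tmpILMSDuplicates. The comparison is pairwise, so the cost grows with the product of the two list sizes; grouping candidates by state or zip before comparing would likely be needed for the full data set.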

prabhushrikant commented 3 years ago

The current code from the hackathon is available here: https://github.com/ShelterApp/AddResources/tree/ILMS_scraper

However, it is written as a Python notebook, which we need to convert into a regular Python script. The script also has serious issues with uploading the data to the Mongo collection and with finding duplicates.