Closed prabhushrikant closed 3 years ago
The current code from the hackathon is available here: https://github.com/ShelterApp/AddResources/tree/ILMS_scraper
However, it is written as a Python notebook, which we need to convert into a regular Python script. The script also has serious issues with uploading data to the Mongo collection and with finding duplicates.
Scrape US Library data available here: https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey
Use the latest data, currently from 2018, but the scraper should check whether newer data (2019, 2020, or later years) is available. Data is provided in zip format, so download the zip, unzip it, and work on the CSV from the unzipped folder.
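A minimal sketch of the download-and-unzip step, using only the standard library. The function names (`newest_year`, `download_zip`, `read_csv_rows_from_zip`) are hypothetical, and the sketch assumes that download links contain a four-digit year and that the archive contains at least one `.csv` file. The `latin-1` default encoding is also an assumption and should be checked against the real file.

```python
import csv
import io
import re
import urllib.request
import zipfile

# Matches a standalone four-digit year such as 2018 or 2019.
YEAR_PATTERN = re.compile(r"(?<!\d)(20\d{2})(?!\d)")


def newest_year(links):
    """Return the largest year found in the given link strings, or None."""
    years = [int(y) for link in links for y in YEAR_PATTERN.findall(link)]
    return max(years) if years else None


def download_zip(url):
    """Fetch the zip archive at `url` and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_csv_rows_from_zip(zip_bytes, encoding="latin-1"):
    """Open the first .csv member of the archive and return its rows as dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in archive")
        with archive.open(csv_names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.DictReader(text))
```

The scraper could collect the links from the survey page, call `newest_year` on them, and only download when the result is greater than the year already loaded. If the archive holds more than one CSV, the member selection would need to be more specific than "first match".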
We can schedule the scraper to look for newer data once a month (please confirm with the team).
We need the following fields from the table: LIBNAME (name), address1, city, STABR (state), zip, and phone. serviceSummary can default to “Computers, Internet, Books, Charging Stations, Restrooms”.
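The field mapping could be sketched as below. Only `LIBNAME` and `STABR` are column names given in this issue; the other CSV column names (`ADDRESS`, `CITY`, `ZIP`, `PHONE`) are assumptions that need to be verified against the actual file header. `row_to_record` is a hypothetical helper name.

```python
DEFAULT_SERVICE_SUMMARY = "Computers, Internet, Books, Charging Stations, Restrooms"

# Target field -> CSV column. LIBNAME and STABR come from the issue text;
# the remaining column names are assumed and must be checked against the CSV.
FIELD_TO_COLUMN = {
    "name": "LIBNAME",
    "address1": "ADDRESS",
    "city": "CITY",
    "state": "STABR",
    "zip": "ZIP",
    "phone": "PHONE",
}


def row_to_record(row):
    """Build one service record from a CSV row (a dict keyed by column name)."""
    record = {
        field: (row.get(column) or "").strip()
        for field, column in FIELD_TO_COLUMN.items()
    }
    record["serviceSummary"] = DEFAULT_SERVICE_SUMMARY
    return record
```

Missing columns become empty strings here rather than raising, so a wrong column name in the mapping shows up as blank fields; a stricter version could fail fast instead.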
The scraper should copy the data into the tmpILMS collection. It should also compare the data with the existing service and tmp* collections for duplicates (using fuzzy search) and copy identified duplicates into the tmpILMSDuplicates collection.
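The issue does not say which fuzzy-matching method to use, so the sketch below substitutes `difflib.SequenceMatcher` from the standard library as a stand-in. The helper names, the fields used for the comparison key, and the `0.9` threshold are all assumptions to be tuned with the team.

```python
from difflib import SequenceMatcher

# Fields combined into the comparison key (an assumed choice).
KEY_FIELDS = ("name", "address1", "city", "state", "zip")


def match_key(record):
    """Normalize a record into one lowercase string for comparison."""
    parts = (str(record.get(field, "")).strip().lower() for field in KEY_FIELDS)
    return " ".join(parts)


def split_duplicates(new_records, existing_records, threshold=0.9):
    """Return (unique, duplicates): a new record is a duplicate when its key
    is at least `threshold` similar to the key of any existing record."""
    existing_keys = [match_key(r) for r in existing_records]
    unique, duplicates = [], []
    for record in new_records:
        key = match_key(record)
        is_duplicate = any(
            SequenceMatcher(None, key, other).ratio() >= threshold
            for other in existing_keys
        )
        (duplicates if is_duplicate else unique).append(record)
    return unique, duplicates
```

With pymongo, the two lists could then be written with `insert_many` to tmpILMS and tmpILMSDuplicates respectively, with `existing_records` loaded from the service and tmp* collections. This pairwise comparison is quadratic, so for the full dataset it may help to compare only records that share the same state or zip.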