Scalabull / get-tested-covid19

Open source code for community-driven, US-focused COVID-19 test locator database.
https://get-tested-covid19.org
MIT License

Scraper / crawler / UIPath Bot, Automating site data collection #23

Closed zboldyga closed 4 years ago

zboldyga commented 4 years ago

To date, our list of test sites has been built entirely through submissions to the Google Form on our website. As the number of test centers grows, this approach is quickly becoming unsustainable.

We can explore using crawlers to automatically scrape web pages. There are a few areas where this could be useful:

  1. Discovering new sources of information (websites or specific URLs) that might be good places for humans to look for test centers and other COVID-19-related resources. In other words, scraping pages and building a database of pages that match our keywords / are relevant to our project. That set of links can then be used for further scraping, or simply reviewed manually by people on our team.

  2. Discovering new test centers. Scraping pages and showing when there might be a new test center that is not in our existing list of test centers.

  3. Filling in up-to-date information about test centers. The columns we currently collect for each test center are shown in the attached spreadsheet. We probably won't find every field for every test center, but an automated bot could speed up this process considerably for newly discovered centers. Scraper Template.xlsx . Note: in the attachment, timestamp is just the time the test center was added to our list; this field can be ignored.
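For step 1 above, the relevance check could start as something very simple: score each crawled page by how many curated keywords it contains, and queue pages above a threshold for human review. A minimal sketch (the keyword list and threshold here are hypothetical placeholders, not the team's actual criteria):

```python
# Hypothetical keyword list -- the real list would be curated by the team.
KEYWORDS = [
    "covid-19",
    "coronavirus",
    "testing site",
    "test center",
    "drive-thru testing",
]

def relevance_score(page_text: str, keywords=KEYWORDS) -> int:
    """Count how many distinct keywords appear in the page text (case-insensitive)."""
    text = page_text.lower()
    return sum(1 for kw in keywords if kw in text)

def is_relevant(page_text: str, threshold: int = 2) -> bool:
    """Flag a page for human review when it matches enough distinct keywords."""
    return relevance_score(page_text) >= threshold
```

The page text itself would come from whatever fetching layer we settle on (e.g. requests + an HTML parser), with rate limiting and robots.txt compliance handled there; this function only covers the scoring step.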

zboldyga commented 4 years ago

We are in talks with UiPath, who may help us configure bot(s) for routine data-entry processes. This may not involve scraping (depending on legalities), but it could help with our data entry workflow. I'll provide more information soon.

zboldyga commented 4 years ago

Also: there is at least one other website with a test center database, so we could periodically crawl it to make sure our list contains all of their sites. This is probably the easiest option to start with, and the one with the biggest impact.
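The core of that periodic check is just a set difference: normalize each scraped record into a crude dedupe key and report anything not already in our list. A rough sketch (the `name`/`address` field names are assumptions about whatever schema we scrape into):

```python
def normalize(site: dict) -> tuple:
    """Build a crude dedupe key from name and address.

    Real matching would need fuzzier logic (abbreviations,
    punctuation, geocoding), but this is enough for a first pass.
    """
    return (site["name"].strip().lower(), site["address"].strip().lower())

def missing_from_ours(theirs: list, ours: list) -> list:
    """Return test centers present in the other database but not in ours."""
    known = {normalize(s) for s in ours}
    return [s for s in theirs if normalize(s) not in known]
```

Running this on each crawl and emailing the diff to the team would already cover the "biggest impact" case without touching our data-entry pipeline.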

zboldyga commented 4 years ago

@SumanAgr13 is handling this task.