calpoly-csai / csai-scraping

Web scraping for Nimbus

Skip scraping of websites that are not responding #10

Closed austinsilveria closed 4 years ago

austinsilveria commented 4 years ago

Currently the sustainer attempts to run all scrapers sequentially, no matter what. This poses a problem when one of the websites is not responding but we still want to scrape the others.

Two potential solutions to this problem are the following:

  1. Update the scrapers with common logic to fail gracefully when a website is not responding, so that the other scrapers can still complete. If this option is chosen, we would have to represent the missing data in the output CSV so the API knows to keep the old data in the database (see the sketch after this list).
  2. Update each scraper to make its own calls to update the database. This would ensure that one scraper failing does not affect the other scrapers in the system, and it would not be necessary to represent missing data because a failing scraper would simply not make an API call.

Option 2 seems like the more robust solution, but it would require coordination with the data team to set up a separate API call for each scraper (this should not be a big issue, since they most likely already make separate updates behind the API that accepts the aggregated data).
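
For reference, here is a minimal sketch of what Option 1 could look like, assuming each scraper is a callable that raises a `requests` exception when its site is down. The `SCRAPERS` mapping, `run_all`, and the `MISSING` sentinel are illustrative names, not existing Nimbus code:

```python
import csv
import logging

import requests

# Illustrative registry: scraper name -> callable returning rows of data.
SCRAPERS = {}

def run_all(scrapers, out_path="scraped.csv"):
    """Run every scraper; on failure, write a sentinel row so the API
    knows to keep the old data for that source."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, scrape in scrapers.items():
            try:
                for row in scrape():
                    writer.writerow([name, *row])
            except requests.exceptions.RequestException:
                # Site not responding: record a sentinel instead of data.
                logging.warning("%s is not responding; keeping old data", name)
                writer.writerow([name, "MISSING"])
```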

cameron-toy commented 4 years ago

The requests library I use to make the web requests allows you to set a timeout. I'll add that with some reasonable defaults... maybe 15 seconds, then on to the next scraper?
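
A minimal sketch of that approach, assuming the scrapers fetch pages with `requests.get` (the `fetch` helper is hypothetical, not the actual scraper code):

```python
import requests

def fetch(url, timeout=15):
    """Return the page body, or None if the site times out or errors,
    so the caller can move on to the next scraper instead of hanging."""
    try:
        resp = requests.get(url, timeout=timeout)  # timeout in seconds
        resp.raise_for_status()
        return resp.text
    except requests.exceptions.RequestException as e:
        print(f"Skipping {url}: {e}")
        return None
```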