CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

Use a mirror for scraping sources to avoid hitting them more than necessary #2

Closed. nightsh closed this issue 4 years ago

nightsh commented 4 years ago

We are about to scrape some rather large websites full of unstructured data. This will likely involve many trial-and-error requests, and the sites might throttle us.

Ideally, we would never have to execute the same request twice. To achieve that, we need either a proxy server with a large, persistent cache or a full mirror of each site.
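To illustrate the "never execute the same request twice" idea, here is a minimal sketch of a persistent on-disk cache. It is not the proxy server mentioned above, just a simpler client-side stand-in, and the names `cached_fetch`, `_download`, and `cache_dir` are hypothetical, not taken from the edscrapers code.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def _download(url: str) -> bytes:
    """Default fetcher: a plain network request (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def cached_fetch(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = _download,
) -> bytes:
    """Return the body for `url`, hitting the network only on a cache miss.

    Responses are stored on disk under a SHA-256 hash of the URL, so the
    cache survives between scraper runs.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / key
    if path.exists():
        # Cache hit: no request is sent to the source site.
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

Passing the fetcher as a parameter is a design choice made here so that the cache logic can be tried out without touching the real sites.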

Setting up a proxy adds a few (minor) complications to the scraping environment: for example, every environment hitting the cache needs a local resolv.conf update, and forgetting to make it would void the effort. We therefore decided to mirror the entire websites on a server we own and hit those clones instead, as often as we want. Once everything is set up and working (and we have run the scrapers over the entire site), we can remove the mirrors and hit the real sites instead.
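One way the scrapers could switch between the clones and the real sites is a small URL-rewriting helper. This is only a sketch under assumptions: the mirror host name, the `USE_MIRROR` flag, and the layout (original host name as the first path segment on the mirror) are all hypothetical and not something decided in this issue.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values: replace with the real mirror server once it exists.
MIRROR_HOST = "mirror.example.internal"
USE_MIRROR = True


def to_mirror_url(url: str, mirror_host: str = MIRROR_HOST, scheme: str = "http") -> str:
    """Rewrite a source URL so that it points at the mirror server.

    Assumed layout: the original host name becomes the first path segment,
    so several mirrored sites can share one server.
    """
    parts = urlsplit(url)
    path = f"/{parts.netloc}{parts.path}"
    # Keep the query string, drop the fragment (it is never sent to a server).
    return urlunsplit((scheme, mirror_host, path, parts.query, ""))


def resolve(url: str, use_mirror: bool = USE_MIRROR) -> str:
    """Return the URL a scraper should actually request."""
    return to_mirror_url(url) if use_mirror else url
```

With a single flag like this, removing the mirrors later would only mean setting `USE_MIRROR` to `False`; the scrapers themselves would keep using the original URLs.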

Tasks:

List of sites to be mirrored TBD later today.

Daniellappv commented 4 years ago

this is no longer needed, yeah? @nightsh

nightsh commented 4 years ago

No longer needed, closing.