We are about to scrape some rather large websites full of unstructured data to be collected. This will likely lead to many trial-and-error requests made to them and we might get throttled by these websites.
Ideally, we won't have to execute the same request twice. We need to set up either a proxy server with large (and persistent) cache, or to mirror the entire site.
Since setting up a proxy adds a few (minor) complications for the scraping environment (e.g. every environment hitting the cache needs its local resolv.conf updated, and forgetting to do so would void the effort), we decided to mirror the target websites in full on a server we own, then hit those clones instead, as much as we want. Once everything is set up and working (and we have completely run the scrapers over the entire sites), we can remove the mirrors and hit the real sites instead.
Tasks:
[ ] set up a server somewhere, storage doesn't matter, we are only scraping metadata
[ ] make sure access to that server is allowed on SSH, HTTP and HTTPS ports
[ ] make sure it has lots of bandwidth and no (or a very high) traffic quota
[ ] document the method used for mirroring
[ ] start the mirroring process and let it run to completion
[ ] if possible, monitor it in the meantime so it doesn't die 2 hrs in and we only notice on Monday :sweat_smile:
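For the "document the method used for mirroring" task, a minimal sketch of one plausible approach, assuming GNU wget is the tool (the URL, destination directory, and politeness delays below are placeholders, not decisions from this ticket):

```shell
#!/usr/bin/env bash
# Hypothetical mirroring sketch using GNU wget. example.com and
# /srv/mirrors are placeholder names, not from the ticket.
mirror_site() {
  local url="$1" dest="$2"
  wget --mirror               `# recurse, keep timestamps for re-runs` \
       --convert-links        `# rewrite links so the clone browses offline` \
       --adjust-extension     `# save pages with matching .html/.css extensions` \
       --page-requisites      `# also fetch images/CSS/JS each page needs` \
       --no-parent            `# never climb above the start URL` \
       --wait=1 --random-wait `# be polite while pulling the real site` \
       --directory-prefix="$dest" \
       "$url"
}

# Example call against a placeholder host:
# mirror_site "https://example.com" "/srv/mirrors/example.com"
```

The `--wait`/`--random-wait` delays only matter for the one-time pull from the real site; once the clone is up, the scrapers can hammer it freely.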
List of sites to be mirrored TBD later today.
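For the "monitor so it won't die" task, one simple option (a sketch; the poll interval and the mail alert are assumptions) is a watchdog loop that polls the mirror process and reports when it exits:

```shell
#!/usr/bin/env bash
# Hypothetical watchdog: polls a PID and prints a message when the
# process exits, so a dead mirror job is noticed before Monday.
watchdog() {
  local pid="$1" interval="${2:-60}"
  # kill -0 only checks liveness; it sends no signal.
  while kill -0 "$pid" 2>/dev/null; do
    sleep "$interval"
  done
  echo "process $pid exited at $(date -u +%FT%TZ)"
}

# Example: run the mirror in the background, then watch it and alert
# (mail address is a placeholder):
# wget --mirror ... &
# watchdog "$!" 60 | mail -s "mirror job died" ops@example.com
```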