catalyst-cooperative / pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.
MIT License

Census #11

Closed: ptvirgo closed this 3 years ago

ptvirgo commented 3 years ago

Census scraper.

zaneselvans commented 3 years ago

Hey, this worked great, so long as it's given a long enough timeout to pull down all 250 MB of data. It looks like there's an option to set different thresholds for file size warnings and download timeouts for different downloaders. Given that we know the approximate maximum file sizes associated with the different datasets, would it make sense to set these appropriately for each dataset? For example, we could choose some minimum network speed we expect to accommodate and derive the timeouts from that. I get the sense that some of these files are larger than what Scrapy is typically used to download.
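For illustration, here's a minimal sketch of what per-dataset limits might look like using Scrapy's `custom_settings` hook. `DOWNLOAD_TIMEOUT`, `DOWNLOAD_WARNSIZE`, and `DOWNLOAD_MAXSIZE` are standard Scrapy settings; the spider class and the specific numbers are hypothetical, derived from assuming a ~1 MB/s minimum network speed:

```python
import scrapy


class CensusSpider(scrapy.Spider):
    """Hypothetical Census spider showing per-dataset download limits."""

    name = "census"

    # Assuming a minimum tolerated transfer rate of ~1 MB/s, a ~250 MB
    # archive needs ~250 s, so give the timeout some headroom.
    custom_settings = {
        "DOWNLOAD_TIMEOUT": 600,           # seconds before a request times out
        "DOWNLOAD_WARNSIZE": 300 * 2**20,  # warn on responses over 300 MiB
        "DOWNLOAD_MAXSIZE": 500 * 2**20,   # abort responses over 500 MiB
    }
```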

Some naming issues that go beyond the scope of this PR's specific changes:

ptvirgo commented 3 years ago

That inspired a thought.

It may be worth testing what happens if you set the download timeout really low (say, 2-10 seconds) and seeing whether it can still finish a big file (assuming you're willing to re-run it manually a few times). Looking at the documentation and a bit of the code base, it's hard to tell whether the timeout limits total download time or how long the downloader will tolerate not getting a response on a request. If I were a Scrapy engineer, I'd treat a timeout as time allowed without making progress, not total download time, because capping total download time would be annoying.
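A rough sketch of that experiment, using a placeholder URL standing in for one of the large files; `download_timeout` is a standard Scrapy `Request.meta` key that overrides the global setting for a single request:

```python
import scrapy


class TimeoutProbeSpider(scrapy.Spider):
    """Hypothetical probe: fetch one known-large file with a tiny timeout."""

    name = "timeout_probe"

    def start_requests(self):
        # Placeholder URL; substitute one of the real ~250 MB archives.
        url = "https://example.com/big-file.zip"
        # 5 s is far shorter than the full transfer could take. If the
        # download still completes, the timeout is measuring stalls, not
        # total download time.
        yield scrapy.Request(url, meta={"download_timeout": 5})

    def parse(self, response):
        self.logger.info(
            "Got %d bytes despite the 5 s timeout", len(response.body)
        )
```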

If it turns out the timeout is there to prevent network stalls rather than to cap total download time, then the real solution would be to focus on retries, sketched below.
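Purely as a sketch, and assuming Scrapy's stock RetryMiddleware is in play, the retry knobs would look something like this in `settings.py` or a spider's `custom_settings` (the setting names are real; the values are illustrative):

```python
# Retry each failed download a few more times instead of giving up;
# timeouts are among the exceptions RetryMiddleware retries by default.
RETRY_ENABLED = True
RETRY_TIMES = 5
# HTTP status codes worth retrying (illustrative list).
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```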

To be clear, I'm not exactly sure what you saw in your tests, and I'm just walking through my usual troubleshooting and assumption-testing process; if this seems contrarian or annoying, I'd be fine with choosing whatever timeout worked for you.

ptvirgo commented 3 years ago

I'm working on an update to try and implement the details you've pointed out, but it's not ready.