ptvirgo closed this 3 years ago
Hey, this worked great, so long as it's given a long enough timeout to pull down all 250 MB of data. It looks like there's an option to set different thresholds for filesize warnings and download timeouts for different downloaders. Given that we know the approximate max file sizes associated with the different datasets, would it make sense to specify these appropriately for each of them? Like, choose some minimum network speed we expect to accommodate and base the thresholds on that? I get the sense that some of these files are larger in general than what scrapy is typically used to download.
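For instance, the per-dataset thresholds could be derived from the expected archive size and an assumed floor on network speed. A minimal sketch using scrapy's standard `DOWNLOAD_TIMEOUT` and `DOWNLOAD_WARNSIZE` settings; the sizes, the 1 MB/s floor, and the `download_settings` helper are all illustrative assumptions, not measured values:

```python
# Sketch: derive per-dataset scrapy download settings from the expected
# archive size and an assumed minimum network speed. DOWNLOAD_TIMEOUT and
# DOWNLOAD_WARNSIZE are standard scrapy settings; everything else here
# is a placeholder.

MIN_BYTES_PER_SEC = 1_000_000  # assume we accommodate at least ~1 MB/s

# Hypothetical map of dataset code -> approximate max archive size (bytes).
MAX_ARCHIVE_BYTES = {
    "censusdp1tract": 250 * 1024**2,  # ~250 MB, per the comment above
    "epacems": 100 * 1024**2,
}

def download_settings(dataset, slack=2.0):
    """Return scrapy settings scaled to the dataset's expected size."""
    size = MAX_ARCHIVE_BYTES[dataset]
    return {
        # Warn once a response exceeds the size we expect for this dataset.
        "DOWNLOAD_WARNSIZE": size,
        # Allow enough time to pull the whole file at the minimum speed,
        # with some slack for connection setup and slow starts.
        "DOWNLOAD_TIMEOUT": int(slack * size / MIN_BYTES_PER_SEC),
    }
```

A spider could then declare `custom_settings = download_settings("censusdp1tract")` so each dataset gets limits proportional to its actual size.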
Some naming issues that go beyond the scope of this PR's specific changes:

- The package is named `pudl`, which means it will have a namespace collision with the main `pudl` python package if they are ever imported together, which seems like it will happen at some point, and will probably be confusing for whoever does it and then suddenly has one or the other library stop working. Let's rename this package `pudlscrapers` or something else intuitive so that doesn't happen.
- The Census dataset should be identified as `censusdp1tract` (i.e. using `census` as the agency short name, and `dp1tract` as the dataset abbreviation), which would be used in contexts where we would use `ferc1` or `eia923` for those datasets.
- The output filenames are inconsistent: some are `pudlcode-year.zip`, but EPA CEMS is `year-state.zip` with no reference to the data source, and FERC 714 is using `form714.zip`. Can we standardize on `pudlcode-majorpart-minorpart.zip`? This would get us filenames like `ferc714.zip` (because there are no partitions), `censusdp1tract-2010.zip`, and `epacems-2011-pa.zip`.
Separately, the EPA CEMS Puerto Rico data appears to be split into monthly zipfiles (e.g. `2015-pr12.zip`, rather than a zipfile with all the data for that state and year), and also if you specifically tell the script to download the 2015 PR data, it doesn't get anything:
```
epacems --year 2015 --state pr --verbose --loglevel DEBUG
Missing state from 2015pr12.zip, got pr
Missing state from 2015pr11.zip, got pr
Missing state from 2015pr10.zip, got pr
Missing state from 2015pr09.zip, got pr
Missing state from 2015pr08.zip, got pr
Missing state from 2015pr07.zip, got pr
Downloaded 0 files for year 2015
Download complete: 0 files
```
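For what it's worth, the misses above look like a filename-pattern problem. A hypothetical illustration (the regex and `matches` helper below are mine, not the actual scraper code): a matcher that expects exactly `{year}{state}.zip` can't match the monthly `2015pr12.zip` files, while one that allows an optional two-digit month part handles both layouts.

```python
# Hypothetical illustration of the miss above. Allowing an optional
# two-digit month part matches both yearly and monthly archive names.
# This regex is an assumption, not the scraper's actual code.
import re

ARCHIVE = re.compile(r"(?P<year>\d{4})(?P<state>[a-z]{2})(?P<month>\d{2})?\.zip$")

def matches(filename, year, state):
    """True if the archive belongs to the requested year and state."""
    m = ARCHIVE.match(filename)
    return bool(m) and m["year"] == str(year) and m["state"] == state
```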
`setup.py` names the package `PudleScrapers` instead of `PudlScrapers`.
That inspired a thought.
It may be worth testing what happens if you set the download timeout really low (say, 2-10 seconds) and see whether it can still work on a big file (assuming you're willing to re-run it manually a few times). Looking into the documentation and a bit into the code base, it's hard to tell whether the timeout limits total download time, or how long it will tolerate not getting a response on a download request. If I were a scrapy engineer, I'd treat a timeout as time allowed without making progress, not total download time, because limiting total download time would be annoying.
If it turns out that it's there to prevent network freezing rather than total download time, then the real solution would be to focus on retries.
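In that case, the knobs to turn would be scrapy's standard retry settings rather than a huge timeout. A sketch, with guessed rather than tuned values:

```python
# Standard scrapy retry settings; the values here are guesses, not
# tuned numbers. scrapy's RetryMiddleware already retries on twisted's
# TimeoutError by default, so raising RETRY_TIMES may be enough to get
# big files through on a flaky network.
RETRY_SETTINGS = {
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 5,         # retries after the first failed attempt
    "DOWNLOAD_TIMEOUT": 180,  # keep a sane per-request ceiling
}
```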
To be clear, I'm not exactly sure what you saw in your tests and I'm walking through my typical troubleshooting / assumption testing process, so if this seems contrarian or annoying or something it'd be fine by me to choose whatever timeout worked for you.
I'm working on an update to try and implement the details you've pointed out, but it's not ready.
Census scraper.