ptvirgo closed this 3 years ago
Hey, this worked great, so long as it's given a long enough timeout to pull down all 250 MB of data. It looks like there's an option to set different thresholds for filesize warnings and download timeouts for different downloaders. Given that we know the approximate max file sizes associated with the different datasets, would it make sense to specify these appropriately for each of them? Like, choose some minimum network speed we expect to accommodate and base the thresholds on that? I get the sense that some of these files are larger in general than what scrapy is typically used to download.
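For instance, the per-dataset thresholds could be derived from the expected archive size and an assumed floor on network speed. A minimal sketch using scrapy's standard `DOWNLOAD_TIMEOUT` and `DOWNLOAD_WARNSIZE` settings; the sizes, the 1 MB/s floor, and the `download_settings` helper are all illustrative assumptions, not measured values:

```python
# Sketch: derive per-dataset scrapy download settings from the expected
# archive size and an assumed minimum network speed. DOWNLOAD_TIMEOUT and
# DOWNLOAD_WARNSIZE are standard scrapy settings; everything else here
# is a placeholder.

MIN_BYTES_PER_SEC = 1_000_000  # assume we accommodate at least ~1 MB/s

# Hypothetical map of dataset code -> approximate max archive size (bytes).
MAX_ARCHIVE_BYTES = {
    "censusdp1tract": 250 * 1024**2,  # ~250 MB, per the comment above
    "epacems": 100 * 1024**2,
}

def download_settings(dataset, slack=2.0):
    """Return scrapy settings scaled to the dataset's expected size."""
    size = MAX_ARCHIVE_BYTES[dataset]
    return {
        # Warn once a response exceeds the size we expect for this dataset.
        "DOWNLOAD_WARNSIZE": size,
        # Allow enough time to pull the whole file at the minimum speed,
        # with some slack for connection setup and slow starts.
        "DOWNLOAD_TIMEOUT": int(slack * size / MIN_BYTES_PER_SEC),
    }
```

A spider could then declare `custom_settings = download_settings("censusdp1tract")` so each dataset gets limits proportional to its actual size.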
Some naming issues that go beyond the scope of this PR's specific changes:

- The package is named `pudl`, which means it will have a namespace collision with the main `pudl` python package if they are ever imported together, which seems like it will happen at some point, and will probably be confusing for whoever does it and then suddenly has one or the other library stop working. Let's rename this package `pudlscrapers` or something else intuitive so that doesn't happen.
- The Census dataset should be identified as `censusdp1tract` (i.e. using `census` as the agency short name, and `dp1tract` as the dataset abbreviation), which would be used in contexts where we would use `ferc1` or `eia923` for those datasets.
- The output filenames are inconsistent: some are `pudlcode-year.zip`, but EPA CEMS is `year-state.zip` with no reference to the data source, and FERC 714 is using `form714.zip`. Can we standardize on `pudlcode-majorpart-minorpart.zip`? This would get us filenames like `ferc714.zip` (because there are no partitions), `censusdp1tract-2010.zip`, and `epacems-2011-pa.zip`.
Separately, the EPA CEMS Puerto Rico data appears to be split into monthly zipfiles (e.g. `2015-pr12.zip`, rather than a zipfile with all the data for that state and year), and also if you specifically tell the script to download the 2015 PR data, it doesn't get anything:
```
epacems --year 2015 --state pr --verbose --loglevel DEBUG
Missing state from 2015pr12.zip, got pr
Missing state from 2015pr11.zip, got pr
Missing state from 2015pr10.zip, got pr
Missing state from 2015pr09.zip, got pr
Missing state from 2015pr08.zip, got pr
Missing state from 2015pr07.zip, got pr
Downloaded 0 files for year 2015
Download complete: 0 files
```
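For what it's worth, the misses above look like a filename-pattern problem. A hypothetical illustration (the regex and `matches` helper below are mine, not the actual scraper code): a matcher that expects exactly `{year}{state}.zip` can't match the monthly `2015pr12.zip` files, while one that allows an optional two-digit month part handles both layouts.

```python
# Hypothetical illustration of the miss above. Allowing an optional
# two-digit month part matches both yearly and monthly archive names.
# This regex is an assumption, not the scraper's actual code.
import re

ARCHIVE = re.compile(r"(?P<year>\d{4})(?P<state>[a-z]{2})(?P<month>\d{2})?\.zip$")

def matches(filename, year, state):
    """True if the archive belongs to the requested year and state."""
    m = ARCHIVE.match(filename)
    return bool(m) and m["year"] == str(year) and m["state"] == state
```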
`setup.py` names the package `PudleScrapers` instead of `PudlScrapers`.
That inspired a thought.
It may be worth testing what happens if you set the download timeout really low (say, 2-10 seconds) and see whether it can still work on a big file (assuming you're willing to re-run it manually a few times). Looking into the documentation and a bit into the code base, it's hard to tell whether the timeout limits total download time, or how long it will tolerate not getting a response on a download request. If I were a scrapy engineer, I'd treat a timeout as time allowed without making progress, not total download time, because limiting total download time would be annoying.
If it turns out that it's there to prevent network freezing rather than total download time, then the real solution would be to focus on retries.
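In that case, the knobs to turn would be scrapy's standard retry settings rather than a huge timeout. A sketch, with guessed rather than tuned values:

```python
# Standard scrapy retry settings; the values here are guesses, not
# tuned numbers. scrapy's RetryMiddleware already retries on twisted's
# TimeoutError by default, so raising RETRY_TIMES may be enough to get
# big files through on a flaky network.
RETRY_SETTINGS = {
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 5,         # retries after the first failed attempt
    "DOWNLOAD_TIMEOUT": 180,  # keep a sane per-request ceiling
}
```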
To be clear, I'm not exactly sure what you saw in your tests and I'm walking through my typical troubleshooting / assumption testing process, so if this seems contrarian or annoying or something it'd be fine by me to choose whatever timeout worked for you.
I'm working on an update to try and implement the details you've pointed out, but it's not ready.
Census scraper.