Closed: kylebd99 closed this issue 1 year ago
Unfortunately this often fails, because there are 1300+ files and several GB of data in the archive. For this reason, we mostly rely on the Zenodo archives as permanent "cold storage" for the original data, and use them to populate either a local or cloud storage cache, which is what we actually run the ETL against. You can do that too, or you can use the already processed data if you don't have a particular need to run the ETL yourself.
To pull the raw data down from Zenodo to a local cache, you can do:
pudl_datastore --dataset epacems
This is still pulling the data from Zenodo, so it'll still be a bit flaky and slow. You may need to run it several times (each time it will only download the data it didn't already get). Once all of the years of data have been downloaded, the ETL will use the local cache when you run it.
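The restarting can be scripted. Here's a minimal sketch of a retry wrapper; it assumes pudl_datastore exits with a nonzero status when a download fails, which may not hold for every version, so check the exit behavior of your install first:

```shell
#!/bin/sh
# Re-run a command until it succeeds, up to a fixed number of attempts.
# This is safe here because pudl_datastore skips files it has already cached.
retry() {
  max="$1"; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep 1  # brief pause before retrying
  done
}

# Hypothetical usage; assumes pudl_datastore returns nonzero on a
# failed or partial download:
# retry 10 pudl_datastore --dataset epacems
```

Because each run only fetches what's missing, re-running until a clean exit eventually leaves you with a complete local cache.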
Alternatively, you can pull from our public Google Cloud Storage cache. To do this you'll need to be on the dev branch, and you'll need a Google Cloud account set up for billing (they offer $300 of free credits when you create a new account), because the publicly cached data is "requester pays" so that we don't get hammered with data egress fees if someone is automatically downloading the data. In North America the egress fees are about $0.25/GB, so the entire CEMS dataset is around $1 to download this way (paid to Google, not us!). This method is fast and much more reliable.
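For the requester-pays access to work, your machine needs Google Cloud credentials and a billing project configured. A sketch of the one-time setup, assuming the tools pick up application-default credentials (the project ID is a placeholder for your own project, which must have billing enabled):

```shell
# One-time local setup for requester-pays bucket access.
# Standard gcloud commands; YOUR-PROJECT-ID is a placeholder.
gcloud auth application-default login
gcloud config set project YOUR-PROJECT-ID
```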
You can either pre-download:
pudl_datastore --gcs-cache-path gs://zenodo-cache.catalyst.coop --dataset epacems
Or just tell the ETL that it should obtain its data this way directly (which will also cache it locally for future use):
pudl_etl --gcs-cache-path gs://zenodo-cache.catalyst.coop settings/etl_full.yml
If you'd like to just use the preprocessed data, you can access it through the PUDL Intake Data Catalog (again, use the dev branch...). This will also require setting up Google Cloud billing/authentication.
If you don't actually need the EPA CEMS data, you can remove it from (or comment it out in) the settings file and the ETL will run fine; you can run just the CEMS part of the ETL later, against an existing PUDL DB, if you want to. Or you can access the PUDL and raw FERC Form 1 DBs from our Datasette deployment: https://data.catalyst.coop
Also: the reason you get a "file not found" when you try to go to the failing API URL isn't that the file isn't there; it's that the file is only accessible if you're authenticated with a Zenodo API key, which would have to be sent to the webserver in the request headers.
This worked great! I had to restart the pudl_datastore command a couple times like you said, but it definitely managed it eventually.
Describe the bug
In the second step of the ETL pipeline, "pudl_etl settings/etl_full.yml", there is an error that causes it to crash with a read timeout. Specifically, it occurs when it tries to download "https://zenodo.org/api/files/19847a4e-f9d1-4b7a-840a-69b88e751a0e/epacems-1997-mo.zip". Putting this URL into a browser indicates that the file doesn't exist on Zenodo ("The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again."), so maybe the file structure was changed recently without the ETL being updated?
The full log from the command is as follows:
Bug Severity
How badly is this bug affecting you? High: This bug is preventing me from using PUDL.
To Reproduce
This occurred while following the ETL steps from here, after setting up the dev environment according to here. Specifically, it happens in the second step, "pudl_etl settings/etl_full.yml", where the settings file was generated by pudl_setup.
Expected behavior
Ideally, the ETL pipeline would run to completion.
Software Environment?
Installation method (git clone, pip, or conda): git clone
Additional context
Attached is the settings file used (renamed to .txt so it could be attached): etl_full.txt