catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks

Working with EIA data in parquet #543

Closed (grgmiller closed this issue 4 years ago)

grgmiller commented 4 years ago

I am trying to load EIA data (such as the EIA 923 boiler fuel data) into a dask dataframe after running epacems_to_parquet, but it looks like the only data in my parquet directory is the hourly EPA CEMS data.

I downloaded the pudl-eia860-eia923-epacems.tgz file from zenodo and successfully ran epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json to create a parquet db of the epacems data.

Do I need to run a separate process on the epacems datapkg to convert the eia data to sqlite and then load it into a dataframe from sqlite? e.g. datapkg_to_sqlite datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json

Or do I need to separately download pudl-eia860-eia923.tgz and then run datapkg_to_sqlite on that datapackage?

zaneselvans commented 4 years ago

Sorry, we clearly need to document all this better.

The epacems_to_parquet script only converts the one big table, hourly_emissions_epacems, which has nearly a billion records, into the Parquet format. The other data (EIA 860/923) will need to be loaded into the SQLite database using datapkg_to_sqlite, using the data package that does not contain EPA CEMS data.

For the sake of simplicity, even though it's a lot of data, I would recommend just downloading the whole archive from Zenodo and running the load-pudl.sh script, which will load absolutely everything for you.
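
Once the EIA tables are in SQLite, reading them back is ordinary SQLite access. The following is only a minimal sketch using Python's standard sqlite3 module, not PUDL's own API. The database path, the table name boiler_fuel_eia923, and the column names are placeholders assumed for illustration, so check what list_tables reports for your actual database.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def read_rows(db_path, table, limit=5):
    """Return (column_names, rows) for up to `limit` rows of `table`."""
    # Table names cannot be bound as SQL parameters, so validate the
    # name against the tables that actually exist before interpolating.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [desc[0] for desc in cur.description]
        return columns, cur.fetchall()


def build_demo_db(db_path):
    """Create a tiny stand-in database (hypothetical table and columns)."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "CREATE TABLE boiler_fuel_eia923 ("
            "plant_id_eia INTEGER, boiler_id TEXT, fuel_consumed_units REAL)"
        )
        conn.executemany(
            "INSERT INTO boiler_fuel_eia923 VALUES (?, ?, ?)",
            [(3, "1", 1200.5), (3, "2", 980.0)],
        )
        conn.commit()


if __name__ == "__main__":
    # Demo against a throwaway database; point db_path at your own
    # SQLite file (wherever datapkg_to_sqlite wrote it) instead.
    db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
    build_demo_db(db_path)
    print(list_tables(db_path))
    print(read_rows(db_path, "boiler_fuel_eia923"))
```

If you want dataframes rather than raw rows, pandas.read_sql_query accepts a query string and an open sqlite3 connection, and dask.dataframe.read_parquet can be pointed at the directory that epacems_to_parquet produced for the hourly EPA CEMS table.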

grgmiller commented 4 years ago

Thank you!