Closed grgmiller closed 4 years ago
Sorry, we clearly need to document all this better.
The epacems_to_parquet script only converts the one big table, hourly_emissions_epacems (which has nearly a billion records in it), into the Parquet format. The other data (EIA 860/923) will need to be loaded into the SQLite database using datapkg_to_sqlite, using the data package that does not contain EPA CEMS data.
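Once datapkg_to_sqlite has finished, the EIA tables can be queried straight out of the SQLite file with Python's standard library. Here's a rough sketch — the database path and table/column names are assumptions based on typical PUDL output, so check them against your own database; the sketch builds a tiny in-memory stand-in table so it runs on its own:

```python
import sqlite3

# In practice you'd connect to the file datapkg_to_sqlite produced, e.g.
#   conn = sqlite3.connect("sqlite/pudl.sqlite")   # path is an assumption
# For a self-contained sketch, build a tiny stand-in table in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE boiler_fuel_eia923 (
           plant_id_eia INTEGER,
           boiler_id TEXT,
           fuel_consumed_units REAL
       )"""
)
conn.executemany(
    "INSERT INTO boiler_fuel_eia923 VALUES (?, ?, ?)",
    [(3, "1", 120.5), (3, "2", 98.0)],
)

# Query the EIA 923 boiler fuel data the same way you would from pudl.sqlite.
rows = conn.execute(
    "SELECT plant_id_eia, boiler_id, fuel_consumed_units "
    "FROM boiler_fuel_eia923 WHERE plant_id_eia = 3"
).fetchall()
print(rows)
conn.close()
```

From there, pandas.read_sql_query can turn the same query into a dataframe directly.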
For the sake of simplicity, even though it's a lot of data, I would recommend just downloading the whole archive from Zenodo and running the load-pudl.sh script, which will load absolutely everything for you.
Thank you!
I am trying to load EIA data (such as the EIA 923 boiler fuel data) into a dask dataframe after running epacems_to_parquet, but it looks like the only data in my parquet directory is the hourly EPA CEMS data.
I downloaded the pudl-eia860-eia923-epacems.tgz file from zenodo and successfully ran
epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
to create a Parquet database of the EPA CEMS data. Do I need to run a separate process on the EPA CEMS datapkg to convert the EIA data to SQLite and then load it into a dataframe from SQLite? e.g.
datapkg_to_sqlite datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
Or do I need to separately download pudl-eia860-eia923.tgz and then run datapkg_to_sqlite on that data package?