EPA EmPOWER Project

grgmiller commented 4 years ago

This project, which was selected as part of the EPA's 2nd Annual EmPOWER Air Data Challenge, aims to develop a new dataset of hourly average emissions factors for the United States, to complement and be published alongside the EPA's eGRID database.

The two objectives of this project are:

1. Develop an open-source, Python-based workflow to re-create the eGRID2018 database, using the existing eGRID2018 methodology.
2. Develop an "eGRID hourly" dataset using 2019 data that builds upon the eGRID2018 methodology with new datasets (EIA-930) and methodologies to calculate hourly average emissions factors for each grid region in the U.S.

For both of these objectives, the basic steps are:

1. Download the data, including 2019 data, from sources including EIA-860, EIA-861, EIA-923, EIA-930, and EPA CEMS
2. Crosslink/match up data across these datasets (addressing https://github.com/catalyst-cooperative/pudl/issues/178 and https://github.com/catalyst-cooperative/pudl/issues/535 and https://github.com/catalyst-cooperative/pudl/issues/338)
3. Clean and adjust the data (including calculating the net-to-gross generation ratio, see https://github.com/catalyst-cooperative/pudl/issues/245)
4. Aggregate data to the plant level
5. Calculate emissions factors
6. Roll this data up to different geographic/grid regions and output Excel tables with final data
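
As a rough illustration of the aggregation, emissions factor, and roll-up steps above, here is a minimal pandas sketch. It assumes an emissions factor is simply total emissions divided by total net generation over the group being summarized. The column names and values are hypothetical and are not taken from eGRID or PUDL.

```python
import pandas as pd

# Hypothetical unit-level records; names and numbers are illustrative only.
units = pd.DataFrame(
    {
        "plant_id": [1, 1, 2],
        "region": ["A", "A", "A"],
        "net_generation_mwh": [100.0, 300.0, 600.0],
        "co2_mass_lb": [200_000.0, 200_000.0, 0.0],
    }
)


def emission_factors(df, group_cols):
    """Sum generation and emissions per group, then divide to get lb/MWh."""
    totals = df.groupby(group_cols, as_index=False)[
        ["net_generation_mwh", "co2_mass_lb"]
    ].sum()
    # Assumes net generation is non-zero; real data would need a guard here.
    totals["co2_lb_per_mwh"] = (
        totals["co2_mass_lb"] / totals["net_generation_mwh"]
    )
    return totals


# Aggregate units to the plant level, then roll plants up to the region level.
plant_ef = emission_factors(units, ["region", "plant_id"])
region_ef = emission_factors(plant_ef, ["region"])
```

Rolling up from summed totals rather than averaging the plant-level factors keeps the regional factor weighted by generation.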

Currently, I think that this work will involve creating or editing the following modules in pudl (although looking forward to input on whether this makes sense):

May need to address some issues in the existing ETL process (such as https://github.com/catalyst-cooperative/pudl/issues/595 and https://github.com/catalyst-cooperative/pudl/issues/604)
Need a way to load and clean EIA-930 data per https://github.com/catalyst-cooperative/pudl/issues/466 and https://github.com/catalyst-cooperative/pudl/issues/600 @truggles
analysis.egrid: new module that will contain all of the functions to perform steps 2-5 above
package_data/glue/: will contain new crosswalk tables provided by the EPA
package_data/epa/egrid/: will contain static tables like emission factors
output.egrid: new module for compiling all of the output data and building an Excel spreadsheet that will represent the final product (step 6)

Workflow: I will be adding code in a fork located at grgmiller/pudl, and will regularly sync this fork with the sprint20 branch to keep it up to date. I will make periodic pull requests back to the main project.

As I work on each aspect of this project, I will create specific issues to track progress.

More details about the project can be found in the EmPOWER Proposal - eGRID Hourly Emissions Factors.pdf

truggles commented 4 years ago

@grgmiller, when you say "Need a way to load and clean EIA-930 data per #466 and #600", are you including removing the outlier values and then imputing replacements? If the outlier data is sparse enough, simple imputation techniques could suffice. Do you have an impression of these needs?

grgmiller commented 4 years ago

@truggles I'm mostly interested in the net generation by fuel type data from 930, but removing outliers and imputing replacements would be important. I have not yet had a chance to look at all of these data to identify how many outliers there might be, but from the small handful of files I've looked at, most missing values are 1-3 hours, with the occasional whole day of missing data.
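
To make the "simple imputation" idea concrete, here is a minimal sketch that fills only short gaps (up to 3 consecutive missing hours) by linear interpolation in time and leaves longer gaps, such as a whole missing day, untouched. The function name, the 3-hour threshold, and the choice of linear interpolation are assumptions for illustration. It does not detect outliers, and it is not the cleaning code discussed in this thread.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(series, max_gap_hours=3):
    """Interpolate runs of missing hourly values no longer than max_gap_hours."""
    is_missing = series.isna()
    # Give each run of consecutive missing (or non-missing) values its own id.
    run_id = (is_missing != is_missing.shift()).cumsum()
    run_length = is_missing.groupby(run_id).transform("size")
    short_gap = is_missing & (run_length <= max_gap_hours)
    # Only fill values that have valid observations on both sides.
    interpolated = series.interpolate(method="time", limit_area="inside")
    return series.where(~short_gap, interpolated)


# Hypothetical hourly series with one 2-hour gap and one 4-hour gap.
index = pd.date_range("2019-01-01", periods=10, freq="h")
demand = pd.Series(
    [10.0, np.nan, np.nan, 16.0, np.nan, np.nan, np.nan, np.nan, 26.0, 28.0],
    index=index,
)
result = fill_short_gaps(demand)
```

In this example the 2-hour gap is filled and the 4-hour gap stays missing, so longer outages can be handled separately.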

Eventually I'm going to want to use the interchange data, but I know from having worked on it in another project that there are a lot of issues with those data.

I think as we try to integrate your existing work on the 930 demand data, it could make sense to try and pull in the net gen data at the same time. Have there been any updates on pulling 930 into PUDL in the past few months?

grgmiller commented 4 years ago

Another important PUDL update for this project will be inclusion of 2019 data from EPA and EIA into PUDL. @zaneselvans do you have a timeline yet for when year 2019 data might be available in the data release?

zaneselvans commented 4 years ago

In past years the final EIA 860/923 data has become available in the fall, usually some time in September. As soon as it's all released we'll get on integrating it, and unless something big has changed it should take a week or two to integrate. Not sure how long it would be until that integrated data shows up in a packaged release, but we would make pushing out a new one a priority as soon as the 2019 data is integrated. We wait until the "final" data is released so that we don't have to work around irregularities in the early release, and then go back in and remove those workarounds and deal with different formatting in the final release.

truggles commented 4 years ago

@grgmiller, there have been no updates on our end. We were waiting for a suggestion on how to merge: 1) the easy way, where we integrate the already cleaned data, or 2) the more complex way, where we integrate the data cleaning code + imputation code into PUDL. As of July, we now have 5 complete years of EIA 930 demand data. We need to clean and impute the newest year. Afterwards would be a perfect time to rekindle this discussion.

zaneselvans commented 4 years ago

Hey @grgmiller what would be a good way for us to chat about how the code is being organized / implementation details? I was just looking through the egrid module that you've got in your fork, and noticed that you're reading directly from the static datapackages, rather than an instantiated database.

@truggles I know @cmgosnell has an email headed your way on the EIA 930 integration.

grgmiller commented 4 years ago

@zaneselvans happy to Slack or set up a call to discuss in more detail. Yeah, the data reading needs to be fixed: I was pulling from the static datapackages for my other project but will definitely want to change that to pull from the database.
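
For reference, pulling a table from a SQLite database into pandas could look roughly like the sketch below. It uses an in-memory database and a made-up table (`plants_example`) so that it runs on its own. The actual database location and table names are not given in this thread and are not assumed here.

```python
import sqlite3

import pandas as pd

# Stand-in database; a real workflow would connect to an existing file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants_example (plant_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO plants_example VALUES (?, ?)",
    [(1, "CA"), (2, "CO")],
)
conn.commit()

# Query only the rows needed, using a bound parameter rather than string formatting.
plants = pd.read_sql_query(
    "SELECT plant_id, state FROM plants_example WHERE state = ?",
    conn,
    params=("CA",),
)
conn.close()
```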

Regarding EIA-930: As we start to move the raw data into the datastore, I would request that we also include the net generation and interchange data (in addition to the demand data). @truggles which data files are you currently working with? The six-month files or the BA/region files? I think for my purposes, I'll need the BA files.

truggles commented 4 years ago

Sorry to just get back to this. @grgmiller, we are working with the BA/region files in UTC here: https://www.eia.gov/opendata/qb.php?category=2122628

For example for CISO: https://www.eia.gov/opendata/qb.php?category=3389957&sdid=EBA.CISO-ALL.D.H

On that page you can see the "API CALL TO USE".

grgmiller commented 2 years ago

I think that we may be able to close this issue with the release of https://github.com/singularity-energy/open-grid-emissions.

There's still ongoing work to do in improving the dataset, but it may make more sense to track as part of issues in this new repo.

What do you think, @cmgosnell @zaneselvans?

catalyst-cooperative / pudl

EPA EmPOWER Project #721