Open grgmiller opened 4 years ago
@grgmiller, when you say "Need a way to load and clean EIA-930 data per #466 and #600", are you including removing the outlier values and then imputing replacements? If the outlier data is sparse enough, simple imputation techniques could suffice. Do you have an impression of these needs?
@truggles I'm mostly interested in the net generation by fuel type data from 930, but removing outliers and imputing replacements would be important. I have not yet had a chance to look at all of these data to identify how many outliers there might be, but from the small handful of files I've looked at, most missing values are 1-3 hours, with the occasional whole day of missing data.
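To make the "1-3 hours versus a whole day" distinction concrete, here is a minimal sketch of one simple imputation approach: linearly interpolate only the short gaps and leave longer gaps missing for a more careful method. The function name `fill_short_gaps`, the 3-hour threshold, and the synthetic series are assumptions for illustration, not the cleaning code discussed in this thread.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(series: pd.Series, max_gap_hours: int = 3) -> pd.Series:
    """Interpolate runs of missing values no longer than max_gap_hours.

    Hypothetical helper: assumes an hourly DatetimeIndex with no missing rows.
    """
    is_missing = series.isna()
    # Give every run of consecutive missing (or present) values its own id.
    run_id = (is_missing != is_missing.shift()).cumsum()
    # Length of the run that each element belongs to.
    run_length = run_id.map(run_id.value_counts())
    short_gap = is_missing & (run_length <= max_gap_hours)
    # Interpolate everything between valid values, then keep only the short gaps.
    interpolated = series.interpolate(method="time", limit_area="inside")
    return series.where(~short_gap, interpolated)


# Synthetic hourly series (timestamps taken to be UTC): one 2-hour gap
# and one 24-hour gap.
index = pd.date_range("2019-01-01", periods=48, freq=pd.Timedelta(hours=1))
demand = pd.Series(np.arange(48, dtype=float), index=index)
demand.iloc[5:7] = np.nan
demand.iloc[10:34] = np.nan

filled = fill_short_gaps(demand)
```

In this sketch the 2-hour gap is filled and the 24-hour gap stays missing, so whole-day outages can be handled separately. Outlier screening would need to run first, with flagged values set to missing.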
Eventually I'm going to want to use the interchange data, but I know from having worked on it in another project that there are a lot of issues with those data.
I think as we try to integrate your existing work on the 930 demand data, it could make sense to try and pull in the net gen data at the same time. Have there been any updates on pulling 930 into PUDL in the past few months?
Another important PUDL update for this project will be the inclusion of 2019 data from EPA and EIA. @zaneselvans do you have a timeline yet for when the 2019 data might be available in the data release?
In past years the final EIA 860/923 data has become available in the fall, usually some time in September. As soon as it's all released we'll get on integrating it, and unless something big has changed it should take a week or two to integrate. Not sure how long it would be until that integrated data shows up in a packaged release, but we would make pushing out a new one a priority as soon as the 2019 data is integrated. We wait until the "final" data is released so that we don't have to work around irregularities in the early release, and then go back in to remove those workarounds and deal with different formatting in the final release.
@grgmiller, there have been no updates on our end. We were waiting for a suggestion on how to merge: 1) the easy way, where we integrate the already cleaned data, or 2) the more complex way, where we integrate the data cleaning code + imputation code into PUDL. As of July, we now have 5 complete years of EIA 930 demand data. We need to clean and impute the newest year. Afterwards would be a perfect time to rekindle this discussion.
Hey @grgmiller what would be a good way for us to chat about how the code is being organized / implementation details? I was just looking through the egrid module that you've got in your fork, and noticed that you're reading directly from the static datapackages, rather than an instantiated database.
@truggles I know @cmgosnell has an email headed your way on the EIA 930 integration.
@zaneselvans happy to slack or set up a call to discuss in more detail. Yeah the data reading needs to be fixed: I was pulling from the static datapackages for my other project but will definitely want to change that to pull from the database.
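For reference, a minimal sketch of what "pull from the database" could look like, using an in-memory SQLite database as a stand-in. The table name `hypothetical_generation`, its columns, and the helper `read_table` are assumptions made up for this example; they are not actual PUDL table names or functions.

```python
import sqlite3

import pandas as pd

# Stand-in for an instantiated database; a real one would live on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table contents, loaded here only so the example is runnable.
pd.DataFrame(
    {
        "report_year": [2018, 2019, 2019],
        "plant_id": [1, 1, 2],
        "net_generation_mwh": [100.0, 110.0, 50.0],
    }
).to_sql("hypothetical_generation", conn, index=False)


def read_table(connection: sqlite3.Connection, table: str, year: int) -> pd.DataFrame:
    """Read one year of a table from the database (hypothetical helper)."""
    # The table name is interpolated directly, so it must come from trusted code;
    # the year is passed as a bound parameter.
    query = f'SELECT * FROM "{table}" WHERE report_year = ?'
    return pd.read_sql_query(query, connection, params=(year,))


generation_2019 = read_table(conn, "hypothetical_generation", 2019)
```

Compared with reading static files, filtering in the query means only the requested rows are loaded into memory.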
Regarding EIA-930: As we start to move the raw data into the datastore, I would request that we also include the net generation and interchange data (in addition to the demand data). @truggles which data files are you currently working with? The six-month files or the BA/region files? I think for my purposes, I'll need the BA files.
Sorry to just get back to this. @grgmiller, we are working with the BA/region files in UTC here: https://www.eia.gov/opendata/qb.php?category=2122628
For example for CISO: https://www.eia.gov/opendata/qb.php?category=3389957&sdid=EBA.CISO-ALL.D.H
That page shows the "API CALL TO USE" for the series.
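As a rough sketch of how a series such as `EBA.CISO-ALL.D.H` might be requested and parsed: the endpoint layout in `API_ROOT`, the `api_key` and `series_id` query parameters, the payload shape, and the `YYYYMMDDTHHZ` timestamp format are all assumptions based on the series-style API of that era, which may have changed since. The sample payload is invented, and no network call is made.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed series-style endpoint; check the current EIA API documentation.
API_ROOT = "https://api.eia.gov/series/"


def build_series_url(series_id: str, api_key: str) -> str:
    """Build the request URL for one series (hypothetical helper)."""
    return f"{API_ROOT}?{urlencode({'api_key': api_key, 'series_id': series_id})}"


def parse_series(payload: dict) -> list:
    """Turn an assumed payload into (UTC datetime, value) pairs."""
    rows = payload["series"][0]["data"]
    return [
        (datetime.strptime(stamp, "%Y%m%dT%HZ").replace(tzinfo=timezone.utc), value)
        for stamp, value in rows
    ]


url = build_series_url("EBA.CISO-ALL.D.H", "YOUR_API_KEY_HERE")

# Invented payload in the assumed shape, newest hour first.
sample_payload = {
    "series": [
        {
            "series_id": "EBA.CISO-ALL.D.H",
            "data": [["20200101T01Z", 24000], ["20200101T00Z", 25000]],
        }
    ]
}
observations = parse_series(sample_payload)
```

Keeping the timestamps timezone-aware in UTC avoids ambiguity when the hourly data is later joined against other sources.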
I think that we may be able to close this issue with the release of https://github.com/singularity-energy/open-grid-emissions.
There's still ongoing work to do in improving the dataset, but it may make more sense to track as part of issues in this new repo.
What do you think, @cmgosnell @zaneselvans?
This project, which was selected as part of the EPA's 2nd Annual EmPOWER Air Data Challenge, aims to develop a new dataset of hourly average emissions factors for the United States, to complement and be published alongside the EPA's eGRID database.
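As an illustration only, one common reading of an "hourly average emissions factor" is total emissions divided by total net generation for a region in each hour. The sketch below assumes that definition; the column names (`ba_code`, `datetime_utc`, `co2_mass_lb`, `net_generation_mwh`) and the sample numbers are invented, and this is not necessarily the method the project uses.

```python
import pandas as pd


def hourly_emission_factors(plant_hours: pd.DataFrame) -> pd.DataFrame:
    """Sum emissions and generation per region and hour, then divide.

    Hypothetical helper; hours with zero total generation yield inf or NaN.
    """
    totals = plant_hours.groupby(["ba_code", "datetime_utc"], as_index=False)[
        ["co2_mass_lb", "net_generation_mwh"]
    ].sum()
    totals["co2_lb_per_mwh"] = totals["co2_mass_lb"] / totals["net_generation_mwh"]
    return totals


# Invented data: two plants in one balancing authority over two hours.
plant_hours = pd.DataFrame(
    {
        "ba_code": ["CISO"] * 4,
        "datetime_utc": pd.to_datetime(
            ["2019-01-01 00:00", "2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 01:00"]
        ),
        "co2_mass_lb": [1000.0, 0.0, 600.0, 0.0],
        "net_generation_mwh": [1.0, 3.0, 1.0, 2.0],
    }
)
factors = hourly_emission_factors(plant_hours)
```

Summing before dividing gives a generation-weighted average, which differs from averaging each plant's own emission rate.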
The two objectives of this project are:
For both of these objectives, the basic steps are:
Currently, I think that this work will involve creating or editing the following modules in pudl (although looking forward to input on whether this makes sense):
- analysis.egrid: new module that will contain all of the functions to perform steps 2-5 above
- package_data/glue/: will contain new crosswalk tables provided by the EPA
- package_data/epa/egrid/: will contain static tables like emission factors
- output.egrid: new module for compiling all of the output data and building an Excel spreadsheet that will represent the final product (step 6)

Workflow: I will be adding code in a fork located at grgmiller/pudl, and will regularly sync this fork with the sprint20 branch to keep it up to date. I will make periodic pull requests back to the main project.

As I work on each aspect of this project, I will create specific issues to track progress.
More details about the project can be found in the EmPOWER Proposal - eGRID Hourly Emissions Factors.pdf