Open truggles opened 4 years ago

I would like to add cleaned EIA Form 930 hourly demand data to your database. We talked about this many months ago. We have a Zenodo archive with the cleaned data that is ready for use: https://zenodo.org/record/3690240

I would be happy to work on incorporating this. I heard there may have been some restructuring since we discussed this at the NREL OpenMod workshop @cmgosnell. Should I simply read the latest docs on contributing to refresh my memory and start working?
Hey @truggles! @zaneselvans was actually just talking about using your data cleaning methodologies today. It would be great to have the cleaned 930 incorporated into PUDL. We have gone through some restructuring, although we haven't made new "how to add a new data set" docs. We are mid-transition to using Zenodo as our main datastore for raw data inputs, which @ptvirgo has been bottom-lining - this is all happening on the datastore2 branch.
Were you wanting to add your cleaned data (i.e. from the Zenodo archive), or would you want to incorporate the full data processing pipeline into PUDL? I assume the former would be better, but it is a bit outside of our existing process.
Maybe we should schedule a call or something and we can talk through what you are thinking and the implementation :-)
I was thinking we would be incorporating the already cleaned data from the above Zenodo archive. While we worked to make the process consist of just a few notebooks, there are still some complications with mixing Python and R in the same workflow. And, for the moment, I feel much better knowing someone has looked at the results of the cleaning before they are public.
Let's discuss on a call. I'll email you two.
Thus far we've tried to focus on datasets that don't already have good programmatic accessibility, to minimize the collection of things that we end up responsible for maintaining, and to avoid having people be confused about where to get the real data. The EIA 930 has a good API, doesn't it? What kind of processing are you doing?
We're also trying to give folks transparency into the processing that's being done by having it all laid out in the PUDL repository, so it might feel a little weird to be pulling in another post-processed dataset with all the logic that's doing the processing stored elsewhere. Is there a major impediment to doing all of the processing in Python? Is it functionality that only exists in R right now?
@truggles and @cmgosnell I've been using EIA-930 data recently in combination with CEMS data from PUDL and would be happy to help contribute to this.
It would be great if we could get the time-series imputation stuff working all in Python and a bit generalized for our use cases, since we're going to have to do it on other datasets too I imagine -- I'm working with the FERC 714 historical hourly demand data right now and it's definitely got plenty of messes that need to be cleaned up. I haven't looked at the CEMS data in enough detail to know what kinds of outliers / missing values it has. @grgmiller or @karldw do you have a sense? Could this process be useful in that context too?
For CEMS, I've found that the heat input data is pretty reliable, but the CO2 mass is sometimes missing. In these cases, I've written code that assigns a fuel type to each unit for each time period based on EIA-923 boiler fuel data and multiplies the heat input by a fuel-specific emission factor. Also, some generators are only required to report during certain seasons, but I haven't really dug into how you might impute that missing data, if that's even possible.
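For reference, here is a minimal pandas sketch of that emission-factor approach. The column names, the fuel-type assignment, and the factor values are illustrative placeholders, not the actual PUDL/CEMS schema or my exact code:

```python
import pandas as pd

# Illustrative fuel-specific CO2 emission factors (kg CO2 per MMBtu);
# real values would come from EPA/EIA reference tables.
EMISSION_FACTORS = {"NG": 53.07, "BIT": 93.28, "SUB": 97.17}

def fill_missing_co2(cems: pd.DataFrame, unit_fuels: pd.DataFrame) -> pd.DataFrame:
    """Estimate missing CO2 mass as heat input times a fuel-specific factor.

    cems: hourly records with 'unit_id', 'heat_input_mmbtu', 'co2_mass_kg'
    unit_fuels: one row per unit with its assigned 'fuel_type'
        (e.g. derived from EIA-923 boiler fuel data)
    """
    df = cems.merge(unit_fuels[["unit_id", "fuel_type"]], on="unit_id", how="left")
    factor = df["fuel_type"].map(EMISSION_FACTORS)
    estimate = df["heat_input_mmbtu"] * factor
    # Only fill hours where reported CO2 mass is missing.
    df["co2_mass_kg"] = df["co2_mass_kg"].fillna(estimate)
    return df
```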
We had looked into conducting the imputation in Python and did not originally find a package that provided the same flexibility as the one we used in R, called mice. We used a Multiple Imputation by Chained Equations (MICE) method. Statsmodels has a version, as do some other Python packages.
It might be possible to explain what we did to the Statsmodels community to see if they can think of how to replicate what we did with their algorithms.
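For comparison, this is roughly what MICE-style imputation looks like with statsmodels. It's a rough sketch on made-up column names; whether it can reproduce the flexibility of the R mice package is exactly the open question:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

# Hypothetical hourly demand frame with gaps in 'demand_mwh'.
df = pd.DataFrame({
    "demand_mwh": [1000.0, None, 1100.0, 1050.0, None, 980.0],
    "temperature": [20.0, 22.0, 21.0, 19.0, 18.0, 17.0],
    "hour_of_day": [0, 1, 2, 3, 4, 5],
})

imp_data = mice.MICEData(df)  # chained-equations imputation state
imp_data.update_all(10)       # run several imputation cycles
filled = imp_data.data        # DataFrame with current imputed values

# statsmodels can also pool model estimates across multiple imputed datasets:
model = mice.MICE("demand_mwh ~ temperature + hour_of_day", sm.OLS, imp_data)
results = model.fit(n_burnin=5, n_imputations=5)
print(results.summary())
```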
PUDL already depends on scikit-learn, and we have some familiarity with their usage patterns. They also have an iterative imputation estimator that they say was adapted from the R MICE implementation. It looks like it's tagged "experimental" so maybe it's new? Can it do the kind of imputation you're doing? I wonder if we could turn the whole outlier identification and imputation process into a single scikit-learn Pipeline composed of several column transformers?
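For what it's worth, the experimental IterativeImputer drops into exactly that kind of Pipeline. A minimal sketch with made-up feature names (not a claim that it matches mice's feature set):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline

# Hypothetical hourly demand matrix with NaNs marking bad/missing values.
X = pd.DataFrame({
    "demand_mwh": [1000.0, np.nan, 1100.0, 1050.0, np.nan, 980.0],
    "temperature": [20.0, 22.0, 21.0, 19.0, 18.0, 17.0],
    "hour_of_day": [0, 1, 2, 3, 4, 5],
})

pipeline = Pipeline([
    # An outlier-flagging transformer could go here first, setting suspect
    # values to NaN so the imputer treats them as missing.
    ("impute", IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)),
])

X_filled = pipeline.fit_transform(X)
```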
Regarding the point that some generators are only required to report during certain seasons: that reporting requirement has also changed over time and varies across states, driven by changes in regulations and reporting rules.
I recall the CEMS data having a small number of very bad outliers. We're already doing a few corrections, e.g. this one for gross_load_mw, and filling NAs with zero in gross_load_mw and heat_content_mmbtu.
@karldw from what you've seen, where gross_load_mw is missing, is there a heat_input_mmbtu that is available, or are they typically missing together? I'm thinking that missing gross_load_mw could be filled by developing a heat rate model for each plant based on historic data.
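As a rough illustration of what I mean (hypothetical column names, and a simple per-plant median heat rate rather than anything actually fitted):

```python
import pandas as pd

def fill_gross_load_from_heat_rate(cems: pd.DataFrame) -> pd.DataFrame:
    """Fill missing gross_load_mw using each plant's historical heat rate.

    Assumes hourly rows with 'plant_id', 'gross_load_mw', 'heat_input_mmbtu'.
    """
    df = cems.copy()

    # Use only hours where both load and heat input were reported.
    observed = df.dropna(subset=["gross_load_mw", "heat_input_mmbtu"])
    observed = observed[observed["gross_load_mw"] > 0]
    observed = observed.assign(
        heat_rate=observed["heat_input_mmbtu"] / observed["gross_load_mw"]
    )

    # Median MMBtu per MWh for each plant.
    heat_rate = (
        observed.groupby("plant_id")["heat_rate"]
        .median()
        .rename("heat_rate_mmbtu_per_mwh")
        .reset_index()
    )

    df = df.merge(heat_rate, on="plant_id", how="left")
    estimate = df["heat_input_mmbtu"] / df["heat_rate_mmbtu_per_mwh"]
    df["gross_load_mw"] = df["gross_load_mw"].fillna(estimate)
    return df.drop(columns="heat_rate_mmbtu_per_mwh")
```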
@grgmiller, they're very often observations that report zero operating hours. See this conversation: https://github.com/catalyst-cooperative/pudl/issues/171#issuecomment-449230223