catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Create guidelines for dealing with derived values #191

Closed: zaneselvans closed this issue 4 years ago

zaneselvans commented 6 years ago

During ETL, PUDL pulls in raw data directly from a variety of sources, attempts to correct or at least identify reporting errors, normalizes and restructures the data for easy-to-maintain storage in a relational database, and adds glue to connect the different data sources together. There's a wide variety of interesting and useful quantities that can be derived from these data, and we need to decide how to make them accessible to users. We need to do this both internally, for our own sanity, and also so that contributors and users know how things are meant to work.

There are at least three options here:

  1. Store the derived values directly in the DB, calculating them during ETL.
  2. Use PostgreSQL views that contain the derived values, calculating them as needed, with the views potentially created during ETL.
  3. Create an output layer that derives the values and provides them to the user in a dataframe.

Storing derived values in the PUDL DB

Using Postgresql views

Creating an output layer

Sniffing the Glue

The "glue" that holds the different data sources together seems like a different kind of derived value -- it's novel information about the structure of the data we're storing and how they relate to each other, and it's a big value add overall. Having it baked into the DB at ETL (as we're now doing with the EIA boiler-generator associations) seems like a very valuable thing, even if it is a bit computationally intensive.

Input please!

From the above I'm sure you can tell that I (Zane) prefer the output layer option, but I know @cmgosnell has different feelings and experiences, and we've had many conversations about it. Whatever we do, I want to hear from everyone else working on this aspect of the project so we can get on the same page about it (especially @karldw, and maybe also @alanawlsn & @gschivley). If anyone wants to respond here with other pros and cons for the above options, or other options altogether, that would be great!

Deliverables?

I think the outcome I'm looking for from this discussion is a design guideline document here in the repository, explaining to contributors and users what kinds of data we store in the database and how we provide access to derived quantities. That could be part of a larger document like the one @gschivley asked for a while ago, outlining the process for integrating new data sources into the project.

zaneselvans commented 6 years ago

Note that we have a baby output layer right now, defined in output.py via an output object that knows how to calculate a bunch of derived values -- it calculates only the ones you ask for, as needed, and caches the results as internal data members. We created this for the MCOE calculation, and it's capable of pulling a bunch of the base tables, annotating and outputting multi-sheet workbooks, calculating heat rates, fuel costs, and MCOE at the generator level, aggregating the outputs at timescales other than the reported frequencies, etc. What I'm suggesting would at least initially build upon that and create some kind of guideline for adding new outputs. I've just cleaned up the output object methods a little bit and am testing (see issue #164).
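
To make the pattern concrete, here's a minimal sketch of what such a lazy, caching output object looks like -- the class, table, and column names are made up for illustration, not the actual output.py interface:

```python
import pandas as pd
import sqlalchemy as sa

class PudlOutput:
    """Compute derived values only when asked, and cache them for reuse."""

    def __init__(self, engine, freq=None):
        self.engine = engine    # SQLAlchemy engine connected to the PUDL DB
        self.freq = freq        # optional pandas frequency string for re-aggregation
        self._heat_rate = None  # cached derived table, filled in on first request

    def heat_rate(self):
        """Generator-level heat rates, calculated on demand and cached."""
        if self._heat_rate is None:
            # Table and column names below are placeholders for the real base tables.
            fuel = pd.read_sql("SELECT * FROM fuel_consumption", self.engine)
            gen = pd.read_sql("SELECT * FROM generation", self.engine)
            df = gen.merge(fuel, on=["plant_id", "report_date"], how="left")
            df["heat_rate_mmbtu_mwh"] = (
                df["fuel_consumed_mmbtu"] / df["net_generation_mwh"]
            )
            self._heat_rate = df
        return self._heat_rate
```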

karldw commented 6 years ago

It sounds like the thing that makes the most sense is the output layer, but let me just put in a slightly less dismal viewpoint on views.

With either the output layer or views, I think it makes sense to calculate things on demand, when the user needs them. That means using un-materialized views, which don't have to be manually recalculated when the underlying tables change. There are SQLAlchemy add-ons that handle views. The example on that page just defines the query as an SQL string, rather than via SQLAlchemy operations, but there might be deeper support than that.
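
As a hedged sketch of what the un-materialized view approach could look like (the table, column, and view names are invented for illustration; only the SQLAlchemy calls themselves are real):

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/pudl")  # placeholder URL

# A plain (un-materialized) view: Postgres re-runs the SELECT every time the
# view is queried, so the derived column never goes stale when base tables change.
create_heat_rate_view = sa.DDL("""
    CREATE OR REPLACE VIEW generator_heat_rate AS
    SELECT plant_id,
           generator_id,
           report_date,
           fuel_consumed_mmbtu / NULLIF(net_generation_mwh, 0) AS heat_rate_mmbtu_mwh
    FROM generation_fuel
""")

with engine.begin() as conn:
    conn.execute(create_heat_rate_view)
```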

gschivley commented 6 years ago

My only concern is that knowledge of the necessary transformations might be lost if only the original data is stored. From my experience with CEMS, it would be things like multiplying gross load by op_time to get generation within an hour, or the fact that all times are reported in local standard time. I'm not sure of the best way to deal with things like this.
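
In pandas terms, that derived quantity is just one multiplication, but it's exactly the kind of step that gets lost if it isn't written down somewhere. A tiny illustration (column names are assumed to follow the CEMS convention described above):

```python
import pandas as pd

# Hypothetical hourly CEMS records: gross load in MW and the fraction of the
# hour the unit was actually operating.
cems = pd.DataFrame({
    "gross_load_mw": [450.0, 460.0, 0.0],
    "op_time": [1.0, 0.5, 0.0],
})

# Gross generation within the hour = gross load * fraction of the hour operated.
cems["gross_generation_mwh"] = cems["gross_load_mw"] * cems["op_time"]
```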

zaneselvans commented 6 years ago

Obviously these lines are fuzzy, but I think doing things like adjusting the reported times to be in UTC is something we'd want to do in the ETL process -- it feels more like a units conversion or standardization of reported information than the derivation of a new value. Something like multiplying op_time by gross load, on the other hand, feels like a calculation that creates new information, which we'd certainly want to include in any compilation of the CEMS data, but which would break the normalization of the table.
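
For the timestamp case, a hedged sketch of the kind of ETL-time standardization I mean -- assuming the times really are reported in a fixed local standard time with no DST shifts:

```python
import pandas as pd

df = pd.DataFrame(
    {"op_datetime": pd.to_datetime(["2018-07-01 01:00", "2018-07-01 02:00"])}
)

# Localize to a fixed-offset zone (Etc/GMT+5 means UTC-5; the sign convention
# is reversed), then convert to UTC so all sources share one time basis.
df["op_datetime_utc"] = (
    df["op_datetime"].dt.tz_localize("Etc/GMT+5").dt.tz_convert("UTC")
)
```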

@alanawlsn also wants to compile a library of annotations/metadata for all the commonly output columns: whether they are reported or derived, what data source they came from, what units they're in, and how they were calculated. Right now that info can be output alongside the tabular output in Excel workbooks (which is what the folks we've been handing data off to have wanted to work with), but the idea is that we'd have that metadata available for annotating whatever format we're exporting to.
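
One simple way to carry those annotations around would be a plain dictionary keyed by output column name -- every field and value here is illustrative, not an existing PUDL structure:

```python
# Hypothetical column metadata library; could be dumped to an extra sheet in
# the Excel workbooks or attached to whatever export format we use.
column_metadata = {
    "heat_rate_mmbtu_mwh": {
        "status": "derived",
        "source": "EIA 923 + EIA 860",
        "units": "MMBtu/MWh",
        "calculation": "fuel_consumed_mmbtu / net_generation_mwh, per generator",
    },
    "gross_load_mw": {
        "status": "reported",
        "source": "EPA CEMS",
        "units": "MW",
        "calculation": None,
    },
}
```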

I wish there were some obvious way to integrate metadata into pandas dataframes, but so far I haven't come across it.