catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Create guidelines for dealing with derived values #191

Closed: zaneselvans closed this issue 4 years ago

zaneselvans commented 6 years ago

During ETL, PUDL pulls in raw data directly from a variety of sources, attempts to correct or at least identify reporting errors, normalizes and restructures the data for easy-to-maintain storage in a relational database, and adds glue to connect the different data sources together. There's a wide variety of interesting and useful quantities that can be derived from these data, and we need to decide how to make them accessible to users. We need to do this both internally, for our own sanity, and also so that contributors and users know how things are meant to work.

There are at least three options here:

  1. Store the derived values directly in the DB, calculating them during ETL.
  2. Use PostgreSQL views that contain the derived values, calculating them as needed, with the views potentially created during ETL.
  3. Create an output layer that derives the values and provides them to the user in a dataframe.

Storing derived values in the PUDL DB

Using Postgresql views

Creating an output layer

Sniffing the Glue

The "glue" that holds the different data sources together seems like a different kind of derived value -- it's novel information about the structure of the data we're storing and how they relate to each other, and it's a big value add overall. Having it baked into the DB at ETL (as we're now doing with the EIA boiler-generator associations) seems like a very valuable thing, even if it is a bit computationally intensive.

Input please!

From the above I'm sure you can tell that I (Zane) prefer the output layer option, but I know @cmgosnell has different feelings and experiences, and we've had many conversations about it. Whatever we do, I want to hear from everyone else working on this aspect of the project so we can get on the same page about it (especially @karldw, and maybe also @alanawlsn & @gschivley). If anyone wants to respond here with other pros and cons for the above options, or other options altogether, that would be great!

Deliverables?

I think the outcome I'm looking for from this discussion is a design guideline document here in the repository, explaining to contributors and users what kinds of data we store in the database and how we provide access to derived quantities. That could be part of a larger document like the one @gschivley asked for a while ago, outlining the process for integrating new data sources into the project.

zaneselvans commented 6 years ago

Note that we have a baby output layer right now, defined in output.py via an output object that knows how to calculate a bunch of derived values -- it calculates only the ones you ask for, as needed, and caches the results as internal data members. We created this for the MCOE calculation, and it's capable of pulling a bunch of the base tables, annotating and outputting multi-sheet workbooks, calculating heat rates, fuel costs, and MCOE at the generator level, aggregating the outputs at timescales other than the reported frequencies, etc. What I'm suggesting would at least initially build upon that and create some kind of guideline for adding new outputs. I've just cleaned up the output object methods a little bit and am testing (see issue #164).
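
To make the pattern concrete, here's a minimal sketch of what such a lazy, caching output object looks like -- the class, table, and column names are made up for illustration, not the actual output.py interface:

```python
import pandas as pd
import sqlalchemy as sa

class PudlOutput:
    """Compute derived values only when asked, and cache them for reuse."""

    def __init__(self, engine, freq=None):
        self.engine = engine    # SQLAlchemy engine connected to the PUDL DB
        self.freq = freq        # optional pandas frequency string for re-aggregation
        self._heat_rate = None  # cached derived table, filled in on first request

    def heat_rate(self):
        """Generator-level heat rates, calculated on demand and cached."""
        if self._heat_rate is None:
            # Table and column names below are placeholders for the real base tables.
            fuel = pd.read_sql("SELECT * FROM fuel_consumption", self.engine)
            gen = pd.read_sql("SELECT * FROM generation", self.engine)
            df = gen.merge(fuel, on=["plant_id", "report_date"], how="left")
            df["heat_rate_mmbtu_mwh"] = (
                df["fuel_consumed_mmbtu"] / df["net_generation_mwh"]
            )
            self._heat_rate = df
        return self._heat_rate
```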

karldw commented 6 years ago

It sounds like the thing that makes the most sense is the output layer, but let me just put in a slightly less dismal viewpoint on views.

With either the output layer or views, I think it makes sense to calculate things on demand, when the user needs them. That means using un-materialized views, which don't have to be manually recalculated when the underlying tables change. There are SQLAlchemy add-ons that handle views. The example on that page just defines the query as an SQL string, rather than via SQLAlchemy operations, but there might be deeper support than that.
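
As a hedged sketch of what the un-materialized view approach could look like (the table, column, and view names are invented for illustration; only the SQLAlchemy calls themselves are real):

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/pudl")  # placeholder URL

# A plain (un-materialized) view: Postgres re-runs the SELECT every time the
# view is queried, so the derived column never goes stale when base tables change.
create_heat_rate_view = sa.DDL("""
    CREATE OR REPLACE VIEW generator_heat_rate AS
    SELECT plant_id,
           generator_id,
           report_date,
           fuel_consumed_mmbtu / NULLIF(net_generation_mwh, 0) AS heat_rate_mmbtu_mwh
    FROM generation_fuel
""")

with engine.begin() as conn:
    conn.execute(create_heat_rate_view)
```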

gschivley commented 6 years ago

My only concern is that knowledge of the necessary transformations might be lost if only the original data is stored. From my experience with CEMS, it would be things like multiplying gross load by op_time to get generation within an hour, or the fact that all times are reported in local standard time. I'm not sure of the best way to deal with things like this.
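
In pandas terms, that derived quantity is just one multiplication, but it's exactly the kind of step that gets lost if it isn't written down somewhere. A tiny illustration (column names are assumed to follow the CEMS convention described above):

```python
import pandas as pd

# Hypothetical hourly CEMS records: gross load in MW and the fraction of the
# hour the unit was actually operating.
cems = pd.DataFrame({
    "gross_load_mw": [450.0, 460.0, 0.0],
    "op_time": [1.0, 0.5, 0.0],
})

# Gross generation within the hour = gross load * fraction of the hour operated.
cems["gross_generation_mwh"] = cems["gross_load_mw"] * cems["op_time"]
```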

zaneselvans commented 6 years ago

Obviously these lines are fuzzy, but I think doing things like adjusting the reported times to be in UTC is something we'd want to do in the ETL process -- it feels more like a units conversion or standardization of reported information than the derivation of a new value. Something like multiplying op_time by gross load, on the other hand, feels like a calculation that creates new information, which we'd certainly want to include in any compilation of the CEMS data, but which would break the normalization of the table.
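
For the timestamp case, a hedged sketch of the kind of ETL-time standardization I mean -- assuming the times really are reported in a fixed local standard time with no DST shifts:

```python
import pandas as pd

df = pd.DataFrame(
    {"op_datetime": pd.to_datetime(["2018-07-01 01:00", "2018-07-01 02:00"])}
)

# Localize to a fixed-offset zone (Etc/GMT+5 means UTC-5; the sign convention
# is reversed), then convert to UTC so all sources share one time basis.
df["op_datetime_utc"] = (
    df["op_datetime"].dt.tz_localize("Etc/GMT+5").dt.tz_convert("UTC")
)
```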

@alanawlsn also wants to compile a library of annotations/metadata for all the commonly output columns: whether they are reported or derived, what data source they came from, what units they're in, and how they were calculated. Right now that info can be output alongside the tabular output in Excel workbooks (which is what the folks we've been handing data off to have wanted to work with), but the idea is that we'd have that metadata available for annotating whatever format we're exporting to.
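
One simple way to carry those annotations around would be a plain dictionary keyed by output column name -- every field and value here is illustrative, not an existing PUDL structure:

```python
# Hypothetical column metadata library; could be dumped to an extra sheet in
# the Excel workbooks or attached to whatever export format we use.
column_metadata = {
    "heat_rate_mmbtu_mwh": {
        "status": "derived",
        "source": "EIA 923 + EIA 860",
        "units": "MMBtu/MWh",
        "calculation": "fuel_consumed_mmbtu / net_generation_mwh, per generator",
    },
    "gross_load_mw": {
        "status": "reported",
        "source": "EPA CEMS",
        "units": "MW",
        "calculation": None,
    },
}
```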

I wish there were some obvious way to integrate metadata into pandas dataframes, but so far I haven't come across it.