zaneselvans commented 3 years ago

Description

Currently we provide access to more human-readable denormalized outputs using software routines. This adds a layer of complexity and requires users to use Python. It's also kind of slow. Instead, for simple derived values and denormalized tables we can provide this type of output by defining views (stored queries) inside the databases we generate and distribute.

Motivation

We want to maximize the ability of users to access useful outputs while minimizing dependence on particular software and platforms.
Reduce the complexity of the system that users have to learn in order to access the data. They would only need to download the database, rather than downloading the database and then running our software or writing their own SQL queries to construct a readable table of commonly used data.
Provide faster access to these outputs by using SQL inside the DB rather than Pandas to construct the denormalized tables.
How does it align with our cooperative goals or grant goals?
Can you think of any reasons why we should not pursue or prioritize this project?
Is this project blocking something?
Who is this for?

In Scope

Database views which replace all the PudlTabl output methods corresponding to individual data tables like fuel_receipts_costs_eia923 or plants_steam_ferc1.
Unaggregated, as well as monthly and annual, aggregations of our data tables, as currently provided by the PudlTabl class.
Joining of entity attributes with data tables based on well-defined foreign key relationships.
Calculation of simple derived values using arithmetic that we can easily perform in SQL (e.g. multiplying fuel heat content per unit and number of units to get a total heat content).
Adjustment of the more complex derived value calculations (e.g. heat rate estimates, net generation allocations) and value imputations that are done in software, so that they use the new database views.
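As a rough illustration of the in-scope items above, here is a minimal sketch using Python's built-in `sqlite3` module. It defines one view that joins entity attributes onto a data table via a foreign key and computes a simple derived value, and a second view that aggregates the first annually. The table names come from this issue, but every column name, both view names (`frc_denorm`, `frc_annual`), and the sample rows are assumptions made up for the example, not the real PUDL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Illustrative schema only: column names are assumptions, not PUDL's.
    CREATE TABLE plants_entities_eia (
        plant_id_eia INTEGER PRIMARY KEY,
        plant_name_eia TEXT
    );
    CREATE TABLE fuel_receipts_costs_eia923 (
        plant_id_eia INTEGER REFERENCES plants_entities_eia (plant_id_eia),
        report_date TEXT,
        fuel_received_units REAL,
        fuel_mmbtu_per_unit REAL
    );
    INSERT INTO plants_entities_eia VALUES (1, 'Plant A'), (2, 'Plant B');
    INSERT INTO fuel_receipts_costs_eia923 VALUES
        (1, '2020-01-01', 100.0, 20.0),
        (1, '2020-02-01', 50.0, 22.0),
        (2, '2020-01-01', 10.0, 1.0);

    -- Denormalized view: join on the foreign key and add a derived column
    -- (heat content per unit times number of units).
    CREATE VIEW frc_denorm AS
    SELECT
        frc.plant_id_eia AS plant_id_eia,
        p.plant_name_eia AS plant_name_eia,
        frc.report_date AS report_date,
        frc.fuel_received_units * frc.fuel_mmbtu_per_unit AS total_fuel_mmbtu
    FROM fuel_receipts_costs_eia923 AS frc
    JOIN plants_entities_eia AS p ON frc.plant_id_eia = p.plant_id_eia;

    -- Annual aggregation built on top of the denormalized view.
    CREATE VIEW frc_annual AS
    SELECT
        plant_id_eia,
        plant_name_eia,
        strftime('%Y', report_date) AS report_year,
        SUM(total_fuel_mmbtu) AS total_fuel_mmbtu
    FROM frc_denorm
    GROUP BY plant_id_eia, plant_name_eia, strftime('%Y', report_date);
    """
)

annual_rows = conn.execute(
    "SELECT * FROM frc_annual ORDER BY plant_id_eia"
).fetchall()
print(annual_rows)
```

Because the views are stored in the database file itself, any SQLite client could run the same `SELECT` without Python.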

Out of Scope

Integration of more complex derived values directly into the database (e.g. heat rate estimates, net generation allocations).
Imputation of missing values or the use of external data sources (like the EIA API).

Breaking API Changes

Initially we can modify the PudlTabl class to access the database views directly rather than doing its own calculations and joins, but in the long run as we move to providing access via Intake catalogs or the DB directly, we will probably want to deprecate this access method.
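One way the interim step might look is sketched below: an output method that simply selects from a view instead of doing its own joins. `ViewBackedOutputs` is a hypothetical stand-in written for this example, not the real `PudlTabl` class, and the schema, column names, and sample row are assumptions. The `gens_eia860` view name follows the output mentioned elsewhere in this thread.

```python
import sqlite3


class ViewBackedOutputs:
    """Hypothetical stand-in for an output class that reads from DB views."""

    def __init__(self, conn):
        self.conn = conn
        # Return rows that can be converted to dictionaries by column name.
        self.conn.row_factory = sqlite3.Row

    def _read_view(self, view_name):
        # Only read names that actually exist as views in this database.
        known = {
            row["name"]
            for row in self.conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'view'"
            )
        }
        if view_name not in known:
            raise ValueError(f"No such view: {view_name}")
        cursor = self.conn.execute(f'SELECT * FROM "{view_name}"')
        return [dict(row) for row in cursor]

    def gens_eia860(self):
        # No joins or calculations here: the view already did the work.
        return self._read_view("gens_eia860")


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Illustrative schema only: column names are assumptions.
    CREATE TABLE plants_entities_eia (
        plant_id_eia INTEGER PRIMARY KEY,
        plant_name_eia TEXT
    );
    CREATE TABLE generators_eia860 (
        plant_id_eia INTEGER REFERENCES plants_entities_eia (plant_id_eia),
        generator_id TEXT,
        capacity_mw REAL
    );
    INSERT INTO plants_entities_eia VALUES (1, 'Plant A');
    INSERT INTO generators_eia860 VALUES (1, 'G1', 50.0);

    CREATE VIEW gens_eia860 AS
    SELECT
        g.plant_id_eia AS plant_id_eia,
        g.generator_id AS generator_id,
        g.capacity_mw AS capacity_mw,
        p.plant_name_eia AS plant_name_eia
    FROM generators_eia860 AS g
    LEFT JOIN plants_entities_eia AS p ON g.plant_id_eia = p.plant_id_eia;
    """
)

outputs = ViewBackedOutputs(conn)
gens = outputs.gens_eia860()
print(gens)
```

A method like this could presumably return a DataFrame instead (for example via `pandas.read_sql`) to keep the existing interface, which would make later deprecation mostly a matter of pointing users at the views themselves.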

cmgosnell commented 1 year ago

I'm curious if we could use the database schema to generate a lot of these de-normalized output tables. This might be a more complicated approach that I'm definitely not attached to, but it keeps worming around in my head.

If most of these de-normalized tables in the outputs right now are just merging additional tables into each core table based on FK relationships (e.g. merging plants_entities_eia into the generators_eia860 table on plant_id_eia for the gens_eia860 output), it seems like we could probably generate all of these merges based on the DB schema.

A complication I see here is which columns from each of these merged tables to actually keep. If we are planning on generating metadata resources for each of these output tables, this problem could be pretty easily solved by enforce_schema.
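To make the idea concrete, here is a rough sketch of generating a denormalized view from the foreign keys that SQLite itself reports through `PRAGMA foreign_key_list`. The helper `denormalized_view_sql` and its `keep_columns` argument are hypothetical. The explicit column list stands in for whatever the metadata resources or `enforce_schema` would supply, and the schema and sample rows are assumptions.

```python
import sqlite3
from collections import defaultdict


def denormalized_view_sql(conn, table, view_name, keep_columns):
    """Build CREATE VIEW SQL joining `table` to each table its FKs reference.

    keep_columns is a list of (table_name, column_name) pairs to expose.
    Limitation of this sketch: two FKs pointing at the same parent table
    would need table aliases, which are not handled here.
    """
    fk_parent = {}
    fk_columns = defaultdict(list)
    # Each PRAGMA row is (id, seq, table, from, to, on_update, on_delete, match);
    # a composite key appears as several rows sharing the same id.
    for fk_id, _seq, parent, child_col, parent_col, *_rest in conn.execute(
        f'PRAGMA foreign_key_list("{table}")'
    ):
        fk_parent[fk_id] = parent
        fk_columns[fk_id].append((child_col, parent_col))

    joins = []
    for fk_id in sorted(fk_parent):
        parent = fk_parent[fk_id]
        conditions = " AND ".join(
            f'"{table}"."{child}" = "{parent}"."{par}"'
            for child, par in fk_columns[fk_id]
        )
        joins.append(f'LEFT JOIN "{parent}" ON {conditions}')

    select_list = ", ".join(f'"{t}"."{c}" AS "{c}"' for t, c in keep_columns)
    return (
        f'CREATE VIEW "{view_name}" AS SELECT {select_list} '
        f'FROM "{table}" ' + " ".join(joins)
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Illustrative schema only: column names are assumptions.
    CREATE TABLE plants_entities_eia (
        plant_id_eia INTEGER PRIMARY KEY,
        plant_name_eia TEXT,
        state TEXT
    );
    CREATE TABLE generators_eia860 (
        plant_id_eia INTEGER REFERENCES plants_entities_eia (plant_id_eia),
        generator_id TEXT,
        capacity_mw REAL
    );
    INSERT INTO plants_entities_eia VALUES (1, 'Plant A', 'CO');
    INSERT INTO generators_eia860 VALUES (1, 'G1', 50.0);
    """
)

view_sql = denormalized_view_sql(
    conn,
    table="generators_eia860",
    view_name="gens_eia860",
    keep_columns=[
        ("generators_eia860", "plant_id_eia"),
        ("generators_eia860", "generator_id"),
        ("plants_entities_eia", "plant_name_eia"),
    ],
)
conn.execute(view_sql)
view_rows = conn.execute("SELECT * FROM gens_eia860").fetchall()
print(view_rows)
```

The joins fall out of the schema automatically. The part that still needs a human decision (or metadata) is the `keep_columns` list, which is the complication raised above.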

bendnorman commented 1 year ago

I've moved most of this content to #1973. @cmgosnell could you repost the comment above in #1973? Thanks!

catalyst-cooperative / pudl

Add Database Views #1178