catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Create design doc for naming convention of PUDL tables #2517

Closed · bendnorman closed this issue 10 months ago

bendnorman commented 1 year ago

It sounds like users favor ripping out PudlTabl ASAP and accessing tables directly from the database (#2503). Given we won't have PudlTabl methods to act as aliases for the actual table names, we need to decide on a naming convention for our denormalized tables. We'll likely also have to update the normalized table names, because the ideal denormalized table names might conflict with the existing normalized table names. This is a scary change!

### Scope
- [x] what the naming conventions are
- [x] why we decided on them
- [x] how we will implement/roll out the changes
### Tasks
- [x] draft a proposal document that covers: "what is the problem we're trying to solve? how are we solving it? why did we choose this way? what are some other approaches we considered and didn't do?"
- [x] circulate for feedback: what worries you about this approach? what excites you?
- [x] incorporate feedback / adapt plan
- [x] get sign-off from whole team
- [ ] Create a github discussion that summarizes the design doc

### Questions

e-belfer commented 1 year ago

Some of the discussion in #2275 should be relevant here. Right now, IIRC, the normalized tables are the only tables which don't have some kind of qualifying suffix (e.g. a name like boiler_fuel_eia923 refers to the normalized table).

In terms of updating naming conventions, the name should convey the following four pieces of information:

  1. Table name: it should be obvious to users familiar with FERC/EIA data what the table's analogue in the raw data is, if there is one. If not, the theme of its content should be relatively clear from the name (e.g. plant_parts_eia).
  2. Data source: making it clear what source the data originates from (e.g. EIA 860, or EIA if combining data from multiple forms)
  3. Stage of processing: raw, clean, normalized, denormalized
  4. Any other modifications, if we decide to save multiple variations of a table (e.g. imputed vs. not, annual vs monthly).

While doing this, we should aim for the following:

I'd propose the following, which is a bit of a departure from our current naming protocol in the DAG, but I think it would be clearest: {tablename}_{source}_{processing_stage}_{extras}, e.g. plants_steam_ferc1_denorm or generation_eia923_denorm_annual.
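
To make that concrete, here's a tiny hypothetical helper (not existing PUDL code; the function name and signature are made up for illustration) showing how a name would be assembled from those four chunks:

```python
def table_name(
    tablename: str, source: str, processing_stage: str = "", extras: str = ""
) -> str:
    """Assemble a {tablename}_{source}_{processing_stage}_{extras} style name."""
    parts = [tablename, source, processing_stage, extras]
    # Drop any chunks that don't apply to this table.
    return "_".join(part for part in parts if part)


table_name("plants_steam", "ferc1", "denorm")
# -> "plants_steam_ferc1_denorm"
table_name("generation", "eia923", "denorm", "annual")
# -> "generation_eia923_denorm_annual"
```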

Pros of this approach:

e-belfer commented 1 year ago

One amendment to the above: the denormalized tables are the ones that will be most widely used. We'll want to use the simplest and clearest naming convention for them, so this may mean we want to drop the denorm suffix here (rather than what we currently do, which is drop it for the normalized tables).

bendnorman commented 1 year ago

This seems like an accurate assessment of our current naming issues @e-belfer; thank you! I agree our "normalized" tables are the only tables / assets lacking a suffix that describes the stage of the data.

Our naming conventions will also affect how we categorize our assets, code, tables and metadata. There are a few ways we are grouping assets:

I could see us adopting a file structure organized by data source and processing stage:

```
.
└── src/
    └── pudl/
        ├── eia/
        │   ├── eia860/
        │   │   ├── raw.py
        │   │   └── clean.py
        │   ├── eia923/
        │   │   ├── raw.py
        │   │   └── clean.py
        │   ├── norm.py
        │   └── denorm.py
        ├── ferc/
        │   ├── ferc1/
        │   │   ├── raw.py
        │   │   └── norm.py
        │   ├── denorm.py
        │   └── ...
        ├── epa/
        │   └── ...
        └── glue/
            └── ...
```
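
If we went this route, collecting assets into Dagster could follow the subpackage structure. A minimal sketch, assuming hypothetical pudl.eia and pudl.ferc subpackages like those above (not our current code layout):

```python
import pudl.eia   # hypothetical subpackages matching the layout above
import pudl.ferc
from dagster import Definitions, load_assets_from_package_module

defs = Definitions(
    assets=[
        # Asset group names fall out of the subpackage structure.
        *load_assets_from_package_module(pudl.eia, group_name="eia"),
        *load_assets_from_package_module(pudl.ferc, group_name="ferc"),
    ],
)
```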
bendnorman commented 1 year ago

We should base our table naming convention on standard data modeling techniques. Here are a few questions we should answer:

  1. What types of tables do we want our users to interact with? Should we have wide tables for each data source and entity? Should we create wide tables that integrate data from all sources for each entity?
  2. Does our data subscribe to a standard data warehouse modeling method now?
  3. Can our normalized and denormalized tables be further categorized? Can we use these categories for table and asset group naming conventions?
bendnorman commented 1 year ago

I think our normalized data roughly follows the popular Star Schema. The star schema organizes data into fact and dimension tables.

The fact table in a star schema contains the measures or metrics that are of interest to the user or organization. For example, in a sales data warehouse, the fact table might contain sales revenue, units sold, and profit margins. Each record in the fact table represents a specific event or transaction, such as a sale or order.

The dimension tables in a star schema contain the descriptive attributes of the measures in the fact table. These attributes are used to slice and dice the data in the fact table, allowing users to analyze the data from different perspectives. For example, in a sales data warehouse, the dimension tables might include product, customer, time, and location.

We have tables where each row measures a specific event (a report from a generator, plant, utility, or respondent), similar to a fact table. Our entity tables contain attributes that describe the events in the fact tables, similar to dimension tables. For example, the generators_eia860 table contains annually varying generator attributes (fact) and the generators_entity_eia table contains static generator attributes (dimension). Most FERC Form 1 tables are fact tables where each row is a report, and utilities_ferc1 is a dimension table that describes the utilities in those reports.
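
As an illustrative (not prescriptive) example of treating those as fact and dimension tables, a star-schema style query is just a join of the fact rows onto their dimension attributes; the SQLite path and join keys below are assumptions:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")  # path is an assumption

# Fact-like table: annually varying generator attributes.
facts = pd.read_sql("SELECT * FROM generators_eia860", conn)
# Dimension-like table: static generator attributes.
dims = pd.read_sql("SELECT * FROM generators_entity_eia", conn)

# "Slice and dice" the facts by merging in the descriptive dimension attributes.
denorm = facts.merge(dims, on=["plant_id_eia", "generator_id"], how="left")
```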

These fact and dimension tables can then be combined to create various derived tables that summarize and aggregate the data. I still need to find more examples/documentation on derived tables to recommend naming conventions for all of our "denormalized" tables.

I'm trying to categorize our existing tables to confirm they map to star schema concepts. We discussed how these topics apply to PUDL in a github discussion from September 2022.

zaneselvans commented 1 year ago

Slowly Varying Dimensions

A lot of our dimensions are slowly changing. More dimension-like than fact-like, but not truly static, so we end up with almost all of the generator / boiler / plant attributes showing up in the annual tables rather than the entity tables. How do those kinds of tables typically fit into data warehouse schemas?

Organizing Assets

I guess we're really trying to come up with an asset naming convention, with a subset of the assets also being database tables (and other assets possibly becoming database tables at some future date if they're clearly useful in that context).

What is the intended method of organizing / grouping assets in Dagster? Initially I thought that the asset_group was going to be flexible and easy to use, but it seems like it's mostly intended to be a coarse DAG visualization tag, and the tooling is more robust around module- and subpackage-based groupings, which can be hierarchically organized and are easy to select groups of assets from. This also seems to be in line with Dagster's recommended project layout. I think I may have been confused about how assets are supposed to be accessed and organized by their fully featured project example, which has 3 asset groups that map directly to 3 subpackages, with no more complicated nesting or relationships.

If it's going to be easiest to work with and refer to assets based on the module / subpackage they're defined within, then the kind of data source and processing stage hierarchy that Ben outlined above seems desirable. At least in my work so far, I've much more frequently wanted to work with a particular data source end-to-end than with a particular processing stage. This would invert our current module hierarchy, but that's a change I've wanted to make for a long time, in part because much of our helper code and tooling is dataset specific, and this would allow us to keep e.g. the DBF extraction code close to the FERC data and the EIA spreadsheet-specific cleaning functions close to the EIA data. I also think we have more natural hierarchies within the data sources than within the linear cleaning pipelines: we have a bunch of EIA or FERC data sources which can be grouped together and share many quirks.

New Table Names

Alongside the 4 naming chunks in {tablename}_{source}_{processing_stage}_{extras} I think we're also facing a big increase in the number and diversity of tablename values we're dealing with, as we integrate all of our former output & analysis tables into the DB directly. With the net generation allocation alone I ended up with assets that...

So we also need to think a bit about how we're going to come up with legible and concise names for the tables themselves.

Combining Skinny Derived Tables

The current work to translate our output layers into Dagster abstractions and the database isn't trying to change how we actually organize the processing, so that the transition has a limited, achievable scope. Looking forward, I think we might want to do some of that re-organization so that we do a better job of organizing the new fact and dimension tables produced through analysis, and only combine these building blocks into wide, denormalized tables intended for direct use by most people at the end of the process.

Is there any way that we can prepare ourselves for doing that kind of disentanglement later without creating huge disruptions?

For example, right now the MCOE output table contains per-generator estimates of: capacity factor, heat rate, and fuel cost per MWh. Each of those variables has its own output method and there are dependencies between them, but the pudl_out.mcoe() method also brings in a large number of generator attributes and fills in missing per-generator timesteps, creating a very large denormalized dataframe. If the individual components are stored as separate fact tables in the database, they'll be easier to mix and match discretely when creating new assets. In the case of the monthly MCOE output, the full table has about 5.2M rows, but only 4.4M of them have capacity factor, 435K have heat rate, and 300K have a $/MWh estimate associated with them. This is because most generators don't have the data required to calculate heat rate or fuel cost, but we're creating complete monthly time-series for all of them and merging in a bunch of generator dimension table attributes which vary at most annually.
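
A rough sketch (not the actual PUDL implementation) of how separately stored skinny fact tables could be recombined into the wide MCOE output at the very end; the table variables and join keys are assumptions for illustration:

```python
import pandas as pd

# Keys assumed to be shared by the per-generator monthly fact tables.
GEN_KEYS = ["plant_id_eia", "generator_id", "report_date"]


def wide_mcoe(capacity_factor, heat_rate, fuel_cost, generator_attrs):
    """Combine skinny per-generator fact tables into one wide MCOE output."""
    skinny = (
        capacity_factor  # ~4.4M rows with a capacity factor estimate
        .merge(heat_rate, on=GEN_KEYS, how="outer")  # ~435K rows with heat rate
        .merge(fuel_cost, on=GEN_KEYS, how="outer")  # ~300K rows with a $/MWh estimate
    )
    # Only at the very end merge in the (at most annually varying) generator
    # dimension attributes to produce the denormalized table users see.
    return skinny.merge(
        generator_attrs, on=["plant_id_eia", "generator_id"], how="left"
    )
```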

So right now we have to define a long & wide MCOE asset to preserve the current processing arrangement and expected outputs, and maybe that MCOE table is part of the data API that we highlight for users, but in the future the way it is constructed can change to keep the data warehouse better organized, without dramatically changing what the user sees?

I think a similar pattern may be useful in the cases where we are filling in / estimating missing values in the normalized fact tables (e.g. in the case of the EIA fuel price estimates in the FRC table). Being able to separately address the original facts and the augmented facts will help us label them appropriately and make it possible to use one or the other if need be in downstream applications.

I think this is mostly about having well-organized and modular tables within the data warehouse. My impression is that most users don't want to think too hard about these individual components, and will go straight for the most complete denormalized tables for day-to-day usage. In that context, I think we'd probably end up with tables that look like "monthly generator facts" or "annual generator facts": all of the time-varying values which are reported on the basis of generator ID, and which can easily be merged with any subset of the attributes that appear in the generator dimension tables (generator entity, or generator annual). Maybe it also makes sense to just go ahead and do an exhaustive merge for folks, so they have the utility (operator), plant, and generator information all there in the denormalized time-varying generator fact table, but that would be a monster table with something like 200 columns. Is that normal / desirable?

bendnorman commented 1 year ago

Slowly Varying Dimensions

Oh I didn't realize how many of the fields in our annual tables are mostly static. There are three common methods for handling slowly changing dimensions in data warehouses. Type 2 is the only option that preserves the entire history of a given entity.

Type 2 adds a new row for each version of the entity, with start and end dates on each record. A surrogate key is added to uniquely identify the record, because the natural key will not be unique if there are multiple versions. The surrogate key is then used in the fact table.
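
For illustration only (not a real PUDL table), a Type 2 generator dimension might look like this, with a surrogate key for fact tables to reference and validity dates bounding each version of the record:

```python
import pandas as pd

generators_dim = pd.DataFrame(
    {
        "generator_sk": [1, 2],            # surrogate key referenced by fact tables
        "plant_id_eia": [3, 3],            # natural key repeats across versions...
        "generator_id": ["1", "1"],        # ...so it can't serve as the primary key alone
        "prime_mover_code": ["ST", "CT"],  # the slowly changing attribute
        "valid_from": ["2001-01-01", "2015-01-01"],
        "valid_to": ["2014-12-31", None],  # open-ended end date = current version
    }
)
```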

After learning more about the star schema model, I don't think we're actually that close to following it. Maybe I'm interpreting the definitions too strictly, but I think we'd have to change a lot about how we model our data to follow this pattern. For example, we'd have to move a bunch of columns from our annual tables to our entity tables and add surrogate keys to follow the slowly changing dimension pattern. Based on what I've read, it also isn't common for dimensions to reference each other, yet some of our entity tables refer to other entity or encoding tables.

Instead of having multiple annual tables as fact tables, we'd probably have a smaller set of fact tables where each row is a report for the most granular entity (generators)?

We could spend some more hours thinking through how to adopt this common pattern for PUDL or we could just focus on desired outputs and naming conventions given our existing modeling.

Organizing Assets

> I guess we're really trying to come up with an asset naming convention, with a subset of the assets also being database tables (and other assets possibly becoming database tables at some future date if they're clearly useful in that context).

Yes! I think that's a good way to frame it.

I'm also excited about inverting our project structure. I think it will make it clearer to folks how to add new data sources: "To integrate a new data source, add a directory for the data source, add some assets in a submodule, and load the assets into the main Definitions object." Would we want the data source metadata to live in the data source subpackage, or should it all live in the pudl.metadata.resources subpackage?

Combining Skinny Derived Tables

WIP: Lots to think about! Will flesh out a response tomorrow!

bendnorman commented 1 year ago

It seems like we want to 1) define a naming convention and 2) decide how to structure the tables we want people to access. I’m not sure in what order we should tackle these. We could:

  1. Settle on a naming convention for our output tables and deprecate the PudlTabl class. Then we can give some more thought to how we model our data and the types of tables we want to distribute. Or...
  2. We rethink our data modeling prior to deciding on a naming convention and deprecating PudlTabl.

Option 1 is probably the fastest path to distributing data, as option 2 might require reworking a lot of our existing assets. However, option 1 means we might need to go through two iterations of name changes.

aesharpe commented 1 year ago

Wow there's a lot to this!

One hot take is that we shouldn't rename tables twice.

Also, have we come to any conclusions about where the data will live? (i.e. in one big happy PUDL db or in separate databases?). Ella mentioned this thread earlier, but I don't think we made any decisions about it. I feel like that might also impact how we decide to name these tables.

bendnorman commented 10 months ago

The full design doc can be found in this public Google Doc. The TL;DR was shared with users, and implementation will be tracked in #2765.