Closed bendnorman closed 10 months ago
Some of the discussion in #2275 should be relevant here. Right now, IIRC, the normalized tables are the only tables which don't have some kind of qualifying suffix (e.g. the unsuffixed `boiler_fuel_eia923` table is the normalized one).
In terms of updating naming conventions, the name should convey the following four pieces of information: the table's contents, the data source, the processing stage, and any extras (e.g. `plant_parts_eia`). While doing this, we should aim for a consistent qualifying suffix for every stage of a given table (e.g. `boiler_fuel_eia923_raw`, `boiler_fuel_eia923_clean`, `boiler_fuel_eia923_norm`, `boiler_fuel_eia923_denorm`).

I'd propose the following, which is a bit of a departure from our current naming protocol in the DAG, but I think would be clearest:

`{tablename}_{source}_{processing_stage}_{extras}`

e.g. `plants_steam_ferc1_denorm` or `generation_eia923_denorm_annual`.

Pros of this approach: the processing stage reads as a suffix rather than a prefix (e.g. `clean_boiler_fuel_eia923`), and it is unambiguous whether `boiler_fuel_eia923` is the normalized or denormalized table.

One amendment to the above: the denormalized tables are the ones that will be most widely used. We'll want to use the simplest and clearest naming convention for them, so this may mean we want to drop the `denorm` suffix here (rather than what we currently do, which is drop it for the normalized tables).
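The proposed `{tablename}_{source}_{processing_stage}_{extras}` convention could be sketched as a small helper. This is a hypothetical function for illustration only (not part of PUDL); it also shows the amendment above, where omitting the stage yields the bare name used for the most widely consumed table:

```python
def asset_name(tablename: str, source: str, stage: str = "", extras: str = "") -> str:
    """Compose an asset name as {tablename}_{source}_{processing_stage}_{extras}.

    Hypothetical helper illustrating the proposed convention. An empty
    stage yields the bare table name (per the amendment: the widely used
    denormalized table could drop its suffix).
    """
    parts = [tablename, source]
    if stage:
        parts.append(stage)
    if extras:
        parts.append(extras)
    return "_".join(parts)

print(asset_name("boiler_fuel", "eia923", "raw"))              # boiler_fuel_eia923_raw
print(asset_name("generation", "eia923", "denorm", "annual"))  # generation_eia923_denorm_annual
print(asset_name("plants_steam", "ferc1"))                     # plants_steam_ferc1
```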
This seems like an accurate assessment of our current naming issues @e-belfer; thank you! I agree our "normalized" tables are the only tables / assets lacking a suffix that describes the stage of the data.
Our naming conventions will also affect how we categorize our assets, code, tables and metadata. There are a few ways we are grouping assets:

- Most asset groups follow the `{processing_stage}_{source}` naming convention. There are a few that don't: glue, eia_bulk_elec, epacems, static. The assets in the groups that use `{processing_stage}_{source}` are loaded from modules where the parent module is the `processing_stage` and the `source` is the module name. For example, assets in `pudl.extract.eia860` are a part of the `raw_eia860` asset group, and assets in `pudl.transform.eia860` are a part of the `clean_eia860` group. The assets that don't use the `{processing_stage}_{source}` naming convention live in `pudl.etl.{source}_assets` modules. These `_assets` modules feel like an awkward deviation from how most of our assets are organized.
- The `Package.etl_group` attribute is sometimes used to gather asset names from the metadata for multi_assets. For example: https://github.com/catalyst-cooperative/pudl/blob/f64833876cb047ce85069678233bdd3e090925b0/src/pudl/etl/glue_assets.py#L18-L27 I think `etl_group` was used in a previous version of the ETL and was coopted in the Dagster conversion to dynamically pull table names from the metadata for `multi_asset` decorators that write dataframes to the database. There are other multi_assets, like `pudl.extract.eia860.extract_eia860()`, that store the returned asset names in a module-level tuple. We should decide what to do with `etl_group`! Should we just use the asset group of each asset instead?

As for `{tablename}_{source}_{processing_stage}_{extras}`: I could see us adopting a file structure organized by source and processing stage:
.
└── src/
└── pudl/
├── eia/
│ ├── eia860/
│ │ ├── raw.py
│ │ └── clean.py
│ ├── eia923/
│ │ ├── raw.py
│ │ └── clean.py
│ ├── norm.py
│ └── denorm.py
├── ferc/
│ ├── ferc1/
│ │ ├── raw.py
│ │ └── norm.py
│ ├── denorm.py
│ └── ...
├── epa/
│ └── ...
└── glue/
└── ...
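The `{processing_stage}_{source}` grouping described earlier maps module paths onto asset group names. A minimal sketch of that mapping (hypothetical helper, assuming only the `extract` → `raw` and `transform` → `clean` stages):

```python
# Hypothetical mapping from a module path to a Dagster asset group name,
# illustrating the current {processing_stage}_{source} grouping convention.
STAGE_BY_MODULE = {"extract": "raw", "transform": "clean"}

def group_for_module(module: str) -> str:
    """e.g. 'pudl.extract.eia860' -> 'raw_eia860'."""
    _, parent, source = module.split(".")
    return f"{STAGE_BY_MODULE[parent]}_{source}"

print(group_for_module("pudl.extract.eia860"))    # raw_eia860
print(group_for_module("pudl.transform.eia860"))  # clean_eia860
```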
We should base our table naming convention on standard data modeling techniques. Here are a few questions we should answer:
I think our normalized data roughly follows the popular Star Schema. The star schema organizes data into fact and dimension tables.
The fact table in a star schema contains the measures or metrics that are of interest to the user or organization. For example, in a sales data warehouse, the fact table might contain sales revenue, units sold, and profit margins. Each record in the fact table represents a specific event or transaction, such as a sale or order.
The dimension tables in a star schema contain the descriptive attributes of the measures in the fact table. These attributes are used to slice and dice the data in the fact table, allowing users to analyze the data from different perspectives. For example, in a sales data warehouse, the dimension tables might include product, customer, time, and location.
We have tables where each row measures a specific event (a report from a generator, plant, utility, or respondent), similar to a fact table. Our entity tables contain attributes that describe the events in the fact tables, similar to dimension tables. For example, the `generators_eia860` table contains annually varying generator attributes (facts) and the `generators_entity_eia` table contains static generator attributes (dimensions). Most FERC Form 1 tables are fact tables where each row is a report, and the `utilities_ferc1` table is a dimension table that describes the utilities in the reports.
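The fact/dimension relationship above can be sketched with toy data. The column names here are hypothetical stand-ins, not the real PUDL schema; the point is just the join between an annually varying fact table and a static dimension table:

```python
import sqlite3

# Toy star-schema sketch: a fact table of annually varying generator
# attributes joined to a static dimension table (hypothetical columns).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE generators_entity_eia (
    plant_id INTEGER, generator_id TEXT, prime_mover TEXT,
    PRIMARY KEY (plant_id, generator_id)
);
CREATE TABLE generators_eia860 (
    plant_id INTEGER, generator_id TEXT, report_year INTEGER,
    capacity_mw REAL,
    FOREIGN KEY (plant_id, generator_id)
        REFERENCES generators_entity_eia (plant_id, generator_id)
);
""")
con.execute("INSERT INTO generators_entity_eia VALUES (1, 'G1', 'CT')")
con.executemany(
    "INSERT INTO generators_eia860 VALUES (?, ?, ?, ?)",
    [(1, "G1", 2020, 50.0), (1, "G1", 2021, 55.0)],
)
# Denormalizing is just joining facts to their dimension attributes.
rows = con.execute("""
    SELECT f.report_year, f.capacity_mw, d.prime_mover
    FROM generators_eia860 AS f
    JOIN generators_entity_eia AS d USING (plant_id, generator_id)
    ORDER BY f.report_year
""").fetchall()
print(rows)  # [(2020, 50.0, 'CT'), (2021, 55.0, 'CT')]
```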
These fact and dimension tables can then be combined to create various derived tables that summarize and aggregate the data. I still need to find more examples/documentation on derived tables to recommend naming conventions for all of our "denormalized" tables.
I'm trying to categorize our existing tables to confirm they map to star schema concepts. We discussed how these topics apply to PUDL in a GitHub discussion from September 2022.
A lot of our dimensions are slowly changing. They're more dimension-like than fact-like, but not truly static, so we end up with almost all of the generator / boiler / plant attributes showing up in the annual tables rather than the entity tables. How do those kinds of tables typically fit into data warehouse schemas?
I guess we're really trying to come up with an asset naming convention, with a subset of the assets also being database tables (and other assets possibly becoming database tables at some future date if they're clearly useful in that context).
What is the intended method of organizing / grouping assets in Dagster? Initially I had thought that the `asset_group` was going to be flexible and easy to use, but it seems like it's mostly intended as a coarse DAG visualization tag, and the tooling is more robust around module- and subpackage-based groupings, which can be hierarchically organized and are easy to select groups of assets from. This also seems to be in line with Dagster's recommended project layout. I think I may have been confused about how assets are supposed to be accessed and organized by their fully featured project example, which has 3 asset groups that map directly to 3 subpackages, with no more complicated nesting or relationships.

If it's going to be easiest to work with and refer to assets based on the module / subpackage they're defined within, then the kind of data source and processing stage hierarchy that Ben outlined above seems desirable. At least in my work so far, I've much more frequently wanted to work with the assets of a particular data source end-to-end rather than a particular processing stage. This would invert our current module hierarchy, but that's a change I've wanted to make for a long time, in part because much of our helper tooling is dataset-specific, and this would allow us to keep e.g. the DBF extraction code close to the FERC data and the EIA spreadsheet-specific cleaning functions close to the EIA data. I think we also have more natural hierarchies within the data sources than within the linear cleaning pipelines: we have a bunch of EIA or FERC data sources which can be grouped together and share many quirks.
Alongside the 4 naming chunks in `{tablename}_{source}_{processing_stage}_{extras}`, I think we're also facing a big increase in the number and diversity of `tablename` values we're dealing with, as we integrate all of our former output & analysis tables into the DB directly. With the net generation allocation alone I ended up with assets that...
So we also need to think a bit about how we're going to come up with legible and concise names for the tables themselves.
The current work to translate our output layers into Dagster abstractions and the database isn't trying to change how we actually organize the processing, so that we have a limited scope to get the transition done. Looking forward, I think we might want to do some of that re-organization such that we are doing a better job of organizing new fact and dimension tables that are produced through analysis, and only combining these building blocks into wide, denormalized tables intended for direct use by most people at the end of the process.
Is there any way that we can prepare ourselves for doing that kind of disentanglement later without creating huge disruptions?
For example, right now the MCOE output table contains per-generator estimates of capacity factor, heat rate, and fuel cost per MWh. Each of those variables has its own output method and there are dependencies between them, but the `pudl_out.mcoe()` method also brings in a large number of generator attributes and fills in missing per-generator timesteps, creating a very large denormalized dataframe. If the individual components are stored as separate fact tables in the database, they'll be easier to mix and match discretely when creating new assets. In the case of the monthly MCOE output, the full table has about 5.2M rows, but only 4.4M of them have a capacity factor, 435K have a heat rate, and 300K have a $/MWh estimate associated with them. This is because most generators don't have the data required to calculate heat rate or fuel cost, but we're creating complete monthly time series for all of them and merging in a bunch of generator dimension-table attributes which vary at most annually.
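Keeping the MCOE components as separate narrow fact tables and denormalizing only at the end might look like the following. This is a toy sketch with hypothetical table and column names, mirroring the sparsity described above (most generators have a capacity factor, far fewer have a heat rate or fuel cost):

```python
import sqlite3

# Toy sketch: narrow per-component fact tables, denormalized on demand.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE capacity_factor (gen TEXT PRIMARY KEY, cf REAL);
CREATE TABLE heat_rate (gen TEXT PRIMARY KEY, hr REAL);
CREATE TABLE fuel_cost (gen TEXT PRIMARY KEY, usd_per_mwh REAL);
""")
con.executemany("INSERT INTO capacity_factor VALUES (?, ?)",
                [("G1", 0.6), ("G2", 0.3), ("G3", 0.8)])
con.execute("INSERT INTO heat_rate VALUES ('G1', 10.5)")
con.execute("INSERT INTO fuel_cost VALUES ('G1', 32.0)")

# The wide "MCOE" view is just the components left-joined together;
# generators without the underlying data simply carry NULLs.
wide = con.execute("""
    SELECT cf.gen, cf.cf, hr.hr, fc.usd_per_mwh
    FROM capacity_factor AS cf
    LEFT JOIN heat_rate AS hr USING (gen)
    LEFT JOIN fuel_cost AS fc USING (gen)
    ORDER BY cf.gen
""").fetchall()
print(wide)  # [('G1', 0.6, 10.5, 32.0), ('G2', 0.3, None, None), ('G3', 0.8, None, None)]
```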
So right now we have to define a long & wide MCOE asset to preserve the current processing arrangement and expected outputs, and maybe that MCOE table is part of the data API that we highlight for users, but in the future the way it is constructed can change to keep the data warehouse better organized, without dramatically changing what the user sees?
I think a similar pattern may be useful in the cases where we are filling in / estimating missing values in the normalized fact tables (e.g. in the case of the EIA fuel price estimates in the FRC table). Being able to separately address the original facts and the augmented facts will help us label them appropriately and make it possible to use one or the other if need be in downstream applications.
I think this is mostly about having well-organized and modular tables within the data warehouse. My impression is that most users don't want to think too hard about these individual components and will go straight for the most complete denormalized tables for day-to-day usage. In that context, I think we'd probably end up with tables that look like "monthly generator facts" or "annual generator facts", which include all of the time-varying values reported on the basis of generator ID, and which can easily be merged with any subset of the attributes in the generator dimension tables (generator entity, or generator annual). Maybe it also makes sense to just go ahead and do an exhaustive merge for folks, so they have the utility (operator), plant, and generator information all there in the denormalized time-varying generator fact table, but that would be a monster table, with something like 200 columns. Is that normal / desirable?
Oh, I didn't realize how many of the fields in our annual tables are mostly static. There are three common methods for handling slowly changing dimensions in data warehouses, and Type 2 is the only one that preserves the entire history of a given entity.
Type 2 adds a new row for each version of the entity, with start and end dates on each record. A surrogate key is added to uniquely identify each record, because the natural key will not be unique when there are multiple versions; the surrogate key is then what's referenced from the fact table.
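A minimal Type 2 sketch, using hypothetical fields and toy data (a generator whose capacity changes once): each change closes the current version's validity window and appends a new row with a fresh surrogate key.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal Type 2 slowly-changing-dimension sketch (hypothetical fields):
# each version of an entity gets its own row, a surrogate key, and a
# validity window; the natural key repeats across versions.
@dataclass
class DimRow:
    surrogate_key: int
    generator_id: str          # natural key
    capacity_mw: float
    valid_from: int
    valid_to: Optional[int]    # None = current version

dim: list[DimRow] = []

def upsert(generator_id: str, capacity_mw: float, year: int) -> DimRow:
    """Close out the current version (if the attribute changed) and append a new row."""
    current = next((r for r in dim
                    if r.generator_id == generator_id and r.valid_to is None), None)
    if current is not None:
        if current.capacity_mw == capacity_mw:
            return current          # no change: keep the existing version
        current.valid_to = year     # close the old version's validity window
    row = DimRow(len(dim) + 1, generator_id, capacity_mw, year, None)
    dim.append(row)
    return row

upsert("G1", 50.0, 2018)
upsert("G1", 50.0, 2019)   # unchanged: no new version
upsert("G1", 55.0, 2020)   # uprate: new version, new surrogate key
print([(r.surrogate_key, r.capacity_mw, r.valid_from, r.valid_to) for r in dim])
# [(1, 50.0, 2018, 2020), (2, 55.0, 2020, None)]
```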
After learning more about the star schema model, I don't think we're actually that close to following it. Maybe I'm interpreting the definitions too strictly, but I think we'd have to change a lot about how we model our data to follow this pattern. For example, we'd have to move a bunch of columns from our annual tables to our entity tables and add surrogate keys to follow the slowly changing dimension pattern. Based on what I've read, it isn't common for dimensions to reference each other, yet some of our entity tables refer to other entity or encoding tables.
Instead of having multiple annual tables as fact tables, we'd probably have a smaller set of fact tables where each row is a report for the most granular entity (generators)?
We could spend some more hours thinking through how to adopt this common pattern for PUDL or we could just focus on desired outputs and naming conventions given our existing modeling.
> I guess we're really trying to come up with an asset naming convention, with a subset of the assets also being database tables (and other assets possibly becoming database tables at some future date if they're clearly useful in that context).
Yes! I think that's a good way to frame it.
I'm also excited about inverting our project structure. I think it will make it clearer to folks how to add new data sources: "To integrate a new data source, add a directory for it, add some assets in a submodule, and load the assets into the main `Definitions`." Would we want the data source metadata to live in the data source subpackage, or should it all live in the `pudl.metadata.resources` subpackage?
WIP: Lots to think about! Will flesh out a response tomorrow!
It seems like we want to 1) define a naming convention and 2) decide how to structure the tables we want people to access. I’m not sure in what order we should tackle these. We could:
Option one is probably the fastest path to distributing data, as option two might require reworking a lot of our existing assets. However, option one means we might need to go through two iterations of name changes.
Wow there's a lot to this!
One hot take is that we shouldn't rename tables twice.
Also, have we come to any conclusions about where the data will live? (i.e. in one big happy PUDL db or in separate databases?). Ella mentioned this thread earlier, but I don't think we made any decisions about it. I feel like that might also impact how we decide to name these tables.
The full design doc can be found in this public google doc. The TL;DR was shared with users and the implementation will be tracked in #2765
It sounds like users favor ripping out `PudlTabl` ASAP and accessing the tables directly from the database (#2503). Given we won't have `PudlTabl` methods to act as aliases for the actual table names, we need to decide on a naming convention for our denormalized tables. We'll likely have to update the normalized table names, because the ideal denormalized table names might conflict with the existing normalized table names. This is a scary change!

Questions