Refactor FERC1 transform to separate out parameters and work with XBRL

Background

Our early FERC 1 transforms stored the parameters they required inside each table-specific transformation function.
This led to poor standardization of the common processes across different tables, and made it hard to audit exactly what was happening and where.
Learning from this, we want to create a standardized structure that defines the most common transformations that we perform across many different FERC (and potentially other) tables.
The system needs to be able to adapt to working with Dagster as we migrate to orchestrating our ETL with that framework.
The parameters will be stored in nested Pydantic data models, with one defined for each table, keyed by the name of the output table.
Transformations that need to be applied before concatenation of the XBRL and DBF derived data will probably need to be specified separately for the two original data sources.

Defining transform params using Pydantic models

We're going to split the transforms into two parts: an operation, and a set of parameters that control the operation.
The parameters will be stored in Pydantic models.
The operations will be either Python functions or class methods.
In addition, each transformation will take the input data that it operates on, either a Series or a DataFrame.
Each different type of transformation will have its own parameter model.
Higher level transformations may have parameter models which are composed of several simple parameter models.
Only the transformations that are being used in several cases and can be generalized will be handled this way. One-off operations that are specific to a particular table or column will be left as bespoke functions or methods which pertain only to that table or column, or done inline.
At the dataset (FERC Form 1) level, there will be a single nested data structure that contains all of the transformation parameters, allowing them to be looked up by table, column, and (where applicable) data source (DBF or XBRL).
This nested data structure can be used to instantiate individual or composite Pydantic objects, containing and validating the transformation parameters.
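
As a minimal sketch of the idea above, assuming Pydantic is installed: the model names, field names (`multiplier`, `from_unit`, `to_unit`, `lower_bound`, `upper_bound`) and the example table entry are all hypothetical and only illustrate how a plain nested dictionary could be turned into validated parameter objects.

```python
from typing import Dict, Optional

from pydantic import BaseModel, ValidationError


class UnitConversion(BaseModel):
    """Hypothetical parameters for converting the units of one column."""

    multiplier: float = 1.0
    from_unit: str = ""
    to_unit: str = ""


class ValidRange(BaseModel):
    """Hypothetical bounds outside of which values would be set to NA."""

    lower_bound: Optional[float] = None
    upper_bound: Optional[float] = None


class TableTransformParams(BaseModel):
    """Composite model holding per-column parameters for each transform."""

    convert_units: Dict[str, UnitConversion] = {}
    nullify_outliers: Dict[str, ValidRange] = {}


# Plain nested data, keyed by table, then transform, then column (made up).
RAW_PARAMS = {
    "plants_steam_ferc1": {
        "convert_units": {
            "net_generation_kwh": {
                "multiplier": 1e-3,
                "from_unit": "_kwh",
                "to_unit": "_mwh",
            },
        },
        "nullify_outliers": {
            "construction_year": {"lower_bound": 1850, "upper_bound": 2030},
        },
    },
}

# Instantiating the composite model validates every nested parameter set.
params = TableTransformParams(**RAW_PARAMS["plants_steam_ferc1"])
```

A malformed entry (for instance a non-numeric `multiplier`) raises a `ValidationError` at instantiation time, which is the main benefit over passing raw dictionaries around.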

Column Transformers

A ColumnTransformer will take a single column (pd.Series) and return a single column.
It can change the name and/or contents of the column.
It will not change the length or index of the column.
It will also need a parameter object that controls the behavior of the transformation.
Several ColumnTransformer objects can be composed together to create a single TableTransformer. E.g. if there are several columns that need unit conversions, they could all be run together in a single table-level convert_units(df, TableUnitConversion) function.
In general column transforms will target a specific combination of table and column (i.e. just because a certain transformation is applied to net_generation_kwh in table 1 doesn't imply it will also get applied to net_generation_kwh in table 2).
[x] StringCategories: Manual string categorizations (to clean up e.g. freeform fuel types w/ cleanstrings())
[x] UnitConversion: Describes unit conversions, e.g. a column rename + multiplier used to convert from KWh to MWh.
[x] ValidRange: Defines min and/or max allowable values outside of which the column will be set to NA (e.g. a construction year of 1294 or 3012). Used with e.g. oob_to_nan()
[x] simplify_strings(): Boolean indicates which columns should be simplified for matching purposes (e.g. plant and utility name)
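
A rough sketch of two of the column transforms listed above, assuming pandas is available. Plain dataclasses stand in for the Pydantic parameter models to keep the example short, and the function names `convert_units()` and `nullify_outliers()` as well as the field names are hypothetical. Each function takes one Series plus a parameter object and returns a Series with the same length and index.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class UnitConversion:
    """Stand-in parameter object: a multiplier plus a column name change."""

    multiplier: float
    from_unit: str
    to_unit: str


@dataclass
class ValidRange:
    """Stand-in parameter object: inclusive bounds, either may be omitted."""

    lower_bound: Optional[float] = None
    upper_bound: Optional[float] = None


def convert_units(col: pd.Series, params: UnitConversion) -> pd.Series:
    """Scale the values and swap the unit suffix in the column name."""
    new_name = str(col.name).replace(params.from_unit, params.to_unit)
    return (col * params.multiplier).rename(new_name)


def nullify_outliers(col: pd.Series, params: ValidRange) -> pd.Series:
    """Set values outside the valid range to NA, keeping length and index."""
    keep = pd.Series(True, index=col.index)
    if params.lower_bound is not None:
        keep &= col >= params.lower_bound
    if params.upper_bound is not None:
        keep &= col <= params.upper_bound
    return col.where(keep)


years = pd.Series([1294, 1975, 3012], name="construction_year")
clean_years = nullify_outliers(years, ValidRange(1850, 2030))

gen_kwh = pd.Series([1000.0, 2500.0], name="net_generation_kwh")
gen_mwh = convert_units(gen_kwh, UnitConversion(1e-3, "_kwh", "_mwh"))
```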

Table Transformers

A TableTransformer will take a dataframe and return a dataframe (altered or new?).
It might only alter the contents of a single column, but could still require values from the other columns to do so.
It can alter the labels of columns as well as the contents.
It may alter the contents of multiple columns.
It can also remove or add columns or rows.
In addition to the dataframe to be altered, the transformer will need a parameter object that controls the behavior of the transformation.
[x] ~RowConsolidation: Record consolidation rules (when & how multiple records should be merged into a single row)~ removed because it was a special case we need to address more generally, applied to a column we're dropping.
[x] RenameColsFerc1: Column rename dictionaries (separate for xbrl & dbf data sources, so FERC 1 specific). This could also be defined as a collection of column transforms, but it's done to every single table, and specifying it as a single dict[str, str] is much more compact.
[ ] TableUnpacker: For reshaping from row- to column- based variables in the old FERC 1 data, for use with unpack_table() or a similar function.
[ ] ColumnCategories: For use with the cols_to_cats() function that gets applied to the FERC 1 DBF data.

Composite Transformers

[x] A table transformer that brings together all of a given type of column transformer in a table (e.g. all the unit conversions)
[x] A table transformer that brings together all the other table transformers defined for a table
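
One possible way to build the first kind of composite transformer is a small function factory that lifts a single-column transform into a table-level one. This is only a sketch assuming pandas; `multicol_transform_factory()` and `scale()` are hypothetical names, and the per-column parameters are plain floats here rather than Pydantic models.

```python
from typing import Callable, Dict, TypeVar

import pandas as pd

P = TypeVar("P")  # Type of the per-column parameter object.


def multicol_transform_factory(
    col_fn: Callable[[pd.Series, P], pd.Series],
) -> Callable[[pd.DataFrame, Dict[str, P]], pd.DataFrame]:
    """Turn a column-level transform into a table-level transform."""

    def table_fn(df: pd.DataFrame, params: Dict[str, P]) -> pd.DataFrame:
        # Apply the column transform to each column that has parameters.
        new_cols = [col_fn(df[name], col_params) for name, col_params in params.items()]
        # Columns without parameters pass through untouched. The transformed
        # columns are appended, so they may come back under new names.
        untouched = df.drop(columns=list(params))
        return pd.concat([untouched, *new_cols], axis="columns")

    return table_fn


def scale(col: pd.Series, factor: float) -> pd.Series:
    """Toy column transform: multiply every value by a constant."""
    return col * factor


scale_columns = multicol_transform_factory(scale)

df = pd.DataFrame(
    {
        "plant_name_ferc1": ["a", "b"],
        "net_generation_kwh": [1000.0, 2000.0],
        "capacity_kw": [10.0, 20.0],
    }
)
out = scale_columns(df, {"net_generation_kwh": 1e-3, "capacity_kw": 1e-3})
```

The input dataframe is left unmodified, and a column that has no entry in the parameter dictionary is simply not transformed.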

Partly derived from conversations in connection with #1721

Notes on the classes and methods developed by @cmgosnell

`TransformerMeta`

Class will be replaced by the nested Pydantic model structures directly
Rather than getting each transform parameter from a big structure and storing it as a class property, the parameters will be attributes of the class itself and be validated using their own more specific Pydantic models.
Need to think about where the table_name gets stored. It's the top level key, but will the whole thing be a Pydantic class, or is it a collection of table-level classes that pertain to the FERC1 data source as a whole?
A few special functions:
- axis_cols_xbrl() gets per-filing PK (Axis) columns from XBRL rename dict.
- primary_keys_xbrl() combines the Axis cols with additional report_year and entity_id cols required to create a full PK in the SQLite DB context (across years, and across entities)

`GenericTransformer`

Understands what table it pertains to so it can look up appropriate transformation parameters.
Defines transformation methods that are applicable to multiple tables for re-use, but do all of the methods defined here apply to all tables? Or are there cases where there'd be no relevant parameters and the transformation would fail? Attempted application of an inapplicable transform should result in a no-op rather than failure.
Separates functionality previously contained in _clean_cols() to add a report_year and record_id.
Standardizes operations up to the point of concatenating the XBRL and DBF derived dataframes into a single dataframe. However, it refers to pre_concat_dbf() and pre_concat_xbrl() methods which are only defined in child classes, and which may not exist. If they must be defined, they should be abstract or no-op methods in the parent class.
Similarly does not provide a template for the execute() method or require that it exist in child classes.
Does not currently check the schema of the output dataframe (has a no-op passthrough)
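
The suggestions above (no-op defaults for the pre-concatenation hooks, and a required execute() method) could look roughly like the following. This is a dependency-free sketch: lists of dictionaries stand in for dataframes, and the method bodies are invented for illustration only.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Record = Dict[str, Any]
Table = List[Record]  # Stand-in for a DataFrame in this sketch.


class AbstractTableTransformer(ABC):
    """Parent class providing safe defaults and a required interface."""

    table_id: str

    def pre_concat_dbf(self, raw_dbf: Table) -> Table:
        """No-op by default, so tables without DBF-specific steps still work."""
        return raw_dbf

    def pre_concat_xbrl(self, raw_xbrl: Table) -> Table:
        """No-op by default, so tables without XBRL-specific steps still work."""
        return raw_xbrl

    def concat_dbf_xbrl(self, raw_dbf: Table, raw_xbrl: Table) -> Table:
        """Run the per-source hooks, then combine the two sources."""
        return self.pre_concat_dbf(raw_dbf) + self.pre_concat_xbrl(raw_xbrl)

    @abstractmethod
    def execute(self, raw_dbf: Table, raw_xbrl: Table) -> Table:
        """Every concrete table transformer must define this."""


class FuelFerc1TableTransformer(AbstractTableTransformer):
    """Concrete class overriding only the hook it actually needs."""

    table_id = "fuel_ferc1"

    def pre_concat_dbf(self, raw_dbf: Table) -> Table:
        # Invented example step: tag each record with its data source.
        return [{**record, "source": "dbf"} for record in raw_dbf]

    def execute(self, raw_dbf: Table, raw_xbrl: Table) -> Table:
        return self.concat_dbf_xbrl(raw_dbf, raw_xbrl)


result = FuelFerc1TableTransformer().execute([{"fuel": "coal"}], [{"fuel": "gas"}])
```

With this layout, trying to instantiate a child class that forgets to define execute() fails immediately with a TypeError instead of failing later at call time.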

`PlantsSteamFerc1`

Defines table specific functions that are required for the pre-concatenation function in the parent class to function properly. What, if anything, about these functions is really specific to the particular table?
For the "database like" tables in FERC 1, the pre-concat operations happening in the steam table seem like they're probably the right ones.
For other tables that involve row-unpacking and reshaping, we'll need to do something else entirely, since they end up having duplicated record_id values (many PUDL records are derived from a single FERC1 record), e.g. plant_in_service_ferc1 IIRC.
Defines some hairy methods that are specific to the steam table.

`FuelFerc1`

Implements a fuel-specific pre-concatenation method, which aggregates duplicate fuel records together. These duplicates are the result of our cleaning up the fuel_type column (which is part of the primary key) and imposing a controlled vocabulary. However, it looks like almost all of the operations taking place pre-concatenation are standard between fuel and steam (and probably many other tables?) So maybe a generic reusable pre-concat method can be provided, and either augmented or overridden instead of duplicating the code in every table-level pre-concatenation function?

A few notes from playing with the generic XBRL extractor for the fuel and steam tables:

Both the duration and instant tables include index and ReportYear columns, which are not part of the explicit primary key, but which do match. If they're left in the table they collide during the merge. Either they should become part of the merge key, or they should get dropped from one of the tables.
The filing_name column is a UUID, uniquely identifying the filing. If the same respondent amends their filing for a given year, I wonder if this is the only ID that will uniquely identify it, since the entity_id and plant names and dates would all be the same. Is it really safe to drop this column or should we hold on to it?
The index value appears to correspond roughly to the spplmnt_num from the DBF data, indicating which copy of the given page the record corresponds to. E.g. in the steam table, it increments for each new plant being reported. It seems like this should be retained in the "raw" merged data.
Given the possibility of multiple filings in a given time period by the same entity, the somewhat arbitrary primary key seems like it would be (filing_name, index).
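
The column collision described above can be reproduced with a toy example, assuming pandas. The data values and the fact columns `NetGeneration` and `InstalledCapacity` are made up; only `index`, `ReportYear`, `filing_name`, `entity_id` and `PlantNameAxis` come from the notes.

```python
import pandas as pd

keys = {
    "entity_id": ["C001", "C001"],
    "filing_name": ["filing-a", "filing-a"],
    "PlantNameAxis": ["Plant 1", "Plant 2"],
    "index": [0, 1],
    "ReportYear": [2021, 2021],
}
duration = pd.DataFrame({**keys, "NetGeneration": [100.0, 200.0]})
instant = pd.DataFrame({**keys, "InstalledCapacity": [10.0, 20.0]})

explicit_pk = ["entity_id", "filing_name", "PlantNameAxis"]

# Merging on the explicit primary key alone leaves the shared non-key
# columns duplicated, with pandas' default _x / _y suffixes.
collided = duration.merge(instant, on=explicit_pk)

# Option 1: make the matching columns part of the merge key.
merged = duration.merge(
    instant,
    on=explicit_pk + ["index", "ReportYear"],
    validate="one_to_one",
)

# Option 2: drop them from one side before merging.
merged_alt = duration.merge(
    instant.drop(columns=["index", "ReportYear"]),
    on=explicit_pk,
    validate="one_to_one",
)

# In this toy data (filing_name, index) identifies each record uniquely.
is_unique = not merged.duplicated(subset=["filing_name", "index"]).any()
```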

Design notes

For the moment, we are still going to be passing around dictionaries of all dataframes.
When we re-factor to use Dagster we'll specify the individual table inputs/outputs dependencies.
Each table-level transform function or class will obtain its transform parameters from the Pydantic parameter classes.
After talking to @cmgosnell we think it makes more sense to compose the transformations hierarchically, with the table ID at the top level, the collection of transforms to apply to the table at the second level, and (in some cases) the set of columns to which column-level transforms should be applied at the bottom level. E.g. for the fuel_ferc1 table we might have the parameter data structure shown in the next section.

Parameter data structure

A nested data structure storing data structures used to instantiate Pydantic classes.

TABLE_TRANSFORM_PARAMS = {
    "fuel_ferc1": {  # Key identifies the table
        "rename_cols": {  # A table-level transform.
            "dbf": {
                "respondent_id": "utility_id_ferc1",
                "plant_name": "plant_name_ferc1",
                "fuel": "fuel_type_code_pudl",
                "fuel_unit": "fuel_units",
                "fuel_avg_heat": "fuel_btu_per_unit",
                "fuel_quantity": "fuel_consumed_units",
                "fuel_cost_burned": "fuel_cost_per_unit_burned",
                "fuel_cost_delvd": "fuel_cost_per_unit_delivered",
                "fuel_cost_btu": "fuel_cost_per_btu",
                "fuel_generaton": "fuel_btu_per_kwh",
            },
            "xbrl": {
                "PlantNameAxis": "plant_name_ferc1",
                "FuelKindAxis": "fuel_type_code_pudl",
                "FuelUnit": "fuel_units",
                "FuelBurnedAverageHeatContent": "fuel_btu_per_unit",
                "QuantityOfFuelBurned": "fuel_consumed_units",
                "AverageCostOfFuelPerUnitBurned": "fuel_cost_per_unit_burned",
                "AverageCostOfFuelPerUnitAsDelivered": "fuel_cost_per_unit_delivered",
                "AverageCostOfFuelBurnedPerMillionBritishThermalUnit": "fuel_cost_per_mmbtu",
                "AverageBritishThermalUnitPerKilowattHourNetGeneration": "fuel_btu_per_kwh",
                "AverageCostOfFuelBurnedPerKilowattHourNetGeneration": "fuel_cost_per_kwh",
            },
        },
        "unit_conversion": {  # A column-level transform
            "fuel_btu_per_unit": BTU_TO_MMBTU,  # Parameters to pass to the transform, on a per-column basis.
            "fuel_cost_per_kwh": PERKWH_TO_PERMWH,  # In this case they define a unit conversion.
            "fuel_btu_per_kwh": BTU_PERKWH_TO_MMBTU_PERMWH,
        },
        "string_categories": {
            "fuel_type_code_pudl": FUEL_TYPES,  # Here parameters are string cleaning dictionaries
            "fuel_units": FUEL_UNITS,
        },
        "simplify_strings": {
            "plant_name_ferc1": True,  # Booleans indicate that the transform applies to the column, but has no params.
            "fuel_type_code_pudl": True,
            "fuel_units": True,
        },
    },
}
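
To illustrate how such a structure might be consumed, here is a small dependency-free sketch that looks up the rename dictionary by table and data source and applies it to a single record. The dictionary is abbreviated from the one above, plain dicts stand in for dataframes, and the helpers `get_rename_dict()` and `rename_record()` are hypothetical.

```python
from typing import Any, Dict

# Abbreviated copy of the structure above, keeping only two renames per source.
TABLE_TRANSFORM_PARAMS: Dict[str, Dict[str, Any]] = {
    "fuel_ferc1": {
        "rename_cols": {
            "dbf": {
                "respondent_id": "utility_id_ferc1",
                "fuel": "fuel_type_code_pudl",
            },
            "xbrl": {
                "PlantNameAxis": "plant_name_ferc1",
                "FuelKindAxis": "fuel_type_code_pudl",
            },
        },
    },
}


def get_rename_dict(table_id: str, source: str) -> Dict[str, str]:
    """Look up the column renames for one table and one data source."""
    return TABLE_TRANSFORM_PARAMS[table_id]["rename_cols"][source]


def rename_record(record: Dict[str, Any], rename: Dict[str, str]) -> Dict[str, Any]:
    """Rename the keys of a record, leaving unlisted keys unchanged."""
    return {rename.get(key, key): value for key, value in record.items()}


dbf_record = {"respondent_id": 1, "fuel": "coal", "report_year": 2020}
xbrl_record = {"FuelKindAxis": "coal", "report_year": 2021}

dbf_renamed = rename_record(dbf_record, get_rename_dict("fuel_ferc1", "dbf"))
xbrl_renamed = rename_record(xbrl_record, get_rename_dict("fuel_ferc1", "xbrl"))
```

After renaming, both sources use the same output column name for the fuel type, which is what makes the later concatenation of DBF and XBRL data possible.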

I've created an example of how these parts work together in a standalone module that's currently part of PR #1721. The example module itself is here.

Questions:

How can the multi-column transform function factory be integrated into the TableTransformer classes, for cases in which there's a table-specific transformation that needs to happen? Can we use exactly the same code to turn single-column transformation methods into multi-column transformation methods?
Is the Protocol setup for the column / multi-column / table transform functions reasonable? Can they be applied to class methods too? Should there be two (column + table) or three (column, multi-column, and table) interfaces? Rather than doing it using functions, is there a clean way to do it using methods in a Protocol class?
Right now all TransformParams are being specified in the module-level constant TRANSFORM_PARAMS. Is that also reasonable for parameters that pertain to transform methods which are only implemented inside the AbstractTableTransformer or in the concrete per-table implementations of that class? Where should the Pydantic classes defining TransformParams for those table-specific transform methods live? Having coupling between classes / methods that only exist inside of the TableTransformer classes and data structures / classes that are defined outside of it seems like it could be messy.
If this interface works well, it seems like we could refactor a lot of things in pudl.helpers to use it, and create a library of column, multi-column, and table-level transform functions and TransformParams classes.
I was partly trying to demonstrate how things would be organized, but I think that in cases where there's an existing pandas method like df.rename() we should just use that to do the transformation, but store the parameters as TransformParams when they're interesting / important (like the column rename dictionaries).
I don't like that for the plain dictionary TransformParams we still have to have a single attribute inside the Pydantic class (e.g. the columns in RenameColumns). We can make the __root__ of the Pydantic model into a dict but the model doesn't automatically behave like a dict in that case -- you still have to add all the dict-like methods to it. __getitem__() and __iter__() were not sufficient to make it work with df.rename() so I gave up and went back to having a named attribute.
Do you see any clear potential improvements to naming?
What TransformParams validations would make sense to add?
Currently there's no DatasetTransformParams model. TRANSFORM_PARAMS is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for important validations.
Similarly the multi-column transforms are currently just dictionaries mapping column names to TransformParams. What additional value / methods / validations / attributes could be added by defining another model to contain that information, and potentially differentiating between TableTransformParams and MultiColumnTransformParams (with the latter being a homogeneous composition of single-column TransformParams)

> How can the multi-column transform function factory be integrated into the TableTransformer classes, for cases in which there's a table-specific transformation that needs to happen? Can we use exactly the same code to turn single-column transformation methods into multi-column transformation methods? Is the Protocol setup for the column / multi-column / table transform functions reasonable? Can they be applied to class methods too? Should there be two (column + table) or three (column, multi-column, and table) interfaces? Rather than doing it using functions, is there a clean way to do it using methods in a Protocol class?

idk how to do this mechanically as a method but it would be great!

> Right now all TransformParams are being specified in the module-level constant TRANSFORM_PARAMS. Is that also reasonable for parameters that pertain to transform methods which are only implemented inside the AbstractTableTransformer or in the concrete per-table implementations of that class? Where should the Pydantic classes defining TransformParams for those table-specific transform methods live? Having coupling between classes / methods that only exist inside of the TableTransformer classes and data structures / classes that are defined outside of it seems like it could be messy.

i'm in favor of storing all of the table/column transforms in the top-level dictionary, mostly for readability. I think this would mean we would either need to define all of the top-level table transform parameters or have table-specific transform parameters where these bespoke params are defined. I think I would be in favor of the table-specific transform params.

> If this interface works well, it seems like we could refactor a lot of things in pudl.helpers to use it, and create a library of column, multi-column, and table-level transform functions and TransformParams classes.

Do you mean the specific helpers that the ferc tables use? For efficiency of getting the xbrl work done, I'd definitely like to prioritize getting through as much of the xbrl before diving into a big pudl-wide change. If we need to convert the widely used helpers we'll need to move the column.

> I was partly trying to demonstrate how things would be organized, but I think that in cases where there's an existing pandas method like df.rename() we should just use that to do the transformation, but store the parameters as TransformParams when they're interesting / important (like the column rename dictionaries).

couldn't agree more

> I don't like that for the plain dictionary TransformParams we still have to have a single attribute inside the Pydantic class (e.g. the columns in RenameColumns). We can make the __root__ of the Pydantic model into a dict but the model doesn't automatically behave like a dict in that case -- you still have to add all the dict-like methods to it. __getitem__() and __iter__() were not sufficient to make it work with df.rename() so I gave up and went back to having a named attribute.

that's unfortunate... but also I don't know enough about the implementation here to make this work for real.

> Do you see any clear potential improvements to naming?

i don't love the lil tab or fn's personally. you changed apply to transform... which is a good improvement. I don't love love transform because it is almost always TableTransformerSomething.transform which feels repetitive. but it's FINE.

You may not love this but I also think it'd be good to name the transform method something different than the parameter. It feels just semi confusing to have a self.rename_columns method and a self.params.rename_columns attribute. Maybe others will disagree but i've found this name duplication for different types of things confusing.

I thiiink I'd vote for TableTransformerAbstract, TableTransformerSteam and TransformParamsAbstract, TransformParamsMultiColumn etc. just put the what it is generally first and what type of instance it is second. I could imagine just a ParamsAbstract, ParamsMultiColumn, ParamsRenameColumns etc.

> What TransformParams validations would make sense to add?

I don't know right now but I think this format gives us a really clear place to add validations in an iterative way.

> Currently there's no DatasetTransformParams model. TRANSFORM_PARAMS is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for important validations.

I think we should add a DatasetTransformParams model if and when it feels necessary/useful. Which I don't think it does rn. But it would be an extra wrapper around what is already here so I don't think it will be difficult to come back in and add it in.

> Similarly the multi-column transforms are currently just dictionaries mapping column names to TransformParams. What additional value / methods / validations / attributes could be added by defining another model to contain that information, and potentially differentiating between TableTransformParams and MultiColumnTransformParams (with the latter being a homogeneous composition of single-column TransformParams)

ditto to my dataset comment. I can definitely see this being useful, but I'd personally wait to develop the template until there is a clear need.

Well, I intentionally made the parameter dictionary key exactly the same as the name of the function / method that it applies to. There's a 1-to-1 mapping between them and I think having the names be different will just be annoying as far as remembering exactly which one goes where or what's the verb vs. the noun etc. I think the context of "is this a function or is it parameters" will be pretty clear. But curious what other people think. If they're not identical, then we'll get different people with different ideas about how they should be named differently. Having them be identical means they are programmatically accessible without needing a dictionary to translate between the two sets of names. I was considering imposing a similar link between the function name and the name of its parameter class, like this_name to ThisName or ThisNameTransformParams.
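
To make the "programmatically accessible" point concrete, here is a dependency-free sketch of what identical names allow. The class `ExampleTableTransformer`, its `drop_columns` transform, and the helpers `snake_to_camel()` and `params_class_name()` are all hypothetical, and a plain dict stands in for a dataframe row.

```python
from typing import Any, Dict


def snake_to_camel(name: str) -> str:
    """Derive a class-style name from a function name: this_name -> ThisName."""
    return "".join(part.capitalize() for part in name.split("_"))


def params_class_name(transform_name: str) -> str:
    """One possible convention: this_name -> ThisNameTransformParams."""
    return snake_to_camel(transform_name) + "TransformParams"


class ExampleTableTransformer:
    """Looks up each transform method by the key of its parameters."""

    def __init__(self, params: Dict[str, Any]) -> None:
        self.params = params

    def rename_columns(self, record: Dict[str, Any], params: Dict[str, str]) -> Dict[str, Any]:
        return {params.get(key, key): value for key, value in record.items()}

    def drop_columns(self, record: Dict[str, Any], params: list) -> Dict[str, Any]:
        return {key: value for key, value in record.items() if key not in params}

    def transform(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Because parameter keys equal method names, no translation table
        # between the two sets of names is needed.
        for name, transform_params in self.params.items():
            record = getattr(self, name)(record, transform_params)
        return record


transformer = ExampleTableTransformer(
    {
        "rename_columns": {"fuel": "fuel_type_code_pudl"},
        "drop_columns": ["row_number"],
    }
)
cleaned = transformer.transform({"fuel": "coal", "row_number": 3})
```

The trade-off raised in the thread still applies: `self.rename_columns` and `self.params["rename_columns"]` share a name, which some readers find confusing.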

I've seen the fn convention used in a lot of places in Python functional programming. I guess the alternative is func. Agree tab isn't great.

The naming convention that I've most commonly seen in Python for classes and subclasses is that the base class is the stem, and the modifier gets prefixed, which would yield a hierarchy like:

AbstractTableTransformer
- Ferc1TableTransformer
- FuelFerc1TableTransformer
- PlantsSteamFerc1TableTransformer
- PlantsSmallFerc1TableTransformer
- ...

This design work has been done, at least to a functional first draft level, in #1722 and #1721. I'll create another issue for refinements and separation of the generic infrastructure from the data source / table specific implementations: #1853
