Closed cmgosnell closed 1 year ago
Notes on the classes and methods developed by @cmgosnell
TransformerMeta
table_name
gets stored. It's the top level key, but will the whole thing be a Pydantic class, or is it it a collection of table-level classes that pertain to the FERC1 data source as a whole?axis_cols_xbrl()
gets per-filing PK (Axis) columns from XBRL rename dict.primary_keys_xbrl()
combines the Axis cols with additional report_year
and entity_id
cols required to create a full PK in the SQLite DB context (across years, and across entities)GenericTransformer
_clean_cols()
to add a report_year
and record_id
.pre_concat_dbf()
and pre_concat_xbrl()
methods which are only defined in child classes, and which may not exist. Should be defined as abstract or no-op methods in the parent class if they must be defined.execute()
method or require that it exist in child classes.PlantsSteamFerc1
record_id
value -- since many PUDL records are derived from a single FERC1 record. E.g. plant_in_service_ferc1
IIRC.FuelFerc1
fuel_type
column (which is part of the primary key) and imposing a controlled vocabulary. However, it looks like almost all of the operations taking place pre-concatenation are standard between fuel and steam (and probably many other tables?) So maybe a generic reusable pre-concat method can be provided, and either augmented or overridden instead of duplicating the code in every table-level pre-concatenation function?A few notes from playing with the generic XBRL extractor for the fuel and steam tables:
duration
and instant
tables include index
and ReportYear
columns, which are not part of the explicit primary key, but which do match. If they're left in the table they collide during the merge. Either they should become part of the merge key, or they should get dropped from one of the tables.filing_name
column is a UUID, uniquely identifying the filing. If the same respondent amends their filing for a given year, I wonder if this is the only ID that will uniquely identify it, since the entity_id
and plant names and dates would all be the same. Is it really safe to drop this column or should we hold on to it?index
value appears to correspond roughly to the spplmnt_num
from the DBF data, indicating which copy of the given page the record corresponds to. E.g. in the steam table, it increments for each new plant being reported. It seems like this should be retained in the "raw" merged data.(filing_name, index)
fuel_ferc1
table we might have.A nested data structure storing data structures used to instantiate Pydantic classes.
TABLE_TRANSFORM_PARAMS = {
"fuel_ferc1": { # Key identifies the table
"rename_cols": { # A table-level transform.
"dbf": {
"respondent_id": "utility_id_ferc1",
"plant_name": "plant_name_ferc1",
"fuel": "fuel_type_code_pudl",
"fuel_unit": "fuel_units",
"fuel_avg_heat": "fuel_btu_per_unit",
"fuel_quantity": "fuel_consumed_units",
"fuel_cost_burned": "fuel_cost_per_unit_burned",
"fuel_cost_delvd": "fuel_cost_per_unit_delivered",
"fuel_cost_btu": "fuel_cost_per_btu",
"fuel_generaton": "fuel_btu_per_kwh",
},
"xbrl": {
"PlantNameAxis": "plant_name_ferc1",
"FuelKindAxis": "fuel_type_code_pudl",
"FuelUnit": "fuel_units",
"FuelBurnedAverageHeatContent": "fuel_btu_per_unit",
"QuantityOfFuelBurned": "fuel_consumed_units",
"AverageCostOfFuelPerUnitBurned": "fuel_cost_per_unit_burned",
"AverageCostOfFuelPerUnitAsDelivered": "fuel_cost_per_unit_delivered",
"AverageCostOfFuelBurnedPerMillionBritishThermalUnit": "fuel_cost_per_mmbtu",
"AverageBritishThermalUnitPerKilowattHourNetGeneration": "fuel_btu_per_kwh",
"AverageCostOfFuelBurnedPerKilowattHourNetGeneration": "fuel_cost_per_kwh",
},
},
"unit_conversion": { # A column-level transform
"fuel_btu_per_unit": BTU_TO_MMBTU, # Parameters to pass to the transform, on a per-column basis.
"fuel_cost_per_kwh": PERKWH_TO_PERMWH, # In this case they define a unit conversion.
"fuel_btu_per_kwh": BTU_PERKWH_TO_MMBTU_PERMWH,
},
"string_categories": {
"fuel_type_code_pudl": FUEL_TYPES, # Here parameters are string cleaning dictionaries
"fuel_units": FUEL_UNITS,
},
"simplify_strings": {
"plant_name_ferc1": True, # Booleans indicate that the transform applies to the column, but has no params.
"fuel_type_code_pudl": True,
"fuel_units": True,
},
},
}
I've created an example of how these parts work together in a standalone module that's currently part of PR #1721. The example module itself is here.
TableTransformer
classes, for cases in which there's a table-specific transformation that needs to happen? Can we use exactly the same code to turn single-column transformation methods into multi-column transformation methods?Protocol
setup for the column / multi-column / table transform functions reasonable? Can they be applied to class methods too? Should there be two (column + table) or three (column, multi-column, and table) interfaces? Rather than doing it using functions, is there a clean way to do it using methods in a Protocol
class?TransformParams
are being specified in the module-level constant TRANSFORM_PARAMS
. Is that also reasonable for parameters that pertain to transform methods which are only implemented inside the AbstractTableTransformer
or in the concrete per-table implementations of that class? Where should the Pydantic classes defining TransformParams
for those table-specific transform methods live? Having coupling between classes / methods that only exist inside of the TableTransformer
classes and data structures / classes that are defined outside of it seems like it could be messy.pudl.helpers
to use it, and create a library of column, multi-column, and table-level transform functions and TransformParams
classes.df.rename()
we should just use that to do the transformation, but store the parameters as TransformParams
when they're interesting / important (like the column rename dictionaries).TransformParams
we still have to have a single attribute inside the Pydantic class (e.g. the columns
in RenameColumns
). We can make the __root__
of the Pydantic model into a dict
but the model doesn't automatically behave like a dict in that case -- you still have to add all the dict-like methods to it. __getitem__()
and __iter__()
were not sufficient to make it work with df.rename()
so I gave up and went back to having a named attribute.TransformParams
validations would make sense to add?DatasetTransformParams
model. TRANSFORM_PARAMS
is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for important validations.TransformParams
. What additional value / methods / validations / attributes could be added by defining another model to contain that information, and potentially differentiating between TableTransformParams
and MultiColumnTransformParams
(with the latter being a homogeneous composition of single-column TransformParams
)How can the multi-column transform function factory be integrated into the TableTransformer classes, for cases in which there's a table-specific transformation that needs to happen? Can we use exactly the same code to turn single-column transformation methods into multi-column transformation methods? Is the Protocol setup for the column / multi-column / table transform functions reasonable? Can they be applied to class methods too? Should there be two (column + table) or three (column, multi-column, and table) interfaces? Rather than doing it using functions, is there a clean way to do it using methods in a Protocol class?
- idk how to do this mechanically as a method but it would be great!
Right now all TransformParams are being specified in the module-level constant TRANSFORM_PARAMS. Is that also reasonable for parameters that pertain to transform methods which are only implemented inside the AbstractTableTransformer or in the concrete per-table implementations of that class? Where should the Pydantic classes defining TransformParams for those table-specific transform methods live? Having coupling between classes / methods that only exist inside of the TableTransformer classes and data structures / classes that are defined outside of it seems like it could be messy.
- i'm in favor of storing all of the table/column transforms in the top-level dictionary for mostly readability. I think this would mean we would either need to define all of the top-level table transform parameters or have table-specific transform parameters where these bespoke params are defined. I think I would be in favor of the table-specific transform params. If this interface works well, it seems like we could refactor a lot of things in pudl.helpers to use it, and create a library of column, multi-column, and table-level transform functions and TransformParams classes.
- Do you mean the specific helpers that the ferc tables use? For efficiency of getting the xbrl work done, I'd definitely like to prioritize getting through as much of the xbrl before diving into a big pudl-wide change. If we need to convert the widely used helpers we'll need to move the column. I was partly trying to demonstrate how things would be organized, but I think that in cases where there's an existing pandas method like df.rename() we should just use that to do the transformation, but store the parameters as TransformParams when they're interesting / important (like the column rename dictionaries).
- couldn't agree more I don't like that for the plain dictionary TransformParams we still have to have a single attribute inside the Pydantic class (e.g. the columns in RenameColumns). We can make the root of the Pydantic model into a dict but the model doesn't automatically behave like a dict in that case -- you still have to add all the dict-like methods to it. getitem() and iter() were not sufficient to make it work with df.rename() so I gave up and went back to having a named attribute.
- that's unfortunate... but also I don't know about implementation here to make this work for real. Do you seen any clear potential improvements to naming?
- i don't love the lil
tab
orfn
's personally. you changedapply
totransform
... which is a good improvement. I don't love lovetransform
because it is almost alwaysTableTransformerSomething.transform
which feels repetitive. but it's FINE.- You may not love this but I also think it'd be good to name the transform method something different than the parameter. It feels just semi confusing have a
self.rename_columns
method and aself.params.rename_columns
attribute. Maybe others will disagree but i've fond this name duplication for different types of things confusing.- I thiiink I'd vote for
TableTransformerAbstract
,TableTransformerSteam
andTransformParamsAbstract
,TransformParamsMultiColumn
etc. just put the what it is generally first and what type of instance it is second. I could imagine just aParamsAbstract
,ParamsMultiColumn
,ParamsRenameColumns
etc. What TransformParams validations would make sense to add?- I don't know right now but I think this format gives us a really clear place to add validations in an iterative way. Currently there's no DatasetTransformParams model. TRANSFORM_PARAMS is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for important validations.
- I think we should add a
DatasetTransformParams
model if and when it feels necessary/useful. Which I don't think it does rn. But it would be an extra wrapper around what is already here so I don't think it will be difficult to come back in and add it in. Similarly the multi-column transforms are currently just dictionaries mapping column names to TransformParams. What additional value / methods / validations / attributes could be added by defining another model to contain that information, and potentially differentiating between TableTransformParams and MultiColumnTransformParams (with the latter being a homogeneous composition of single-column TransformParams)- ditto to my dataset comment. I can definitely see this being useful, but I'd personally wait to develop the template until there is a clear need.
Well, I intentionally made the parameter dictionary key exactly the same as the name of the function / method that it applies to. There's a 1-to-1 mapping between them and I think having the names be different will just be annoying as far as remembering exactly which one goes where or what's the verb vs. the noun etc. I think the context of "is this a function or is it parameters" will be pretty clear. But curious what other people think. If they're not identical, then we'll get different people with different ideas about how they should be named differently. Having them be identical means they are programmatically accessible without needing a dictionary to translate between the two sets of names. I was considering imposing a similar link between the function name and the name, like this_name
to ThisName
or ThisNameTransformParams
I've seen the fn
convention used a lot of places in Python functional programming. I guess the alternative is func
. Agree tab
isn't great.
The naming conventions that I've most commonly seen in Python for classes and subclasses is that the base-class is the stem, and the modifier gets prefixed, which would yield a hierarchy like:
AbstractTableTransformer
Ferc1TableTransformer
FuelFerc1TableTransformer
PlantsSteamFerc1TableTransformer
PlantsSmallFerc1TableTransformer
This design work has been done, at least to a functional first draft level, in #1722 and #1721. I'll create another issue for refinements and separation of the generic infrastructure from the data source / table specific implementations: #1853
Background
Defining transform params using Pydantic models
Series
or aDataFrame
Column Transformers
pd.Series
) and return a single column.TableTransformer
. E.g. If there are several columns that need unit conversions, they could all be run together in a single table-levelconvert_units(df, TableUnitConversion)
function.net_generation_kwh
in table 1 doesn't imply it will also get applied tonet_generation_kwh
in table 2).StringCategories
: Manual string categorizations (to clean up e.g. freeform fuel types w/cleanstrings()
)UnitConversion
: Describes unit conversions, e.g. a column rename + multiplier used to convert from KWh to MWh.ValidRange
Defines min and/or max allowable values outside of which the column will be set to NA (e.g. a construction year of 1294 or 3012). Used with e.g.oob_to_nan()
simplify_strings()
: Boolean indicates which columns should be simplified for matching purposes (e.g. plant and utility name)Table Transformers
RowConsolidation
: Record consolidation rules (when & how multiple records should be merged into a single row)~ removed because it was a special case we need to address more generally, applied to a column we're dropping.RenameColsFerc1
: Column rename dictionaries (separate for xbrl & dbf data sources, so FERC 1 specific). This could also be defined as a collection of column transforms, but it's done to every single table, and specifying it as a singledict[str, str]
is much more compact.TableUnpacker
: For reshaping from row- to column- based variables in the old FERC 1 data, and use withunpack_table()
or a similar function.ColumnCategories
: For use with thecols_to_cats()
function that gets applied to the FERC 1 DBF data.Composite Transformers
Partly derived from conversations in connection with #1721