catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Refactor FERC1 transform to separate out parameters and work with XBRL #1739

Closed cmgosnell closed 1 year ago

cmgosnell commented 1 year ago

Background

Defining transform params using Pydantic models

Column Transformers

Table Transformers

Composite Transformers

Partly derived from conversations in connection with #1721

zaneselvans commented 1 year ago

Notes on the classes and methods developed by @cmgosnell

TransformerMeta

GenericTransformer

PlantsSteamFerc1

FuelFerc1

zaneselvans commented 1 year ago

A few notes from playing with the generic XBRL extractor for the fuel and steam tables:

zaneselvans commented 1 year ago

Design notes

Parameter data structure

A nested data structure storing data structures used to instantiate Pydantic classes.

TABLE_TRANSFORM_PARAMS = {
   "fuel_ferc1": {  # Key identifies the table
        "rename_cols": {  # A table-level transform.
            "dbf": {
                "respondent_id": "utility_id_ferc1",
                "plant_name": "plant_name_ferc1",
                "fuel": "fuel_type_code_pudl",
                "fuel_unit": "fuel_units",
                "fuel_avg_heat": "fuel_btu_per_unit",
                "fuel_quantity": "fuel_consumed_units",
                "fuel_cost_burned": "fuel_cost_per_unit_burned",
                "fuel_cost_delvd": "fuel_cost_per_unit_delivered",
                "fuel_cost_btu": "fuel_cost_per_btu",
                "fuel_generaton": "fuel_btu_per_kwh",
            },
            "xbrl": {
                "PlantNameAxis": "plant_name_ferc1",
                "FuelKindAxis": "fuel_type_code_pudl",
                "FuelUnit": "fuel_units",
                "FuelBurnedAverageHeatContent": "fuel_btu_per_unit",
                "QuantityOfFuelBurned": "fuel_consumed_units",
                "AverageCostOfFuelPerUnitBurned": "fuel_cost_per_unit_burned",
                "AverageCostOfFuelPerUnitAsDelivered": "fuel_cost_per_unit_delivered",
                "AverageCostOfFuelBurnedPerMillionBritishThermalUnit": "fuel_cost_per_mmbtu",
                "AverageBritishThermalUnitPerKilowattHourNetGeneration": "fuel_btu_per_kwh",
                "AverageCostOfFuelBurnedPerKilowattHourNetGeneration": "fuel_cost_per_kwh",
            },
        },
        "unit_conversion": {  # A column-level transform
            "fuel_btu_per_unit": BTU_TO_MMBTU,  # Parameters to pass to the transform, on a per-column basis.
            "fuel_cost_per_kwh": PERKWH_TO_PERMWH,  # In this case they define a unit conversion.
            "fuel_btu_per_kwh": BTU_PERKWH_TO_MMBTU_PERMWH,
        },
        "string_categories": {
            "fuel_type_code_pudl": FUEL_TYPES,  # Here parameters are string cleaning dictionaries
            "fuel_units": FUEL_UNITS,
        },
        "simplify_strings": {
            "plant_name_ferc1": True,  # Booleans indicate that the transform applies to the column, but has no params.
            "fuel_type_code_pudl": True,
            "fuel_units": True,
        },
    },
}
zaneselvans commented 1 year ago

I've created an example of how these parts work together in a standalone module that's currently part of PR #1721. The example module itself is here.

Questions:

cmgosnell commented 1 year ago

How can the multi-column transform function factory be integrated into the TableTransformer classes, for cases in which there's a table-specific transformation that needs to happen? Can we use exactly the same code to turn single-column transformation methods into multi-column transformation methods? Is the Protocol setup for the column / multi-column / table transform functions reasonable? Can they be applied to class methods too? Should there be two (column + table) or three (column, multi-column, and table) interfaces? Rather than doing it using functions, is there a clean way to do it using methods in a Protocol class?

  • idk how to do this mechanically as a method but it would be great!

Right now all TransformParams are being specified in the module-level constant TRANSFORM_PARAMS. Is that also reasonable for parameters that pertain to transform methods which are only implemented inside the AbstractTableTransformer or in the concrete per-table implementations of that class? Where should the Pydantic classes defining TransformParams for those table-specific transform methods live? Having coupling between classes / methods that only exist inside of the TableTransformer classes and data structures / classes that are defined outside of it seems like it could be messy.

  • i'm in favor of storing all of the table/column transforms in the top-level dictionary for mostly readability. I think this would mean we would either need to define all of the top-level table transform parameters or have table-specific transform parameters where these bespoke params are defined. I think I would be in favor of the table-specific transform params. If this interface works well, it seems like we could refactor a lot of things in pudl.helpers to use it, and create a library of column, multi-column, and table-level transform functions and TransformParams classes.
  • Do you mean the specific helpers that the ferc tables use? For efficiency of getting the xbrl work done, I'd definitely like to prioritize getting through as much of the xbrl before diving into a big pudl-wide change. If we need to convert the widely used helpers we'll need to move the column. I was partly trying to demonstrate how things would be organized, but I think that in cases where there's an existing pandas method like df.rename() we should just use that to do the transformation, but store the parameters as TransformParams when they're interesting / important (like the column rename dictionaries).
  • couldn't agree more I don't like that for the plain dictionary TransformParams we still have to have a single attribute inside the Pydantic class (e.g. the columns in RenameColumns). We can make the root of the Pydantic model into a dict but the model doesn't automatically behave like a dict in that case -- you still have to add all the dict-like methods to it. getitem() and iter() were not sufficient to make it work with df.rename() so I gave up and went back to having a named attribute.
  • that's unfortunate... but also I don't know about implementation here to make this work for real. Do you seen any clear potential improvements to naming?
  • i don't love the lil tab or fn's personally. you changed apply to transform... which is a good improvement. I don't love love transform because it is almost always TableTransformerSomething.transform which feels repetitive. but it's FINE.
  • You may not love this but I also think it'd be good to name the transform method something different than the parameter. It feels just semi confusing have a self.rename_columns method and a self.params.rename_columns attribute. Maybe others will disagree but i've fond this name duplication for different types of things confusing.
  • I thiiink I'd vote for TableTransformerAbstract, TableTransformerSteam and TransformParamsAbstract, TransformParamsMultiColumn etc. just put the what it is generally first and what type of instance it is second. I could imagine just a ParamsAbstract, ParamsMultiColumn, ParamsRenameColumns etc. What TransformParams validations would make sense to add?
  • I don't know right now but I think this format gives us a really clear place to add validations in an iterative way. Currently there's no DatasetTransformParams model. TRANSFORM_PARAMS is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for important validations.
  • I think we should add a DatasetTransformParams model if and when it feels necessary/useful. Which I don't think it does rn. But it would be an extra wrapper around what is already here so I don't think it will be difficult to come back in and add it in. Similarly the multi-column transforms are currently just dictionaries mapping column names to TransformParams. What additional value / methods / validations / attributes could be added by defining another model to contain that information, and potentially differentiating between TableTransformParams and MultiColumnTransformParams (with the latter being a homogeneous composition of single-column TransformParams)
  • ditto to my dataset comment. I can definitely see this being useful, but I'd personally wait to develop the template until there is a clear need.
zaneselvans commented 1 year ago

Well, I intentionally made the parameter dictionary key exactly the same as the name of the function / method that it applies to. There's a 1-to-1 mapping between them and I think having the names be different will just be annoying as far as remembering exactly which one goes where or what's the verb vs. the noun etc. I think the context of "is this a function or is it parameters" will be pretty clear. But curious what other people think. If they're not identical, then we'll get different people with different ideas about how they should be named differently. Having them be identical means they are programmatically accessible without needing a dictionary to translate between the two sets of names. I was considering imposing a similar link between the function name and the name, like this_name to ThisName or ThisNameTransformParams

I've seen the fn convention used a lot of places in Python functional programming. I guess the alternative is func. Agree tab isn't great.

The naming conventions that I've most commonly seen in Python for classes and subclasses is that the base-class is the stem, and the modifier gets prefixed, which would yield a hierarchy like:

zaneselvans commented 1 year ago

This design work has been done, at least to a functional first draft level, in #1722 and #1721. I'll create another issue for refinements and separation of the generic infrastructure from the data source / table specific implementations: #1853