catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Generalize process for standardizing and remapping code and label columns #1271

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

I found a bunch of foreign key violations while compiling and standardizing the labeling tables (see #1252), which isn't surprising, since we're linking up these coded columns with their home tables / canonical values for the first time and enforcing the relationship. The current arrangement is pretty ad-hoc, and I think we can take advantage of the new metadata setup to make it much, much better: more complete, easier to maintain, and automatically documented.

Current situation

A slight improvement

from typing import Dict, List

import pandas as pd


def standardize_codes(
    col: pd.Series,
    good_codes: List[str],
    fix_codes: Dict[str, str],
    bad_codes: List[str],
) -> pd.Series:
    """
    Set all codes in a column to a known standard value or NA.

    Args:
        col: the Series whose values are to be standardized.
        good_codes: known, standard codes which are expected in ``col``.
        fix_codes: a mapping from known non-standard codes to standard values.
        bad_codes: known non-standard codes which can't be mapped to a standard
            value, and so will be replaced with NA.

    Returns:
        A Series of the same dtype as the input ``col`` containing only values
        from ``good_codes``, and NA.

    Raises:
        ValueError: if there is any intersection between ``good_codes``,
            ``bad_codes``, or the keys of ``fix_codes``.
        ValueError: if any of the values in ``fix_codes`` don't also appear in
            ``good_codes``.
        ValueError: if there are any values which appear in the input ``col``
            that do **not** appear in ``good_codes``, ``fix_codes``, or
            ``bad_codes``.

    """
    # There should be no overlap between good, bad, and fixable codes:
    if (
        set(good_codes).intersection(bad_codes)
        or set(good_codes).intersection(fix_codes)
        or set(fix_codes).intersection(bad_codes)
    ):
        raise ValueError(
            "Overlap found between good, bad, and fixable codes:\n"
            f"{good_codes=} \n"
            f"{bad_codes=} \n"
            f"fix_codes.keys()={list(fix_codes.keys())} \n"
        )

    # Every fixed code value should be part of good codes
    if set(fix_codes.values()).difference(good_codes):
        raise ValueError(
            "Some fixed codes aren't in the list of good codes:\n"
            f"{good_codes=} \n"
            f"fix_codes.values()={list(fix_codes.values())} \n"
        )
    # Set missing values to pd.NA. Map good codes to themselves
    code_map = {code: code for code in good_codes}
    # Add the fixable codes to the map
    code_map.update(fix_codes)
    # Map the bad codes to NA
    code_map.update({code: pd.NA for code in bad_codes})
    # Every value in the Series should now appear in the map.
    # If not we want to know about it so we don't wipe out unknown codes.
    unknown_codes = set(col.dropna()).difference(code_map)
    if unknown_codes:
        raise ValueError(
            f"Unknown codes found in {col.name}:\n"
            f"{unknown_codes=}"
        )
    # Re-map values, and make sure we retain the same dtype as input col:
    return col.map(code_map).astype(col.dtype)

For a given set of codes (here ENERGY_SOURCES_EIA) and a given table (generators_eia860), we can look up all the columns that those codes should be applied to:

    gens_resource = pudl.metadata.Resource.from_id("generators_eia860")
    # Look up all the generator fields that refer to energy_sources_eia:
    energy_source_code_cols = [
        fk.fields for fk in gens_resource.schema.foreign_keys
        if fk.reference.resource == "energy_sources_eia"
    ]
    # Flatten the list of foreign key fields:
    energy_source_code_cols = [
        col for sublist in energy_source_code_cols for col in sublist
    ]

And then iterate over those columns, cleaning them up:

    for col in energy_source_code_cols:
        gens_df[col] = pudl.helpers.standardize_codes(
            # Make sure we have uniform type and case:
            col=gens_df[col].astype('string').str.upper(),
            good_codes=ENERGY_SOURCES_EIA.code,
            bad_codes=['0', 'OO', 'BM', 'CBL', 'COL', 'N', 'OOG', 'PL', 'ST'],
            fix_codes={
                "BL": "BLQ",   # Black Liquor
                "HPS": "WAT",  # Accidental use of AER Fuel Code
                "MSB": "MSW",  # Municipal Solid Waste
                "MSN": "MSW",  # Municipal Solid Waste
                "WOC": "WC",   # Waste Coal
                "OW": "WO",    # Mistyped Wood
                "WT": "WND",   # Win
            }
        )

If we want to, then we can map the codes to labels (here putting the labels in a new column):

    gens_df["fuel_type_code_pudl"] = (
        gens_df.energy_source_code_1
        .str.upper()
        .map(
            pudl.helpers.label_map(
                ENERGY_SOURCES_EIA,
                from_col="code",
                to_col="fuel_type_code_pudl",
                null_value=pd.NA,
            )
        )
    )
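
The label_map helper isn't shown here. As a rough sketch of the behavior implied by that call site (the signature and defaults are assumptions, not necessarily the actual pudl.helpers API), it just builds a code-to-label dictionary that falls back to null_value for unrecognized codes:

from collections import defaultdict
from typing import Any, Dict

import pandas as pd


def label_map(
    df: pd.DataFrame,
    from_col: str = "code",
    to_col: str = "label",
    null_value: Any = pd.NA,
) -> Dict[str, Any]:
    """Build a mapping from one column of a coding table to another.

    Returning a defaultdict means Series.map() will substitute null_value
    for any key that isn't present in the coding table.
    """
    return defaultdict(
        lambda: null_value,
        df.set_index(from_col)[to_col].to_dict(),
    )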

A bigger improvement

Rather than applying this process to one set of coded columns at a time, it could be automatically applied to every column in the dataframe that has a FK relationship pointing to one of the static label/code tables, analogous to the Resource.format_df() method. For that to work we'd need:

Then in theory I think we could set it up so that something as simple as the following would re-code all of the coded columns in a dataframe:

gens_resource = Resource.from_id("generators_eia860")
gens_df = gens_resource.standardize_codes(gens_df)
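
As a rough sketch (not a working implementation), such a method could walk the resource's foreign keys the same way as above and look up each code table's cleaning rules in some central registry. CODE_METADATA here is a hypothetical placeholder for wherever the good/fixed/bad codes end up living:

import pandas as pd

import pudl.helpers

# Hypothetical registry keyed by code table name; each entry holds the
# good_codes / fix_codes / bad_codes arguments for that table.
CODE_METADATA: dict = {}


def standardize_codes(self, df: pd.DataFrame) -> pd.DataFrame:
    """Standardize every column constrained by a FK to a known code table.

    Sketched as a method that would live on the Resource class.
    """
    for fk in self.schema.foreign_keys:
        rules = CODE_METADATA.get(fk.reference.resource)
        if rules is None:
            continue  # this FK doesn't point at a static code table
        for col in fk.fields:
            df[col] = pudl.helpers.standardize_codes(
                col=df[col].astype("string").str.upper(),
                **rules,
            )
    return df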

In addition, we would then be explicitly storing more structured information, which could be exported into the documentation:

We would also be notified any time a new, unknown code appears in the data since the ETL would break.

Tasks:

zaneselvans commented 2 years ago

Hmm, @ezwelty I'm stuck on how to add good_codes, fix_codes and bad_codes to the Pydantic metadata model. I added these fields in the Resource class definition:

good_codes: List[String] = []
bad_codes: List[String] = []
fix_codes: Dict[String, String] = {}

And then I also added these root validators to check that the relationships between those field values are as expected:

    @pydantic.root_validator
    def _check_no_overlapping_codes(cls, values):  # noqa: N805
        """Verify there's no overlap between good, fixable, and bad codes."""
        if (
            set(values["good_codes"]).intersection(values["bad_codes"])
            or set(values["good_codes"]).intersection(values["fix_codes"])
            or set(values["fix_codes"]).intersection(values["bad_codes"])
        ):
            raise ValueError(
                "Overlap found between good, bad, and fixable codes in\n"
                f"{values['name']}. These must all be disjoint sets:\n"
                f"{values['good_codes']=} \n"
                f"{values['bad_codes']=} \n"
                f"fix_codes.keys()={list(values['fix_codes'].keys())} \n"
            )

    @pydantic.root_validator
    def _check_fixed_codes_are_good_codes(cls, values):  # noqa: N805
        """Check that every every fixed code is also one of the good codes."""
        if set(values["fix_codes"].values()).difference(values["good_codes"]):
            raise ValueError(
                "Some fixed codes aren't in the list of good codes:\n"
                f"{values['good_codes']=} \n"
                f"fix_codes.values()={list(values['fix_codes'].values())} \n"
            )

But now I'm getting an error when I try to re-import pudl:

~/code/catalyst/pudl/src/pudl/metadata/classes.py in from_id(cls, x)
    918     def from_id(cls, x: str) -> "Resource":
    919         """Construct from PUDL identifier (`resource.name`)."""
--> 920         return cls(**cls.dict_from_id(x))
    921 
    922     def to_sql(

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pydantic/main.cpython-39-x86_64-linux-gnu.so in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for Resource
__root__
  'NoneType' object is not subscriptable (type=type_error)

It seems like the default values of good_codes, fix_codes, and bad_codes aren't being used when those elements aren't found in the resource metadata? Not sure what I'm doing wrong here. I tried assigning the default (empty) values inside the dict_from_id() method when no value was passed in, but it didn't seem to make any difference.

zaneselvans commented 2 years ago

But anyway, setting aside my lack of understanding of exactly how to make this work in Pydantic, does this generally seem like a good approach @ezwelty?

ezwelty commented 2 years ago

@zaneselvans Pydantic validators need to return the value they are validating (and/or modifying). You didn't return a value, hence the __root__ being None. I fixed this in a commit.

A few thoughts:

class Encoder(Base):
  df: pd.DataFrame
  fixes: Dict[String, String] = {}
  ignored: List[String] = []

  # Check df (columns and dtypes)
  # Check uniqueness of df['code'], ignored
  # Check intersections of df['code'], fixes, ignored
  # ...

  @property
  def code_map(self) -> Dict[str, Union[str, pd.NA]]:
    # ...

  def encode(self, ds: pd.Series) -> pd.Series:
    # ...

class Field(Base):
  # Field whose values are to be encoded
  # ...
  encoder: Encoder = None

class Resource(Base):
  # Resource which represents a static dataframe of codes
  # ...
  encoder: Encoder = None

note: pd.DataFrame as a type in Pydantic may require setting Base.Config.arbitrary_types_allowed = True.
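
For illustration only, the elided pieces might be filled in roughly like this, reusing the logic from standardize_codes() above (written as a plain class rather than the Pydantic model, and omitting the validation checks listed in the comments):

from typing import Dict, List

import pandas as pd


class Encoder:
    def __init__(self, df: pd.DataFrame, fixes: Dict[str, str], ignored: List[str]):
        self.df = df            # static code table with a "code" column
        self.fixes = fixes      # known non-standard codes -> standard codes
        self.ignored = ignored  # unfixable codes, to be replaced with NA

    @property
    def code_map(self) -> Dict[str, object]:
        code_map = {code: code for code in self.df["code"]}
        code_map.update(self.fixes)
        code_map.update({code: pd.NA for code in self.ignored})
        return code_map

    def encode(self, ds: pd.Series) -> pd.Series:
        unknown = set(ds.dropna()).difference(self.code_map)
        if unknown:
            raise ValueError(f"Unknown codes found in {ds.name}: {unknown}")
        return ds.map(self.code_map).astype(ds.dtype)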

zaneselvans commented 2 years ago

Oh duh. Return values. I thought it was probably something simple. I probably stayed up too late.

zaneselvans commented 2 years ago

Well, I kind of got it working, but it seems messy. Would definitely appreciate your thoughts on the current version again @ezwelty.

ezwelty commented 2 years ago

@zaneselvans There are two distinct concepts at play here:

So maybe metadata.dfs is too general, and we can focus on organizing the metadata for static code tables specifically. Lots of possible approaches, but one would be a metadata.codes with entries like:

CONTRACT_TYPES_EIA = pd.DataFrame(...)
CONTRACT_TYPES_EIA_ENCODING = {
  'fixes': {...},
  'ignored': [...]
}
# or
CONTRACT_TYPES_EIA = {
  'df': pd.DataFrame(...),
  'fixes': {...},
  'ignored': [...]
}

And then in metadata.classes:

class Encoder(Base):
  # ...
  def dict_from_id(cls, name):
    # Build the arguments to Encoder from the name of a PUDL code table.

class Package(Base):
  # ...
  def resources_from_id(cls, name):
    # ...
    # Build encoders for each resource that is a PUDL code table
    # Assign each encoder to the resource fields to be encoded
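
A hedged sketch of what those last two comments could amount to, using the get_resource()/get_field() style helpers discussed further down (all names illustrative, not an agreed-on API):

def assign_encoders(package):
    """Attach each code table's encoder to the fields that reference it."""
    for resource in package.resources:
        for fk in resource.schema.foreign_keys:
            referenced = package.get_resource(fk.reference.resource)
            if referenced is None or referenced.encoder is None:
                continue  # referenced resource isn't a static code table
            for name in fk.fields:
                resource.get_field(name).encoder = referenced.encoder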

Something to keep in mind – are the encoding rules the same for all the columns pointing to a PUDL code table, or are they different and incompatible between resource groups, or between resources (e.g. {'A': 'B'} for eia columns but {'A': 'C'} for glue columns)? If so, we might need a more complex metadata design.

And we should keep in mind that there may be other transform steps that will become formalized as metadata in the future. Then Encoder would become one of several transformers that can be assigned to a Field (or a Resource, or a Package) ...

  • The Encoder model itself seems good.

A few suggestions for Encoder:

  • Adding resource-level overrides via FIELD_METADATA_BY_RESOURCE is okay. Does this seem clearly better than associating the encoder information with the resource definition?

My understanding was that the encoding logic (i.e. code_fixes and ignored_codes) is the same for all encoded fields, regardless of which resource they are in? If so, in the metadata, it definitely makes more sense to me to place these alongside the code table (see above).

  • In the Resource.encode() method I feel like I must not understand how to access the data structures correctly. Needing to iterate over lists of fields or resources to find the one with a particular name, and converting these single-element lists into the standalone value feels harder to read than it needs to be. Is there a better way to access named elements that are stored in a list?

If the encoder is assigned to the field, Resource.encode() reduces to ~field.encoder.encode(df[field.name]) for field in resource.fields if field.encoder. But to answer your question... Resource.fields and Package.resources could be converted from lists to dicts, but that means repeating the name as both the key and an attribute ({'plants_eia': {'name': 'plants_eia', ...}, ...}) and making the order less explicit (which maybe is okay). Or we can add simple helpers Resource.get_field(name) and Package.get_resource(name) to do what you seek.

def get_field(self, name):
  names = [field.name for field in self.fields]
  return self.fields[names.index(name)]
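
That is, something along these lines (a sketch, assuming encoders have already been attached to the fields):

def encode(self, df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the codes in every encoded column of df."""
    for field in self.fields:
        if field.encoder and field.name in df.columns:
            df[field.name] = field.encoder.encode(df[field.name])
    return df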

  • Having to keep track of whether I've just got the dictionary/list/string version of an object, or the instantiated version of it feels kludgy.

Not sure I follow. All/most of the logic that bridges the metadata and the instantiated classes should be packed into the *dict_from_id() methods (and the *.from_id() methods that rely on them).

  • Would it be bad to have lots of different identical encoders floating around attached to different columns?

At best, they would be the same object when instantiated and assigned in Package.from_resource_ids() (see my example above). Arguably, anything related to foreign keys is best constructed there, and not in Resource.dict_from_id(), because foreign keys don't make sense unless the referenced resource is also present...

  • How much checking should we do to make sure that the DataFrame that's been passed to get encoded really corresponds to the table that's being described by the Resource?

Do you mean Encoder.df? Then see my comments about that class. Otherwise, it seems like all we need to worry about right now is whether the column to be encoded encodes without error?

  • Another issue that came up here is that currently the fuel_transportation_modes_eia table uses label as its primary key, and the values in the dataframes/database are the human-readable labels, not the original short codes, while the encoder is attached to the code columns. So purely using the foreign key relationship to choose which reference column to look at might be too restrictive. The original codes are the things that need to be fixed. If we want to use longer human-readable labels, then after the codes get fixed we need to map from the original codes to the labels, and the label column would be the primary key / foreign key column.

Would that not be the case for all tables referencing a code table? My understanding was that code is the (perhaps cleaned-up) original codes and label (or pudl_code, etc) is the more readable code actually used in the PUDL data. If label is not used in the data instead of code, what purpose does it serve?

zaneselvans commented 2 years ago

Why not use a root validator?

All of the validations of codes depend on more than one attribute of the class. Why wouldn't we want to use a root_validator in that situation? If we skip the validation check when one of the other attributes hasn't been validated yet (and so isn't in the values dictionary yet), how do we know that the validation will ever happen?

It seemed like validation that depends on more than one attribute is what the root validator was being used for in their examples.

Encoding without a whole package in the transform step

If it only makes sense to assign Encoders to the fields in the context of creating a whole package, what's the right way to use it in the context of a transform function? If we're inside the transform function for generation_fuel_eia923 then we know that that table is destined to exist in the database, and we know that the static tables defining the codes for some of its columns will exist (because they're static, and they always exist), but we don't necessarily know all of the other tables that will exist in the output DB. We can create a full Package that reflects all of the possible tables, but I wonder if the mismatch between that Package and the tables that are actually going through the ETL process will create unexpected problems.

Also what if we can't simultaneously encode all of the columns within a table -- like if we're using information from the fuel_type_aer column to fill in missing values in the energy_source_eia column, there might be a dependency between the columns -- some cleanup has to happen in one before cleanup can happen in another.

Identifying encoding tables

If we associate Encoders with columns when Package.from_resource_ids() is called, and we iterate over all of the fields (or foreign key relationships) for each resource, how do we identify which FK relations mean we need to add an encoder and how do we construct that encoder? It seems like it needs to be connected to the metadata somehow so we can look up the actual data associated with the encoding table.

From looking at a FK / the metadata, I can see that energy_source_1 is constrained by energy_sources_eia.code but then I need to be able to look up the actual df and code_fixes and ignored_codes with which to create the Encoder that will be associated with the Field for later use.

ezwelty commented 2 years ago

Your last two points might be better worked through on a call. For the first:

All of the validations of codes depend on more than one attribute of the class. Why wouldn't we want to use a root_validator in that situation? If we skip the validation check when one of the other attributes hasn't been validated yet (and so isn't in the values dictionary yet), how do we know that the validation will ever happen?

Well, the first Pydantic example uses a @validator('password2') to check that it matches password: https://pydantic-docs.helpmanual.io/usage/validators/#pre-and-per-item-validators. A @root_validator would also work here, but I find @validator clearer when an attribute is dependent on others that need to be validated first. Plus the error report is labeled with attribute names, rather than __root__.

from pydantic import BaseModel, validator, root_validator, StrictStr

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr
    name3: StrictStr

    @validator('name2')
    def names_match(cls, v, values):
        if v != values['name']:
            raise ValueError('does not match name')
        return v

    @validator('name3')
    def names_match_again(cls, v, values):
        if v != values['name']:
            raise ValueError('does not match name')
        return v

Example(name='ethan', name2='zane', name3='otto')
ValidationError: 2 validation errors for Example
name2
  does not match name (type=value_error)
name3
  does not match name (type=value_error)

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr
    name3: StrictStr

    @root_validator
    def names_match(cls, values):
        if values['name'] != values['name2']:
            raise ValueError('name 2 does not match name')
        return values

    @root_validator
    def names_match_again(cls, values):
        if values['name'] != values['name3']:
            raise ValueError('name 3 does not match name')
        return values

Example(name='ethan', name2='zane', name3='otto')
ValidationError: 2 validation errors for Example
__root__
  name 2 does not match name (type=value_error)
__root__
  name 3 does not match name (type=value_error)

What is definitely important is checking for an attribute in values before accessing it. Otherwise, pydantic crashes and you get nothing. So with either of my examples above, if name is not a StrictStr:

Example(name=123, name2='zane', name3='otto')

KeyError: 'name'

Doh!
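
For completeness, here's what that guard looks like in the earlier example; with it in place an invalid name just produces its own validation error, and the cross-field check is skipped:

from pydantic import BaseModel, StrictStr, validator

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr

    @validator('name2')
    def names_match(cls, v, values):
        # Skip the check if 'name' already failed validation and is missing.
        if 'name' in values and v != values['name']:
            raise ValueError('does not match name')
        return v

Example(name=123, name2='zane')
# Raises a ValidationError for `name` instead of crashing with a KeyError.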

zaneselvans commented 2 years ago

I ended up making it possible to encode either a whole dataframe or just a single column at a time, in the event that there are dependencies between columns that need to be cleaned up.

On trying to figure out how to look up the encoding table for a given column, I ended up associating Encoders with both Resource/Table level metadata and Field/Column level metadata. If there's an Encoder attached to a Resource, then it's a coding table. If it's attached to a Field/Column, then it's the Encoder that should be used to encode that field/column.

I'm still confused about the root validator vs. the dependency between two fields in a field-level validator. In the password1 / password2 example it feels like the dependency is clear -- the validity of password2 depends on the value of password1 but not vice-versa. Whereas with the good codes, ignored codes, and fixed/fixable codes, it feels like the dependency is mutual.

But more importantly -- and maybe there's magic going on that I'm not aware of in the background -- if the validation of one value is conditional on the existence of another value in the values then how do you know that all of the values are actually getting validated? Like what are all the scenarios in which you could have some of the values defined and not others? Bad input values? Optional parameters? Etc.

ezwelty commented 2 years ago

Quoting from the pydantic documentation (https://pydantic-docs.helpmanual.io/usage/validators):

where validators rely on other values, you should be aware that:

  • Validation is done in the order fields are defined. E.g. in the example above, password2 has access to password1 (and name), but password1 does not have access to password2. See Field Ordering for more information on how fields are ordered
  • If validation fails on another field (or that field is missing) it will not be included in values, hence if 'password1' in values and ... in this example.

This seems like a reasonable default. If the first field is invalid (e.g. wrong type), there is no sense using it in further validations since that code will likely fail. For example, as written, the Encoder validators assume that df is in values, and thus that it is a pd.DataFrame with columns code and description. So these will crash if df is invalid (just like they would if an invalid df were included in values). All you need to do is if 'df' in values to know that you can proceed to access values['df']['code']...

zaneselvans commented 2 years ago

Okay, I finally made these changes and added docstrings as requested by @bendnorman in #1272. Going to merge it into dev.