catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Generalize process for standardizing and remapping code and label columns #1271

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

I found a bunch of foreign key violations while compiling and standardizing the labeling tables (see #1252), which isn't surprising, since we're linking up these coded columns with their home tables / canonical values for the first time and enforcing the relationship. The current arrangement is pretty ad-hoc, and I think we can take advantage of the new metadata setup to make it much, much better: more complete, easier to maintain, and automatically documented.

Current situation

A slight improvement

from typing import Dict, List

import pandas as pd


def standardize_codes(
    col: pd.Series,
    good_codes: List[str],
    fix_codes: Dict[str, str],
    bad_codes: List[str],
) -> pd.Series:
    """
    Set all codes in a column to a known standard value or NA.

    Args:
        col: the Series whose values are to be standardized.
        good_codes: known, standard codes which are expected in ``col``.
        fix_codes: a mapping from known non-standard codes to standard values.
        bad_codes: known non-standard codes which can't be mapped to a standard
            value, and so will be replaced with NA.

    Returns:
        A Series of the same dtype as the input ``col`` containing only values
        from ``good_codes``, and NA.

    Raises:
        ValueError: if there is any intersection between ``good_codes``,
            ``bad_codes``, or the keys of ``fix_codes``.
        ValueError: if any of the values in ``fix_codes`` don't also appear in
            ``good_codes``.
        ValueError: if there are any values which appear in the input ``col``
            that do **not** appear in ``good_codes``, ``fix_codes``, or
            ``bad_codes``.

    """
    # There should be no overlap between good, bad, and fixable codes:
    if (
        set(good_codes).intersection(bad_codes)
        or set(good_codes).intersection(fix_codes)
        or set(fix_codes).intersection(bad_codes)
    ):
        raise ValueError(
            "Overlap found between good, bad, and fixable codes:\n"
            f"{good_codes=} \n"
            f"{bad_codes=} \n"
            f"fix_codes.keys()={list(fix_codes.keys())} \n"
        )

    # Every fixed code value should be part of good codes
    if set(fix_codes.values()).difference(good_codes):
        raise ValueError(
            "Some fixed codes aren't in the list of good codes:\n"
            f"{good_codes=} \n"
            f"fix_codes.values()={list(fix_codes.values())} \n"
        )
    # Set missing values to pd.NA. Map good codes to themselves
    code_map = {code: code for code in good_codes}
    # Add the fixable codes to the map
    code_map.update(fix_codes)
    # Map the bad codes to NA
    code_map.update({code: pd.NA for code in bad_codes})
    # Every value in the Series should now appear in the map.
    # If not we want to know about it so we don't wipe out unknown codes.
    unknown_codes = set(col.dropna()).difference(code_map)
    if unknown_codes:
        raise ValueError(
            f"Unknown codes found in {col.name}:\n"
            f"{unknown_codes=}"
        )
    # Re-map values, and make sure we retain the same dtype as input col:
    return col.map(code_map).astype(col.dtype)

For a given set of codes (here ENERGY_SOURCES_EIA) and a given table (generators_eia860), we can look up all the columns that those codes should be applied to:

    gens_resource = pudl.metadata.Resource.from_id("generators_eia860")
    # Look up all the generator fields that refer to energy_sources_eia:
    energy_source_code_cols = [
        fk.fields for fk in gens_resource.schema.foreign_keys
        if fk.reference.resource == "energy_sources_eia"
    ]
    # Flatten the list of foreign key fields:
    energy_source_code_cols = [
        col for sublist in energy_source_code_cols for col in sublist
    ]

And then iterate over those columns, cleaning them up:

    for col in energy_source_code_cols:
        gens_df[col] = pudl.helpers.standardize_codes(
            # Make sure we have uniform type and case:
            col=gens_df[col].astype('string').str.upper(),
            good_codes=ENERGY_SOURCES_EIA.code,
            bad_codes=['0', 'OO', 'BM', 'CBL', 'COL', 'N', 'OOG', 'PL', 'ST'],
            fix_codes={
                "BL": "BLQ",   # Black Liquor
                "HPS": "WAT",  # Accidental use of AER Fuel Code
                "MSB": "MSW",  # Municipal Solid Waste
                "MSN": "MSW",  # Municipal Solid Waste
                "WOC": "WC",   # Waste Coal
                "OW": "WO",    # Mistyped Wood
                "WT": "WND",   # Win
            }
        )

If we want to, then we can map the codes to labels (here putting the labels in a new column):

    gens_df["fuel_type_code_pudl"] = (
        gens_df.energy_source_code_1
        .str.upper()
        .map(
            pudl.helpers.label_map(
                ENERGY_SOURCES_EIA,
                from_col="code",
                to_col="fuel_type_code_pudl",
                null_value=pd.NA,
            )
        )
    )
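
The label_map helper isn't shown here. As a rough sketch of the behavior implied by that call site (the signature and defaults are assumptions, not necessarily the actual pudl.helpers API), it just builds a code-to-label dictionary that falls back to null_value for unrecognized codes:

from collections import defaultdict
from typing import Any, Dict

import pandas as pd


def label_map(
    df: pd.DataFrame,
    from_col: str = "code",
    to_col: str = "label",
    null_value: Any = pd.NA,
) -> Dict[str, Any]:
    """Build a mapping from one column of a coding table to another.

    Returning a defaultdict means Series.map() will substitute null_value
    for any key that isn't present in the coding table.
    """
    return defaultdict(
        lambda: null_value,
        df.set_index(from_col)[to_col].to_dict(),
    )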

A bigger improvement

Rather than applying this process to one set of coded columns at a time, it could be automatically applied to every column in the dataframe that has a FK relationship pointing to one of the static label/code tables, analogous to the Resource.format_df() method. For that to work we'd need:

Then in theory I think we could set it up so that something as simple as the following would re-code all of the coded columns in a dataframe:

gens_resource = Resource.from_id("generators_eia860")
gens_df = gens_resource.standardize_codes(gens_df)
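
As a rough sketch (not a working implementation), such a method could walk the resource's foreign keys the same way as above and look up each code table's cleaning rules in some central registry. CODE_METADATA here is a hypothetical placeholder for wherever the good/fixed/bad codes end up living:

import pandas as pd

import pudl.helpers

# Hypothetical registry keyed by code table name; each entry holds the
# good_codes / fix_codes / bad_codes arguments for that table.
CODE_METADATA: dict = {}


def standardize_codes(self, df: pd.DataFrame) -> pd.DataFrame:
    """Standardize every column constrained by a FK to a known code table.

    Sketched as a method that would live on the Resource class.
    """
    for fk in self.schema.foreign_keys:
        rules = CODE_METADATA.get(fk.reference.resource)
        if rules is None:
            continue  # this FK doesn't point at a static code table
        for col in fk.fields:
            df[col] = pudl.helpers.standardize_codes(
                col=df[col].astype("string").str.upper(),
                **rules,
            )
    return df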

In addition, we would then be explicitly storing more structured information, which could be exported into the documentation:

We would also be notified any time a new, unknown code appears in the data since the ETL would break.

Tasks:

zaneselvans commented 2 years ago

Hmm, @ezwelty I'm stuck on how to add good_codes, fix_codes and bad_codes to the Pydantic metadata model. I added these fields in the Resource class definition:

good_codes: List[String] = []
bad_codes: List[String] = []
fix_codes: Dict[String, String] = {}

And then I also added these root validators to check that the relationships between those field values are as expected:

    @pydantic.root_validator
    def _check_no_overlapping_codes(cls, values):  # noqa: N805
        """Verify there's no overlap between good, fixable, and bad codes."""
        if (
            set(values["good_codes"]).intersection(values["bad_codes"])
            or set(values["good_codes"]).intersection(values["fix_codes"])
            or set(values["fix_codes"]).intersection(values["bad_codes"])
        ):
            raise ValueError(
                "Overlap found between good, bad, and fixable codes in\n"
                f"{values['name']}. These must all be disjoint sets:\n"
                f"{values['good_codes']=} \n"
                f"{values['bad_codes']=} \n"
                f"fix_codes.keys()={list(values['fix_codes'].keys())} \n"
            )

    @pydantic.root_validator
    def _check_fixed_codes_are_good_codes(cls, values):  # noqa: N805
        """Check that every every fixed code is also one of the good codes."""
        if set(values["fix_codes"].values()).difference(values["good_codes"]):
            raise ValueError(
                "Some fixed codes aren't in the list of good codes:\n"
                f"{values['good_codes']=} \n"
                f"fix_codes.values()={list(values['fix_codes'].values())} \n"
            )

But now I'm getting an error when I try to re-import pudl:

~/code/catalyst/pudl/src/pudl/metadata/classes.py in from_id(cls, x)
    918     def from_id(cls, x: str) -> "Resource":
    919         """Construct from PUDL identifier (`resource.name`)."""
--> 920         return cls(**cls.dict_from_id(x))
    921 
    922     def to_sql(

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pydantic/main.cpython-39-x86_64-linux-gnu.so in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for Resource
__root__
  'NoneType' object is not subscriptable (type=type_error)

It seems like the default values of good_codes, fix_codes, and bad_codes aren't being used when those elements aren't found in the resource metadata? Not sure what I'm doing wrong here. I tried assigning the default (empty) values inside the dict_from_id() method when no value was passed in, but it didn't seem to make any difference.

zaneselvans commented 2 years ago

But anyway, setting aside my lack of understanding of exactly how to make this work in Pydantic, does this generally seem like a good approach @ezwelty?

ezwelty commented 2 years ago

@zaneselvans Pydantic validators need to return the value they are validating (and/or modifying). You didn't return a value, hence the __root__ being None. I fixed this in a commit.

A few thoughts:

class Encoder(Base):
  df: pd.DataFrame
  fixes: Dict[String, String] = {}
  ignored: List[String] = []

  # Check df (columns and dtypes)
  # Check uniqueness of df['code'], ignored
  # Check intersections of df['code'], fixes, ignored
  # ...

  @property
  def code_map(self) -> Dict[str, Union[str, pd.NA]]:
    # ...

  def encode(self, ds: pd.Series) -> pd.Series:
    # ...

class Field(Base):
  # Field whose values are to be encoded
  # ...
  encoder: Encoder = None

class Resource(Base):
  # Resource which represents a static dataframe of codes
  # ...
  encoder: Encoder = None

note: pd.DataFrame as a type in Pydantic may require setting Base.Config.arbitrary_types_allowed = True.
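
For illustration only, the elided pieces might be filled in roughly like this, reusing the logic from standardize_codes() above (written as a plain class rather than the Pydantic model, and omitting the validation checks listed in the comments):

from typing import Dict, List

import pandas as pd


class Encoder:
    def __init__(self, df: pd.DataFrame, fixes: Dict[str, str], ignored: List[str]):
        self.df = df            # static code table with a "code" column
        self.fixes = fixes      # known non-standard codes -> standard codes
        self.ignored = ignored  # unfixable codes, to be replaced with NA

    @property
    def code_map(self) -> Dict[str, object]:
        code_map = {code: code for code in self.df["code"]}
        code_map.update(self.fixes)
        code_map.update({code: pd.NA for code in self.ignored})
        return code_map

    def encode(self, ds: pd.Series) -> pd.Series:
        unknown = set(ds.dropna()).difference(self.code_map)
        if unknown:
            raise ValueError(f"Unknown codes found in {ds.name}: {unknown}")
        return ds.map(self.code_map).astype(ds.dtype)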

zaneselvans commented 2 years ago

Oh duh. Return values. I thought it was probably something simple. I probably stayed up too late.

zaneselvans commented 2 years ago

Well, I kind of got it working, but it seems messy. Would definitely appreciate your thoughts on the current version again @ezwelty.

ezwelty commented 2 years ago

@zaneselvans There are two distinct concepts at play here:

So maybe metadata.dfs is too general, and we can focus on organizing the metadata for static code tables specifically. Lots of possible approaches, but one would be a metadata.codes with entries like:

CONTRACT_TYPES_EIA = pd.DataFrame(...)
CONTRACT_TYPES_EIA_ENCODING = {
  'fixes': {...},
  'ignored': [...]
}
# or
CONTRACT_TYPES_EIA = {
  'df': pd.DataFrame(...),
  'fixes': {...},
  'ignored': [...]
}

And then in metadata.classes:

class Encoder(Base):
  # ...
  def dict_from_id(cls, name):
    # Build the arguments to Encoder from the name of a PUDL code table.

class Package(Base):
  # ...
  def resources_from_id(cls, name):
    # ...
    # Build encoders for each resource that is a PUDL code table
    # Assign each encoder to the resource fields to be encoded
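
A hedged sketch of what those last two comments could amount to, using the get_resource()/get_field() style helpers discussed further down (all names illustrative, not an agreed-on API):

def assign_encoders(package):
    """Attach each code table's encoder to the fields that reference it."""
    for resource in package.resources:
        for fk in resource.schema.foreign_keys:
            referenced = package.get_resource(fk.reference.resource)
            if referenced is None or referenced.encoder is None:
                continue  # referenced resource isn't a static code table
            for name in fk.fields:
                resource.get_field(name).encoder = referenced.encoder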

Something to keep in mind – are the encoding rules the same for all the columns pointing to a PUDL code table, or are they different and incompatible between resource groups, or between resources (e.g. {'A': 'B'} for eia columns but {'A': 'C'} for glue columns)? If so, we might need a more complex metadata design.

And we should keep in mind that there may be other transform steps that will become formalized as metadata in the future. Then Encoder would become one of several transformers that can be assigned to a Field (or a Resource, or a Package) ...

  • The Encoder model itself seems good.

A few suggestions for Encoder:

  • Adding resource-level overrides via FIELD_METADATA_BY_RESOURCE is okay. Does this seem clearly better than associating the encoder information with the resource definition?

My understanding was that the encoding logic (i.e. code_fixes and ignored_codes) is the same for all encoded fields, regardless of which resource they are in? If so, in the metadata, it definitely makes more sense to me to place these alongside the code table (see above).

  • In the Resource.encode() method I feel like I must not understand how to access the data structures correctly. Needing to iterate over lists of fields or resources to find the one with a particular name, and converting these single-element lists into the standalone value feels harder to read than it needs to be. Is there a better way to access named elements that are stored in a list?

If the encoder is assigned to the field, Resource.encode() reduces to ~field.encoder.encode(df[field.name]) for field in resource.fields if field.encoder. But to answer your question... Resource.fields and Package.resources could be converted from lists to dicts, but that means repeating the name as both the key and an attribute ({'plants_eia': {'name': 'plants_eia', ...}, ...}) and making the order less explicit (which maybe is okay). Or we can add simple helpers Resource.get_field(name) and Package.get_resource(name) to do what you seek.

def get_field(self, name):
  names = [field.name for field in self.fields]
  return self.fields[names.index(name)]
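
That is, something along these lines (a sketch, assuming encoders have already been attached to the fields):

def encode(self, df: pd.DataFrame) -> pd.DataFrame:
    """Standardize the codes in every encoded column of df."""
    for field in self.fields:
        if field.encoder and field.name in df.columns:
            df[field.name] = field.encoder.encode(df[field.name])
    return df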

  • Having to keep track of whether I've just got the dictionary/list/string version of an object, or the instantiated version of it feels kludgy.

Not sure I follow. All/most of the logic that bridges the metadata and the instantiated classes should be packed into the *dict_from_id() methods (and the *.from_id() methods that rely on them).

  • Would it be bad to have lots of different identical encoders floating around attached to different columns?

At best, they would be the same object when instantiated and assigned in Package.from_resource_ids() (see my example above). Arguably, anything related to foreign keys is best constructed there, and not in Resource.dict_from_id(), because foreign keys don't make sense unless the referenced resource is also present...

  • How much checking should we do to make sure that the DataFrame that's been passed to get encoded really corresponds to the table that's being described by the Resource?

Do you mean Encoder.df? Then see my comments about that class. Otherwise, it seems like all we need to worry about right now is whether the column to be encoded encodes without error?

  • Another issue that came up here is that currently the fuel_transportation_modes_eia table uses label as its primary key, and the values in the dataframes/database are the human-readable labels, not the original short codes, while the encoder is attached to the code columns. So purely using the foreign key relationship to choose which reference column to look at might be too restrictive. The original codes are the things that need to be fixed. If we want to use longer human-readable labels, then after the codes get fixed we need to map from the original codes to the labels, and the label column would be the primary key / foreign key column.

Would that not be the case for all tables referencing a code table? My understanding was that code is the (perhaps cleaned-up) original codes and label (or pudl_code, etc) is the more readable code actually used in the PUDL data. If label is not used in the data instead of code, what purpose does it serve?

zaneselvans commented 2 years ago

Why not use a root validator?

All of the validations of codes depend on more than one attribute of the class. Why wouldn't we want to use a root_validator in that situation? If we skip the validation check when one of the other attributes hasn't been validated yet (and so isn't in the values dictionary yet), how do we know that the validation will ever happen?

It seemed like validation that depends on more than one attribute is what the root validator was being used for in their examples.

Encoding without a whole package in the transform step

If it only makes sense to assign Encoders to the fields in the context of creating a whole package, what's the right way to use it in the context of a transform function? If we're inside the transform function for generation_fuel_eia923 then we know that that table is destined to exist in the database, and we know that the static tables defining the codes for some of its columns will exist (because they're static, and they always exist), but we don't necessarily know all of the other tables that will exist in the output DB. We can create a full Package that reflects all of the possible tables, but I wonder if the mismatch between that Package and the tables that are actually going through the ETL process will create unexpected problems.

Also what if we can't simultaneously encode all of the columns within a table -- like if we're using information from the fuel_type_aer column to fill in missing values in the energy_source_eia column, there might be a dependency between the columns -- some cleanup has to happen in one before cleanup can happen in another.

Identifying encoding tables

If we associate Encoders with columns when Package.from_resource_ids() is called, and we iterate over all of the fields (or foreign key relationships) for each resource, how do we identify which FK relations mean we need to add an encoder and how do we construct that encoder? It seems like it needs to be connected to the metadata somehow so we can look up the actual data associated with the encoding table.

From looking at a FK / the metadata, I can see that energy_source_1 is constrained by energy_sources_eia.code but then I need to be able to look up the actual df and code_fixes and ignored_codes with which to create the Encoder that will be associated with the Field for later use.

ezwelty commented 2 years ago

Your last two points might be better worked through on a call. For the first:

All of the validations of codes depend on more than one attribute of the class. Why wouldn't we want to use a root_validator in that situation? If we skip the validation check when one of the other attributes hasn't been validated yet (and so isn't in the values dictionary yet), how do we know that the validation will ever happen?

Well, the first Pydantic example uses a @validator('password2') to check that it matches password: https://pydantic-docs.helpmanual.io/usage/validators/#pre-and-per-item-validators. A @root_validator would also work here, but I find @validator clearer when an attribute is dependent on others that need to be validated first. Plus the error report is labeled with attribute names, rather than __root__.

from pydantic import BaseModel, validator, root_validator, StrictStr

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr
    name3: StrictStr

    @validator('name2')
    def names_match(cls, v, values):
        if v != values['name']:
            raise ValueError('does not match name')
        return v

    @validator('name3')
    def names_match_again(cls, v, values):
        if v != values['name']:
            raise ValueError('does not match name')
        return v

Example(name='ethan', name2='zane', name3='otto')
ValidationError: 2 validation errors for Example
name2
  does not match name (type=value_error)
name3
  does not match name (type=value_error)

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr
    name3: StrictStr

    @root_validator
    def names_match(cls, values):
        if values['name'] != values['name2']:
            raise ValueError('name 2 does not match name')
        return values

    @root_validator
    def names_match_again(cls, values):
        if values['name'] != values['name3']:
            raise ValueError('name 3 does not match name')
        return values

Example(name='ethan', name2='zane', name3='otto')
ValidationError: 2 validation errors for Example
__root__
  name 2 does not match name (type=value_error)
__root__
  name 3 does not match name (type=value_error)

What is definitely important is checking for an attribute in values before accessing it. Otherwise, pydantic crashes and you get nothing. So with either of my examples above, if name is not a StrictStr:

Example(name=123, name2='zane', name3='otto')

KeyError: 'name'

Doh!
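
For completeness, here's what that guard looks like in the earlier example; with it in place an invalid name just produces its own validation error, and the cross-field check is skipped:

from pydantic import BaseModel, StrictStr, validator

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr

    @validator('name2')
    def names_match(cls, v, values):
        # Skip the check if 'name' already failed validation and is missing.
        if 'name' in values and v != values['name']:
            raise ValueError('does not match name')
        return v

Example(name=123, name2='zane')
# Raises a ValidationError for `name` instead of crashing with a KeyError.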

zaneselvans commented 2 years ago

I ended up making it possible to encode either a whole dataframe or just a single column at a time, in the event that there are dependencies between columns that need to be cleaned up.

On trying to figure out how to look up the encoding table for a given column, I ended up associating Encoders with both Resource/Table level metadata and Field/Column level metadata. If there's an Encoder attached to a Resource, then it's a coding table. If it's attached to a Field/Column, then it's the Encoder that should be used to encode that field/column.

I'm still confused about the root validator vs. the dependency between two fields in a field-level validator. In the password1 / password2 example it feels like the dependency is clear -- the validity of password2 depends on the value of password1 but not vice-versa. Whereas with the good codes, ignored codes, and fixed/fixable codes, it feels like the dependency is mutual.

But more importantly -- and maybe there's magic going on that I'm not aware of in the background -- if the validation of one value is conditional on the existence of another value in the values then how do you know that all of the values are actually getting validated? Like what are all the scenarios in which you could have some of the values defined and not others? Bad input values? Optional parameters? Etc.

ezwelty commented 2 years ago

Quoting from the pydantic documentation (https://pydantic-docs.helpmanual.io/usage/validators):

where validators rely on other values, you should be aware that:

  • Validation is done in the order fields are defined. E.g. in the example above, password2 has access to password1 (and name), but password1 does not have access to password2. See Field Ordering for more information on how fields are ordered
  • If validation fails on another field (or that field is missing) it will not be included in values, hence if 'password1' in values and ... in this example.

This seems like a reasonable default. If the first field is invalid (e.g. wrong type), there is no sense using it in further validations since that code will likely fail. For example, as written, the Encoder validators assume that df is in values, and thus that it is a pd.DataFrame with columns code and description. So these will crash if df is invalid (just like they would if an invalid df were included in values). All you need to do is if 'df' in values to know that you can proceed to access values['df']['code']...

zaneselvans commented 2 years ago

Okay, I finally made these changes and added docstrings as requested by @bendnorman in #1272. Going to merge it into dev.