Closed zschira closed 4 months ago
### `capacity_mw`

For `capacity_mw` I think there are two kinds of changes that happen, and as a human identifying plant time series, they're different in important ways. Sometimes the capacity of a plant will vary slightly as minor changes or refinements are made, but it's still the same set of generation units. Other times, whole new generation units will be added or old units will be retired, typically resulting in a capacity change that's a significant portion of the overall plant's capacity. When that kind of larger change happens, we should probably be disassociating the two time periods into two different `plant_id_ferc1` values, because the two portions of the time series no longer represent directly comparable "plants".
### `construction_year` & `installation_year`

The `construction_year` and `installation_year` values can indicate similar plant composition changes, since (IIRC) `construction_year` is the year when the oldest still active unit was put in service, while `installation_year` is the year that the newest active unit was put in service. So when these numbers change it should represent a retirement or an addition. Seeing `construction_year` increase while `capacity_mw` decreases would likely indicate a retirement, and seeing `installation_year` increase while `capacity_mw` increases would likely indicate a new addition. Of course both could happen in the same year and make it difficult to understand the net effect, but in any case, that would probably not be a plant we'd consider directly comparable to the previous `report_year`.
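The heuristics above could be sketched roughly as follows. The function name, the record shape, and the "mixed" fallback are all hypothetical illustrations, not anything in the actual matching code:

```python
def classify_change(prev, curr):
    """Guess what a year-over-year change in plant attributes represents.

    `prev` and `curr` are dicts with keys `capacity_mw`,
    `construction_year`, and `installation_year` (hypothetical shape).
    """
    capacity_delta = curr["capacity_mw"] - prev["capacity_mw"]
    older_unit_retired = curr["construction_year"] > prev["construction_year"]
    newer_unit_added = curr["installation_year"] > prev["installation_year"]
    if older_unit_retired and newer_unit_added:
        return "mixed"  # both happened in the same year; net effect is hard to read
    if older_unit_retired and capacity_delta < 0:
        return "retirement"
    if newer_unit_added and capacity_delta > 0:
        return "addition"
    return "unchanged"
```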
### `plant_type` & `construction_type`

Unfortunately, `plant_type` and `construction_type` are kind of garbage piles as they are initially reported -- they're freeform strings that we do our best to categorize as human beings, and they are not very specific, since a large fraction of the records share just a few values like `steam`.
### `utility_id_ferc1`

What's the case for allowing records with different `utility_id_ferc1` values to be categorized within the same `plant_id_ferc1`?
## Purpose

We want to settle on a final validation metric that lets us feel "good enough" about merging the FERC-FERC plant matching refactor (#3137) into PUDL. Because there is no ground truth to test against, we are exploring multiple possible testing strategies to give confidence in the model. The two strategies we are currently considering are generative testing (create fake data where we know which records should be matched and see how the model does) and metamorphic testing (apply matching to a test dataset, mutate that dataset in ways the model should be able to handle, then test again, verifying that the matching results have not changed beyond a certain threshold). Both of these approaches require us to accurately characterize the types of mutations present in the actual data, and to accurately simulate those mutations.
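The metamorphic strategy could be harnessed along these lines. `match_fn`, `mutate_fn`, and the 5% threshold are hypothetical stand-ins for the real matcher, mutation strategies, and the metric we'd actually settle on:

```python
def metamorphic_check(match_fn, records, mutate_fn, threshold=0.05):
    """Metamorphic test: matching results should be stable under
    mutations the model is expected to tolerate.

    `match_fn` maps a list of records to a list of cluster labels;
    `mutate_fn` perturbs a single record. Passes if the fraction of
    records whose label changed stays at or below `threshold`.
    """
    baseline = match_fn(records)
    mutated = [mutate_fn(dict(r)) for r in records]  # copy before mutating
    result = match_fn(mutated)
    changed = sum(a != b for a, b in zip(baseline, result))
    return changed / len(records) <= threshold
```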
@zaneselvans found a number of cases where the model seems to be mismatching records.
Most of these cases involve somewhat minor changes in spelling, or abbreviations in the plant names. This is one type of change that we know to be quite common, and should certainly be testing for.
## Simulating feature columns
For each feature column used in the model, we should roughly characterize how that column varies, and develop a strategy for simulating that variation.
### `plant_name_ferc1`

Currently, the generative approach attempts to simulate plant names by applying random edits to the name. Specifically, for each record it generates, it takes the "nominal" plant name, randomly selects a number of edits between 0 and `k` (`k` is configurable), then randomly selects a type of edit (add/delete/replace a character) and applies that edit. The number of edits is weighted towards 0, so it will sometimes apply `k` edits, but it is more likely to apply fewer. When it adds or replaces a character, it can select a special character or whitespace for the new character, but is more likely to select a letter. These edits seem somewhat in line with what we actually see, but the number of edits and the weighting could be fine-tuned.
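The edit procedure described above might look something like this. The weighting scheme and alphabet here are illustrative guesses, not the values used in the actual generative tests:

```python
import random
import string

def mutate_plant_name(name, k=3, rng=random):
    """Apply 0..k random character edits to a plant name, with the
    number of edits weighted towards 0 and letters favored over
    special characters and whitespace."""
    # Weight the edit count towards 0: P(n) proportional to 1/(n+1).
    weights = [1 / (n + 1) for n in range(k + 1)]
    n_edits = rng.choices(range(k + 1), weights=weights)[0]
    # Letters are 5x more likely than special characters or whitespace.
    alphabet = string.ascii_lowercase * 5 + " -.&"
    for _ in range(n_edits):
        pos = rng.randrange(len(name))
        op = rng.choice(["add", "delete", "replace"])
        if op == "add":
            name = name[:pos] + rng.choice(alphabet) + name[pos:]
        elif op == "delete" and len(name) > 1:
            name = name[:pos] + name[pos + 1:]
        elif op == "replace":
            name = name[:pos] + rng.choice(alphabet) + name[pos + 1:]
    return name
```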
### `plant_type`

The current strategy for simulating plant type in the generative approach is to randomly select a different plant type from the set of categories for about 1% of records. This might not be the best approach, because there are cases where there actually are multiple plant types for plants with the same name, and these should end up in separate clusters. Maybe it would be better to just randomly nullify a small percentage of records.
### `construction_type`

Same strategy as `plant_type`.
### `capacity_mw`

Capacity is simulated by adding random noise to a subset of records. It does seem like there are cases where the capacity varies slightly around a nominal value like this; however, in many cases it is pretty constant, with occasional significant changes when the physical plant actually changes, so the current approach might be insufficient.
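A simulation closer to the observed behavior might combine small jitter with rare step changes. The probabilities and step sizes below are illustrative assumptions, not fitted to the data:

```python
import random

def simulate_capacity(nominal_mw, n_years, rng=random):
    """Simulate a capacity time series: small jitter around a nominal
    value, plus occasional large step changes standing in for unit
    additions or retirements."""
    series = []
    capacity = nominal_mw
    for _ in range(n_years):
        if rng.random() < 0.05:  # rare structural change to the plant
            # Step by a significant fraction of current capacity.
            capacity *= rng.choice([0.5, 0.75, 1.25, 1.5])
        # Minor refinement noise, within roughly 1% of capacity.
        series.append(capacity * rng.uniform(0.99, 1.01))
    return series
```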
### `construction_year`

Same strategy as `plant_type`.
### `utility_id_ferc1`

Same strategy as `plant_type`.
### `fuel_fractions`

To simulate fuel fractions, the generative test creates a random L1 unit vector (all of the fractions add up to 1), then applies random noise to the vector. This is certainly not representative of the actual data, given that plants don't just have a random distribution of fuel types. Possibly a better solution would be to randomly select a primary fuel source, then have some rules for when to select a secondary source.
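The proposed primary/secondary approach could be sketched as below. The fuel set, the 70-100% primary share, and the 50% chance of a secondary fuel are all hypothetical parameters for illustration:

```python
import random

FUEL_TYPES = ["coal", "gas", "oil", "nuclear", "hydro"]  # illustrative set

def simulate_fuel_fractions(rng=random):
    """Pick a primary fuel, sometimes add a secondary fuel, and
    normalize so the fractions sum to 1."""
    primary, secondary = rng.sample(FUEL_TYPES, 2)
    fractions = {fuel: 0.0 for fuel in FUEL_TYPES}
    fractions[primary] = rng.uniform(0.7, 1.0)
    if rng.random() < 0.5:  # only some plants burn a secondary fuel
        fractions[secondary] = rng.uniform(0.0, 0.3)
    total = sum(fractions.values())
    return {fuel: frac / total for fuel, frac in fractions.items()}
```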