catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Identify common mutations in FERC steam plants data #3176

Closed zschira closed 4 months ago

zschira commented 6 months ago

Purpose

We want to settle on a final validation metric so that we can feel "good enough" about merging the FERC-FERC plant matching refactor (#3137) into PUDL. Because there is no ground truth to test against, we are exploring multiple testing strategies to build confidence in the model. The two strategies we are currently considering are generative testing (create fake data where we know which records should be matched and see how the model does) and metamorphic testing (apply matching to a test dataset, mutate that dataset in ways the model should be able to handle, then match again and verify that the results have not changed beyond a certain threshold). Both approaches require us to accurately characterize the types of mutations present in the actual data and to accurately simulate those mutations.
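For concreteness, here is a minimal sketch of what the metamorphic check could look like. It is not the PUDL implementation: `match` and `mutate` are hypothetical callables (records in, `{record_id: cluster_label}` out, and records in, mutated records out), and the 0.9 threshold is a placeholder. Because cluster labels are arbitrary between runs, the sketch compares the sets of record pairs that end up in the same cluster rather than the labels themselves.

```python
from collections import defaultdict
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of record-id pairs that share a cluster label."""
    clusters = defaultdict(list)
    for record_id, label in labels.items():
        clusters[label].append(record_id)
    pairs = set()
    for members in clusters.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pair_agreement(before, after):
    """Jaccard similarity between the matched pairs of two clusterings."""
    a, b = same_cluster_pairs(before), same_cluster_pairs(after)
    if not a and not b:
        return 1.0  # nothing was matched in either run
    return len(a & b) / len(a | b)


def metamorphic_check(records, match, mutate, threshold=0.9):
    """True if matching is stable under mutation (threshold is a placeholder)."""
    before = match(records)
    after = match(mutate(records))
    return pair_agreement(before, after) >= threshold
```

Other agreement measures would work equally well here; pairwise Jaccard is just easy to reason about and does not depend on the label values.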

@zaneselvans found a number of cases where the model seems to be mismatching records:

bad_plants = {
    # APS Yucca 3 misses the plant in 2021-2022 solely due to name change and a blip in fuel fraction splits. Everything else is consistent. Seems too sensitive.
    "yucca3": lambda steam: (steam.plant_id_pudl == 1013),
    # The columbia power station in Wisconsin seems genuinely complicated. It's a mishmash of different units with different owners.
    # Sometimes discontinuous years. Getting multiple records from the same year assigned the same plant_id_ferc1.
    # Sometimes big changes in the capacity within the same plant_id_ferc1.
    "columbia": lambda steam: (steam.plant_id_pudl == 124),
    # APS West Phoenix 5, data for 2021-2022 are not associated with other records due to different name.
    "west_phoenix_5": lambda steam: (steam.plant_name_ferc1.str.contains(r"west phoenix.*5", regex=True, case=False)),
    # sterling avenue plant is very consistent except for construction_type, which is usually null but has values in a couple of years,
    # so in those years it gets split off from all the other years, e.g. in 2003
    "sterling": lambda steam: (steam.plant_name_ferc1.str.contains(r"sterling", regex=True, case=False)),
    # Jeffrey Energy Center is getting split up badly due to minor changes in the name
    "jeffrey": lambda steam: (steam.plant_name_ferc1.str.contains(r"jeffrey", regex=True, case=False)) & (steam.utility_id_ferc1==255),
    "jeffrey_8pct": lambda steam: (steam.plant_name_ferc1.str.contains(r"jeffrey.*8\%", regex=True, case=False)) & (steam.utility_id_ferc1==255),
    # Jeffrey Energy Center NOT getting split up when it should due to large variations in capacity
    "jeffrey_capacity": lambda steam: (steam.plant_name_ferc1.str.contains(r"jeffrey energy cntr", regex=True, case=False)) & (steam.utility_id_ferc1==255),
    # Belews Creek record for 2022 gets lost. The only change is gas/coal fuel ratio
    "belews": lambda steam: (steam.plant_name_ferc1.str.contains(r"belews", regex=True, case=False)),
    # Minor name changes split valmy 1&2 records.
    # NA fuel fractions in 2013 split valmy 1&2 records.
    "valmy12": lambda steam: (steam.plant_name_ferc1.str.contains(r"valmy.*1.*2", regex=True, case=False)),
    # Many cases of more than one record from the same year getting assigned the same plant_id_ferc1
    "valmy_duplicate_years": lambda steam: (steam.plant_name_ferc1.eq(r"valmy")),
    # Minor name change + null fuel fractions split this plant:
    "niles": lambda steam: (steam.plant_name_ferc1.str.contains(r"niles")) & (steam.plant_type.eq("combustion_turbine")),
    # Minor name change and flaky fuel categorization splits the plant
    "manatee": lambda steam: (steam.plant_name_ferc1.str.contains(r"manatee")) & (steam.plant_type.eq("steam")),
    # HB Robinson steam plant records get split due to whitespace change in name, flaky capacity & fuel fraction reporting.
    "hb_robinson_steam": lambda steam: (steam.plant_name_ferc1.str.contains(r"h.*b.*robinson", case=False, regex=True)) & (steam.plant_type.eq("steam")),
}

Most of these cases involve fairly minor changes in the spelling or abbreviation of plant names. This is one type of change that we know to be quite common, and that we should certainly be testing for.

Simulating feature columns

For each feature column used in the model, we should roughly characterize how that column varies, and develop a strategy for simulating that variation.

Currently, the generative approach simulates plant names by applying random edits to the name. Specifically, for each record it generates, it takes the "nominal" plant name, randomly selects a number of edits between 0 and k (k is configurable), then randomly selects a type of edit (add/delete/replace a character) and applies that edit. The number of edits is weighted towards 0, so it will sometimes apply k edits but is more likely to apply fewer. When it adds or replaces a character, it can select a special character or white space for the new character, but is more likely to select a letter. These edits seem somewhat in line with what we actually see, but the number of edits and the weighting could be fine-tuned.
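A rough sketch of the name-mutation procedure described above, using only the standard library. The specifics are assumptions for illustration, not what the test code actually does: the linear weighting towards zero edits, the `SPECIALS` character set, the `p_special` probability, and drawing the edit type independently for each edit.

```python
import random
import string

LETTERS = string.ascii_lowercase
SPECIALS = " -&./()"  # hypothetical set of special characters and white space


def random_char(rng, p_special=0.2):
    """Pick a new character, favouring letters over special characters."""
    pool = SPECIALS if rng.random() < p_special else LETTERS
    return rng.choice(pool)


def mutate_name(name, k=3, rng=None, p_special=0.2):
    """Apply between 0 and k random single-character edits to name."""
    if rng is None:
        rng = random.Random()
    # Weights k+1, k, ..., 1 make zero edits the most likely outcome.
    counts = list(range(k + 1))
    weights = list(range(k + 1, 0, -1))
    n_edits = rng.choices(counts, weights=weights)[0]
    chars = list(name)
    for _ in range(n_edits):
        # Only insertion is possible once the name is empty.
        ops = ["add", "delete", "replace"] if chars else ["add"]
        op = rng.choice(ops)
        if op == "add":
            chars.insert(rng.randrange(len(chars) + 1), random_char(rng, p_special))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = random_char(rng, p_special)
    return "".join(chars)
```

Passing a seeded `random.Random` makes the mutations reproducible, which matters if the same mutated dataset needs to be regenerated between test runs.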

The current strategy for simulating plant type in the generative approach is to randomly select a different plant type from the set of categories for about 1% of records. This might not be the best approach, because there are cases where there actually are multiple plant types for plants with the same name, and these should end up in separate clusters. Maybe it would be better to just randomly nullify a small percentage of records.
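The nullification alternative could be as simple as the sketch below. The function name and the 1% default are placeholders, and `None` stands in for whatever null value the real dataframe uses.

```python
import random


def nullify_fraction(values, fraction=0.01, rng=None):
    """Return a copy of values with roughly `fraction` of entries set to None."""
    if rng is None:
        rng = random.Random()
    return [None if rng.random() < fraction else v for v in values]
```

Unlike swapping in a different category, this never creates a record that looks like a genuinely different plant type, so it should not conflict with cases where same-named plants legitimately belong in separate clusters.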

Same as plant type.

Capacity is simulated by adding random noise to a subset of records. There do seem to be cases where the capacity varies slightly around a nominal value like this. However, in many cases it is pretty constant, with occasional significant changes when the physical plant actually changes, so the current approach might be insufficient.
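One possible way to capture both behaviours is to combine small noise around a nominal level with rare step changes in that level. This is only a sketch: the parameter names and default values (`jitter`, `p_step`, `step_range`) are made up for illustration and would need to be tuned against the real data.

```python
import random


def simulate_capacity(nominal_mw, n_years, jitter=0.01, p_step=0.05,
                      step_range=(0.2, 0.5), rng=None):
    """Simulate a capacity time series with small noise and rare step changes."""
    if rng is None:
        rng = random.Random()
    series = []
    level = nominal_mw
    for _ in range(n_years):
        if rng.random() < p_step:
            # A unit is added or retired: shift the nominal level itself.
            change = rng.uniform(*step_range) * rng.choice([-1, 1])
            level *= 1 + change
        # Small reporting noise around the current nominal level.
        series.append(level * (1 + rng.uniform(-jitter, jitter)))
    return series
```

Since the generator knows in which years a step change occurred, it could also emit the expected cluster boundaries, if we decide that large changes should split a plant into separate IDs.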

Same as plant type.

Same as plant type.

To simulate fuel fractions, the generative test creates a random vector with unit L1 norm (all of the fractions add up to 1), then applies random noise to the vector. This is certainly not representative of the actual data, given that plants don't just have a random distribution of fuel types. Possibly a better solution would be to randomly select a primary fuel source, then have some rules for when to select a secondary source.
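A sketch of the primary/secondary idea follows. The fuel list and the `SECONDARY` rules are invented for illustration and are not taken from the FERC data or from PUDL's fuel categories; the point is only the shape of the approach.

```python
import random

# Hypothetical fuel categories and pairing rules, for illustration only.
FUELS = ["coal", "gas", "oil", "nuclear", "waste"]
SECONDARY = {
    "coal": ["gas", "oil"],
    "gas": ["oil"],
    "oil": ["gas"],
    "nuclear": [],
    "waste": ["gas"],
}


def simulate_fuel_fractions(primary=None, p_secondary=0.3, max_secondary=0.2,
                            rng=None):
    """Return {fuel: fraction} dominated by one primary fuel, summing to 1."""
    if rng is None:
        rng = random.Random()
    if primary is None:
        primary = rng.choice(FUELS)
    fractions = {fuel: 0.0 for fuel in FUELS}
    options = SECONDARY[primary]
    share = 0.0
    if options and rng.random() < p_secondary:
        # Give a small share to one plausible secondary fuel.
        share = rng.uniform(0.0, max_secondary)
        fractions[rng.choice(options)] = share
    fractions[primary] = 1.0 - share
    return fractions
```

Year-to-year noise could then be applied to the secondary share only, which would mimic the kind of "blip in fuel fraction splits" seen in the examples above without changing the primary fuel.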

zaneselvans commented 6 months ago

capacity_mw

For capacity_mw I think there are two kinds of changes that happen, and as a human identifying plant time series, they're different in important ways. Sometimes the capacity of a plant will vary slightly as minor changes or refinements are made, but it's still the same set of generation units. Other times, whole new generation units will be added or old units will be retired, typically resulting in a capacity change that's a significant portion of the overall plant's capacity. When that kind of larger change happens, we should probably be disassociating the two time periods into two different plant_id_ferc1 values, because the two portions of the timeseries no longer represent directly comparable "plants".
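As a sketch of how that distinction might be drawn, the function below starts a new segment whenever capacity jumps by more than some fraction of the previous year's value. The 20% threshold is an arbitrary placeholder, and comparing only consecutive years is an assumption; a slow drift would never trigger a split under this rule.

```python
def split_on_capacity_change(capacities, threshold=0.2):
    """Assign a segment index to each year, starting a new one at large jumps."""
    segments = []
    current = 0
    previous = None
    for cap in capacities:
        if previous is not None and previous > 0:
            # Relative change versus the previous year's reported capacity.
            if abs(cap - previous) / previous > threshold:
                current += 1
        segments.append(current)
        previous = cap
    return segments
```

For example, `split_on_capacity_change([100, 101, 99, 150, 151])` puts the first three years in one segment and the last two in another.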

construction_year & installation_year

The construction_year and installation_year values can indicate similar plant composition changes, since (IIRC) construction_year is the year when the oldest still active unit was put in service, while installation_year is the year that the newest active unit was put in service. So when these numbers change it should represent a retirement or an addition. Seeing construction_year increase while capacity_mw decreases would likely indicate a retirement. And seeing installation_year increase while capacity_mw increases would likely indicate a new addition. Of course both could happen in the same year and make it difficult to understand the net effect, but in any case, that would probably not be a plant we'd consider directly comparable to the previous report_year.
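Those rules of thumb can be written down as a small classifier over two consecutive annual records. This is a sketch that takes the (IIRC) column definitions above at face value; the label strings are made up, and each record is assumed to be a dict with `construction_year`, `installation_year` and `capacity_mw` keys.

```python
def classify_change(prev, curr):
    """Label the change between two consecutive annual records of a plant."""
    # Oldest active unit got newer: an old unit was probably retired.
    older_gone = curr["construction_year"] > prev["construction_year"]
    # Newest active unit got newer: a unit was probably added.
    newer_added = curr["installation_year"] > prev["installation_year"]
    cap_delta = curr["capacity_mw"] - prev["capacity_mw"]
    if older_gone and newer_added:
        return "retirement_and_addition"
    if older_gone and cap_delta < 0:
        return "retirement"
    if newer_added and cap_delta > 0:
        return "addition"
    if older_gone or newer_added:
        return "ambiguous"
    return "no_change"
```

Anything other than `"no_change"` would be a candidate for splitting the time series into separate plant_id_ferc1 values.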

plant_type & construction_type

Unfortunately, plant_type and construction_type are kind of garbage piles as they are initially reported -- they're freeform strings that we do our best to categorize as human beings, and they are not very specific, since a large fraction of the records share just a few values like steam.

utility_id_ferc1

What's the case for allowing records with different utility_id_ferc1 values to be categorized within the same plant_id_ferc1?