Closed zschira closed 4 months ago
### `capacity_mw`

For `capacity_mw` I think there are two kinds of changes that happen, and as a human identifying plant time series, they're different in important ways. Sometimes the capacity of a plant will vary slightly as minor changes or refinements are made, but it's still the same set of generation units. Other times, whole new generation units will be added or old units will be retired, typically resulting in a capacity change that's a significant portion of the overall plant's capacity. When that kind of larger change happens, we should probably be disassociating the two time periods into two different `plant_id_ferc1` values, because the two portions of the time series no longer represent directly comparable "plants".
### `construction_year` & `installation_year`

The `construction_year` and `installation_year` values can indicate similar plant composition changes, since (IIRC) `construction_year` is the year when the oldest still active unit was put in service, while `installation_year` is the year that the newest active unit was put in service. So when these numbers change it should represent a retirement or an addition. Seeing `construction_year` increase while `capacity_mw` decreases would likely indicate a retirement, and seeing `installation_year` increase while `capacity_mw` increases would likely indicate a new addition. Of course both could happen in the same year and make it difficult to understand the net effect, but in any case, that would probably not be a plant we'd consider directly comparable to the previous `report_year`.
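The heuristics above could be sketched roughly as follows. The function name, the record shape, and the "mixed" fallback are all hypothetical illustrations, not anything in the actual matching code:

```python
def classify_change(prev, curr):
    """Guess what a year-over-year change in plant attributes represents.

    `prev` and `curr` are dicts with keys `capacity_mw`,
    `construction_year`, and `installation_year` (hypothetical shape).
    """
    capacity_delta = curr["capacity_mw"] - prev["capacity_mw"]
    older_unit_retired = curr["construction_year"] > prev["construction_year"]
    newer_unit_added = curr["installation_year"] > prev["installation_year"]
    if older_unit_retired and newer_unit_added:
        return "mixed"  # both happened in the same year; net effect is hard to read
    if older_unit_retired and capacity_delta < 0:
        return "retirement"
    if newer_unit_added and capacity_delta > 0:
        return "addition"
    return "unchanged"
```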
### `plant_type` & `construction_type`

Unfortunately, `plant_type` and `construction_type` are kind of garbage piles as they are initially reported -- they're freeform strings that we do our best to categorize as human beings, and they are not very specific, since a large fraction of the records share just a few values like `steam`.
### `utility_id_ferc1`

What's the case for allowing records with different `utility_id_ferc1` values to be categorized within the same `plant_id_ferc1`?
## Purpose

We want to settle on a final validation metric that lets us feel "good enough" about merging the FERC-FERC plant matching refactor (#3137) into PUDL. Because there is no ground truth to test against, we are exploring multiple possible testing strategies to give confidence in the model. The two strategies we are currently considering are generative testing (create fake data where we know which records should be matched and see how the model does) and metamorphic testing (apply matching to a test dataset, mutate that dataset in ways the model should be able to handle, then test again, verifying that the matching results have not changed beyond a certain threshold). Both of these approaches require us to accurately characterize the types of mutations present in the actual data, and to accurately simulate those mutations.
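The metamorphic strategy could be harnessed along these lines. `match_fn`, `mutate_fn`, and the 5% threshold are hypothetical stand-ins for the real matcher, mutation strategies, and the metric we'd actually settle on:

```python
def metamorphic_check(match_fn, records, mutate_fn, threshold=0.05):
    """Metamorphic test: matching results should be stable under
    mutations the model is expected to tolerate.

    `match_fn` maps a list of records to a list of cluster labels;
    `mutate_fn` perturbs a single record. Passes if the fraction of
    records whose label changed stays at or below `threshold`.
    """
    baseline = match_fn(records)
    mutated = [mutate_fn(dict(r)) for r in records]  # copy before mutating
    result = match_fn(mutated)
    changed = sum(a != b for a, b in zip(baseline, result))
    return changed / len(records) <= threshold
```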
@zaneselvans found a number of cases where the model seems to be mismatching records.
Most of these cases involve somewhat minor changes in spelling, or abbreviations in the plant names. This is one type of change that we know to be quite common, and should certainly be testing for.
## Simulating feature columns
For each feature column used in the model, we should roughly characterize how that column varies, and develop a strategy for simulating that variation.
### `plant_name_ferc1`

Currently, the generative approach attempts to simulate plant names by applying random edits to the name. Specifically, for each record it generates, it takes the "nominal" plant name, randomly selects a number of edits between 0 and `k` (`k` is configurable), then randomly selects a type of edit (add/delete/replace a character) and applies that edit. The number of edits is weighted towards 0, so it will sometimes apply `k` edits, but it is more likely to apply fewer. When it adds or replaces a character, it can select a special character or whitespace for the new character, but is more likely to select a letter. These edits seem somewhat in line with what we actually see, but the number of edits and the weighting could be fine-tuned.
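The edit procedure described above might look something like this. The weighting scheme and alphabet here are illustrative guesses, not the values used in the actual generative tests:

```python
import random
import string

def mutate_plant_name(name, k=3, rng=random):
    """Apply 0..k random character edits to a plant name, with the
    number of edits weighted towards 0 and letters favored over
    special characters and whitespace."""
    # Weight the edit count towards 0: P(n) proportional to 1/(n+1).
    weights = [1 / (n + 1) for n in range(k + 1)]
    n_edits = rng.choices(range(k + 1), weights=weights)[0]
    # Letters are 5x more likely than special characters or whitespace.
    alphabet = string.ascii_lowercase * 5 + " -.&"
    for _ in range(n_edits):
        pos = rng.randrange(len(name))
        op = rng.choice(["add", "delete", "replace"])
        if op == "add":
            name = name[:pos] + rng.choice(alphabet) + name[pos:]
        elif op == "delete" and len(name) > 1:
            name = name[:pos] + name[pos + 1:]
        elif op == "replace":
            name = name[:pos] + rng.choice(alphabet) + name[pos + 1:]
    return name
```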
### `plant_type`

The current strategy for simulating plant type in the generative approach is to randomly select a different plant type from the set of categories for about 1% of records. This might not be the best approach, because there are cases where there actually are multiple plant types for plants with the same name, and these should end up in separate clusters. Maybe it would be better to just randomly nullify a small percentage of records.
### `construction_type`

Same strategy as `plant_type`.
### `capacity_mw`

Capacity is simulated by adding random noise to a subset of records. It does seem like there are cases where the capacity varies slightly around a nominal value like this; however, in many cases it is pretty constant, with occasional significant changes when the physical plant actually changes, so the current approach might be insufficient.
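A simulation closer to the observed behavior might combine small jitter with rare step changes. The probabilities and step sizes below are illustrative assumptions, not fitted to the data:

```python
import random

def simulate_capacity(nominal_mw, n_years, rng=random):
    """Simulate a capacity time series: small jitter around a nominal
    value, plus occasional large step changes standing in for unit
    additions or retirements."""
    series = []
    capacity = nominal_mw
    for _ in range(n_years):
        if rng.random() < 0.05:  # rare structural change to the plant
            # Step by a significant fraction of current capacity.
            capacity *= rng.choice([0.5, 0.75, 1.25, 1.5])
        # Minor refinement noise, within roughly 1% of capacity.
        series.append(capacity * rng.uniform(0.99, 1.01))
    return series
```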
### `construction_year`

Same strategy as `plant_type`.
### `utility_id_ferc1`

Same strategy as `plant_type`.
### `fuel_fractions`

To simulate fuel fractions, the generative test creates a random L1 unit vector (all of the fractions add up to 1), then applies random noise to the vector. This is certainly not representative of the actual data, given that plants don't just have a random distribution of fuel types. Possibly a better solution would be to randomly select a primary fuel source, then have some rules for when to select a secondary source.
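The proposed primary/secondary approach could be sketched as below. The fuel set, the 70-100% primary share, and the 50% chance of a secondary fuel are all hypothetical parameters for illustration:

```python
import random

FUEL_TYPES = ["coal", "gas", "oil", "nuclear", "hydro"]  # illustrative set

def simulate_fuel_fractions(rng=random):
    """Pick a primary fuel, sometimes add a secondary fuel, and
    normalize so the fractions sum to 1."""
    primary, secondary = rng.sample(FUEL_TYPES, 2)
    fractions = {fuel: 0.0 for fuel in FUEL_TYPES}
    fractions[primary] = rng.uniform(0.7, 1.0)
    if rng.random() < 0.5:  # only some plants burn a secondary fuel
        fractions[secondary] = rng.uniform(0.0, 0.3)
    total = sum(fractions.values())
    return {fuel: frac / total for fuel, frac in fractions.items()}
```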