catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Reconcile case and non-alphanumeric character differences in EIA generator_id values #984

Open zaneselvans opened 3 years ago

zaneselvans commented 3 years ago

A portion of the generator_id values used across the EIA 923 and EIA 860 datasets are inconsistent with respect to their use of UPPER and lower case letters, meaning that some generators in different years (and potentially different tables) should be associated with each other, but currently are not. It may also be the case that non-alphanumeric characters are having the same impact.

For example, if the following values of generator_id are found in association with the same plant_id_eia should they actually be treated differently? Or would we assume they are all referring to the same generator?

["A1", "a1", "A 1", "A-1"]

It turns out that there are a small number of generator ID values that have all of these problems (differentiated only by whitespace, by other non-alphanumeric characters, or by capitalization). To see what these look like...

def inconsistent_gen_ids(pudl_out):
    """
    Find generator IDs that only differ by case and non-alphanumeric characters.
    """
    gens_eia860 = pudl_out.gens_eia860()
    clean_gen_ids = (
        gens_eia860[["plant_id_eia", "generator_id"]].drop_duplicates()
        .assign(
            clean_gen_id=lambda x: x.generator_id.str.upper().str.replace(r"[^a-zA-Z0-9]", ""),
        )
    )

    dirty_gen_ids = (
        clean_gen_ids.drop_duplicates(subset=["plant_id_eia", "clean_gen_id"], keep=False)
    )
    delta = len(clean_gen_ids) - len(dirty_gen_ids)
    print(f"Found {delta} inconsistently reported generator_id values.")
    return (
        clean_gen_ids[clean_gen_ids.duplicated(subset=["plant_id_eia", "clean_gen_id"], keep=False)]
        .sort_values(["plant_id_eia", "clean_gen_id"])
    )

The obvious fix (akin to things we've done in other columns with inconsistently reported categorical values) is to apply some standardized formatting to the value in the columns. E.g. stripping out all non-alphanumeric characters and translating everything to UPPER CASE.

HOWEVER doing so in this case would make our IDs generally incompatible with the IDs being used by others to refer to the EIA data, even though it would be internally more consistent and easier to deal with. This would be a major inconvenience to users who are trying to merge the PUDL data with other data containing EIA generator IDs.

A less invasive approach here would be to fix only the IDs which have been reported inconsistently, of which there are a few dozen. These IDs are already inconsistent / broken, and if we strip non-alphanumeric characters and UPPER CASE the letters, in most cases we'll be changing an inconsistent value to match another value that was reported. Unfortunately, identifying all of the inconsistently reported values can't be done table by table, it has to happen after the harvesting process has gathered all of the different IDs together, and then the ID translation would need to be applied to any table containing the generator_id column.

But this in turn might introduce duplicate rows, breaking the primary key constraints. So it's not clear what the right solution is.

This may be something we try to get EIA to fix.

MichaelTiemannOSC commented 3 years ago

I'm new around here, but my vote would be to have PUDL use the normalized field in the nicely named table, and to have an OG table that has the OG field that nobody joins to unless they really want to. If EIA fix their data, the OG table quietly disappears.