catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Update defensive assertion for EIA Plant/Util ID mapping #1305

Open zaneselvans opened 2 years ago

zaneselvans commented 2 years ago

Near the end of pudl.glue.ferc1_eia.glue() we have a defensive assertion that checks for NA values in the dataframes containing unmapped EIA and FERC Plants and Utilities. However, with the 2020 ID mapping spreadsheet this assertion fails during the glue step of the ETL:

        # At this point there should be at most one row in each of these data
        # frames with NaN values after we drop_duplicates in each. This is because
        # there will be some plants and utilities that only exist in FERC, or only
        # exist in EIA, and while they will have PUDL IDs, they may not have
        # FERC/EIA info (and it'll get pulled in as NaN)

        for df, df_n in zip(
            [plants_eia, plants_ferc1, utilities_eia, utilities_ferc1],
            ['plants_eia', 'plants_ferc1', 'utilities_eia', 'utilities_ferc1']
        ):
            if df[pd.isnull(df).any(axis=1)].shape[0] > 1:
               raise AssertionError(
                    f"FERC to EIA glue breaking in {df_n}. There are too many null "
                    "fields. Check the mapping spreadhseet.")

# AssertionError: FERC to EIA glue breaking in plants_eia. There are too many null fields. Check the mapping spreadhseet.

I suspect that this is because there are a few pages of EIA plants that have plant_id_eia values but no plant_name_eia. The same problem comes up for utilities. For now this has been changed to a logger warning, and the ETL passes, but we should get an appropriate error check in here.

I also suspect that the lack of plant & utility names is probably due to the way we're dropping lots of harvestable fields before sending the data into the harvesting step (See #509).

Possible solutions

MichaelTiemannOSC commented 2 years ago

See https://github.com/catalyst-cooperative/pudl/issues/509?

zaneselvans commented 2 years ago

Ah yeah you're right #1232 is a closed duplicate. They just changed the "closed" color to purple instead of red so I didn't realize.