catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

glue tests didn't capture all plants #2070

Open cmgosnell opened 1 year ago

cmgosnell commented 1 year ago

Describe the bug

A handful of plants from the hydro table snuck through the cracks of the glue tests and cause a FK failure when loading the full db.

2022-11-15 22:16:58 [    INFO] catalystcoop.pudl.load:70 Loading plants_hydro_ferc1 into PUDL SQLite DB.
2022-11-15 22:17:03 [    INFO] catalystcoop.pudl.load:80 [{'resource': 'plants_hydro_ferc1', 'foreign_key': ForeignKey(fields=['utility_id_ferc1', 'plant_name_ferc1'], reference=ForeignKeyReference(resource='plants_ferc1', fields=['utility_id_ferc1', 'plant_name_ferc1'])), 'invalid':        utility_id_ferc1 plant_name_ferc1
255                  51         williams
260                 168         columbia
271                 212         williams
481                  14
651                 204
1076                208
2046                393
2357                219
6703                252
7553                 38
9463                250
11898               404
15472               294
15773               281                 }]

Bug Severity

How badly is this bug affecting you?

To Reproduce

Remove the last three lines from pudl_id_mapping.xlsx and run pytest test/integration/glue_test.py --live-dbs --save-unmapped-ids on dbf-xbrl-mapping-dupes (or xbrl_integration once #2067 is merged). This should result in the failure of that test, but right now it passes.

cmgosnell commented 1 year ago

i figured out why this was happening... in the pudl.glue.ferc1_eia.get_plants_ferc1_raw, I was grabbing only the most recent year of data, as oppose to sorting the data by year and then dropping duplicates. I already fixed this in the eia plant-getter so I'm going to implement the same solution

Current code:

# grab the most recent plant record
    most_recent_year = max(all_plants.report_year)
    all_plants = (
        all_plants
        .loc[
            (all_plants.report_year == most_recent_year),
            [
                "utility_id_ferc1",
                "utility_name_ferc1",
                "plant_name_ferc1",
                "utility_id_ferc1_dbf",
                "utility_id_ferc1_xbrl",
                "capacity_mw",
                "report_year",
                "plant_table",
            ],
        ]
        .drop_duplicates(["utility_id_ferc1", "plant_name_ferc1"])
        .sort_values(["utility_id_ferc1", "plant_name_ferc1"])
    )
zaneselvans commented 1 year ago

@cmgosnell this was successfully fixed right? Should we close this issue?