catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Inconsistencies in the gens_eia860 output in the utility_id_eia column #1033

Closed cmgosnell closed 3 years ago

cmgosnell commented 3 years ago

Describe the bug

I found a number of generator records that are showing up in an annual pudl output object that have nulls in the utility_id_eia column. These should all be consistent because this column is a harvested column that is stored in the plants_eia860 table.

I have not investigated whether it is inconsistent in that table, but I suspect this is a merge issue in the output layer.

Bug Severity

How badly is this bug affecting you?

To Reproduce

Here is a code snippet I used to visualize the issue:

import pudl
import sqlachemy as sa

# get a pudl db engine
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
# make the pudl output object
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq='AS',)

idx_plant = ['plant_id_eia', 'report_date']
gens = pudl_out.gens_eia860()
test = gens.drop_duplicates(subset=idx_plant + ['utility_id_eia'])
test[test.duplicated(subset=idx_plant)].groupby(['report_date']).nunique()

this is my output from running this snippet in a notebook:

image

and extending off of the previous snippet:

test[test.duplicated(subset=idx_plant, keep=False)]
image
zaneselvans commented 3 years ago

Can you define what idx_plant is explicitly in your snippet?

It looks like utility_id_eia is stored directly in the generators_eia860 table in the PUDL DB as well as being stored in the plants_eia860 table which is interesting and maybe wrong? In generators_eia860 there are 1836 records (out of about 400,000) in which utility_id_eia is NA, 1778 all of which are coming from 2010. These rows seem to have NA values in most of their columns, as do the NA utility ID rows in the plants_eia860 table.

I wonder if this might be a bug in the entity resolution / normalization process, where we haven't been entirely explicit about what the primary keys are for these tables (as we found in the harvesting re-write with Ethan)?

cmgosnell commented 3 years ago

This is related to #1116, but more thoroughly described over there