catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Remove utility_id_eia from generators_eia860 & ownership_eia860 #1266

Open zaneselvans opened 2 years ago

zaneselvans commented 2 years ago

It turns out that the value of utility_id_eia associated with an EIA generator is entirely determined by which plant the generator is part of, i.e. within a given report year, no plant has more than one operator:

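import pandas as pd
# pudl_engine is assumed to be an existing SQLAlchemy engine connected to the PUDL database.
gens_eia860 = pd.read_sql("generators_eia860", pudl_engine)
# Count the distinct utility IDs in each plant-year, then tally how often each count occurs.
gens_eia860.groupby(["report_date", "plant_id_eia"]).utility_id_eia.nunique().value_counts()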
1    155554
0       667
Name: utility_id_eia, dtype: int64
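
A note on the 0 row in that output: nunique() skips missing values by default, so a plant-year in which every generator has a null utility_id_eia is counted as having zero distinct IDs. That is presumably what the 667 cases are, though I haven't checked them individually. Any count above 1 would have meant a plant-year with more than one operator. Here is a minimal sketch of the same check on made-up data (all IDs and dates below are invented for illustration, not taken from PUDL):

```python
import pandas as pd

# Toy stand-in for generators_eia860; every value here is invented.
gens = pd.DataFrame(
    {
        "report_date": pd.to_datetime(["2020-01-01"] * 5),
        "plant_id_eia": [1, 1, 2, 2, 3],
        "generator_id": ["a", "b", "a", "b", "a"],
        # Plant 3 has no utility ID at all, so it is stored as NaN.
        "utility_id_eia": [10, 10, 20, 20, None],
    }
)

# Distinct utility IDs per plant-year; NaN is ignored by nunique() by default.
ids_per_plant_year = gens.groupby(["report_date", "plant_id_eia"])[
    "utility_id_eia"
].nunique()

# Tally how many plant-years have 0, 1, 2, ... distinct utility IDs.
counts = ids_per_plant_year.value_counts()
print(counts)
```

In this toy table, plants 1 and 2 each have one utility and plant 3 has none, so the tally contains only the values 1 and 0.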

This means that keeping the utility_id_eia column in the generators_eia860 and ownership_eia860 tables is redundant and poorly normalized. It should be dropped during the harvesting process and removed from the table definitions. However, this isn't currently possible within the old harvesting process, so it will need to wait until after we switch to the new one.
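
To make the normalization argument concrete, here is a rough sketch of what dropping the column costs, which is nothing, as long as some plant-level table keeps one utility ID per plant-year. The plant_utils frame below is a hypothetical stand-in for such a table, and all values are invented. This is not the harvesting code, just an illustration of why the column is recoverable:

```python
import pandas as pd

# Toy generator table that still carries the redundant column; values are invented.
gens = pd.DataFrame(
    {
        "report_date": pd.to_datetime(["2020-01-01"] * 4),
        "plant_id_eia": [1, 1, 2, 2],
        "generator_id": ["a", "b", "a", "b"],
        "utility_id_eia": [10, 10, 20, 20],
    }
)

key = ["report_date", "plant_id_eia"]

# Hypothetical plant-level table: exactly one utility ID per plant-year.
plant_utils = gens[key + ["utility_id_eia"]].drop_duplicates().reset_index(drop=True)

# The normalized generator table no longer stores the utility ID.
gens_normalized = gens.drop(columns=["utility_id_eia"])

# Recover the column with a many-to-one merge. validate= makes pandas raise
# a MergeError if plant_utils ever has more than one row per plant-year.
gens_restored = gens_normalized.merge(
    plant_utils, on=key, how="left", validate="many_to_one"
)
```

The validate="many_to_one" argument doubles as a guard on the assumption in this issue: if a plant-year ever showed up with two operators, the merge would fail loudly instead of silently duplicating generator rows.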

zaneselvans commented 1 year ago

@knordback This issue is adjacent to the changes you made recently on the #509 branch.