catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Retain all harvestable fields during EIA transforms #509

Open zaneselvans opened 4 years ago

zaneselvans commented 4 years ago

In many of our older EIA transformation functions, we preemptively drop columns from the tables that are being processed, in order to produce normalized tables. However, many of these columns contain information about the entities (plants, generators, utilities) that should be integrated into the entity harvesting and resolution process, which happens after the transform step.

Discarded Columns

EIA-860

pudl.transform.eia860.ownership()

pudl.transform.eia860.generators()

pudl.transform.eia860.plants()

pudl.transform.eia860.utilities()

EIA-923

pudl.transform.eia923.plants()

pudl.transform.eia923.generation_fuel()

pudl.transform.eia923.boiler_fuel()

This one may give you trouble. See #1847 and #1836.

pudl.transform.eia923.generation()

pudl.transform.eia923.coalmine()

pudl.transform.eia923.fuel_receipts_costs()

zaneselvans commented 2 years ago

@cmgosnell and I are going to help get @knordback working on this issue as a way to become more familiar with the harvesting process, working with our code, Jupyter, etc.

zaneselvans commented 1 year ago

@cmgosnell while talking over some of these fields with @knordback yesterday, I noticed that the associated_combined_heat_power field is part of the generators_entity_eia table, but there's another combined_heat_power field being reported in e.g. the generation_fuel_eia923 table, and looking at the spreadsheets, it seems like that field pertains to the plant (which makes some sense given that generation_fuel_eia923 is reported on a date, plant, prime-mover, fuel basis).

Are these different attributes? Should there be a CHP field at both the generator and the plant level? Should this really be a permanent attribute, or is it another one that changes slowly? Does the generator field really just indicate that the generator is part of a plant that does CHP? Or that it's part of a generation unit that does CHP? Could the plant or plant-prime-fuel level CHP status be inferred from the generator-level CHP attributes?

Right now we're discarding the CHP column reported in generation_fuel_eia923.

@grgmiller or @gschivley do either of you have more context on the relationship between these two different CHP fields?

cmgosnell commented 1 year ago

I don't know exactly. associated_combined_heat_power originates in the generator table. I would not be surprised if there were plants that had some units contributing to CHP and some that just generated power. I don't think it's generally a good idea to base any logic about the workings of a plant on the reporting structure of the generation_fuel_eia923 table. I personally would check whether this value is actually consistent across all generators within a plant before thinking about moving it. But I could also definitely imagine this changing over time (albeit very rarely!).
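A rough sketch of what that within-plant consistency check could look like in pandas. This is not existing PUDL code: the function name is hypothetical, and the `plant_id_eia` and `generator_id` column names and the toy dataframe are assumptions made for illustration.

```python
import pandas as pd


def chp_consistency_by_plant(
    gens: pd.DataFrame,
    col: str = "associated_combined_heat_power",
) -> pd.DataFrame:
    """Count distinct non-null values of ``col`` among each plant's generators.

    Hypothetical helper; assumes ``gens`` has a ``plant_id_eia`` column.
    """
    n_values = (
        gens.dropna(subset=[col])
        .groupby("plant_id_eia")[col]
        .nunique()
        .rename("n_distinct")
        .reset_index()
    )
    # A plant is "consistent" if all of its generators report the same value.
    n_values["consistent"] = n_values["n_distinct"] <= 1
    return n_values


# Toy data (made up): plant 2 has one CHP generator and one non-CHP generator.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2, 3],
        "generator_id": ["a", "b", "a", "b", "a"],
        "associated_combined_heat_power": [True, True, True, False, False],
    }
)
result = chp_consistency_by_plant(gens)
mixed_plants = result.loc[~result["consistent"], "plant_id_eia"].tolist()
```

If `mixed_plants` comes back non-empty on the real generator data, that would suggest the attribute can't simply be moved up to the plant level.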

zaneselvans commented 1 year ago

It seems like we should probably do an exhaustive check of all the currently "permanent" generator attributes on the pre-harvested dataframes... and see how permanent they actually are.
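Something like the following might be a starting point for that check. It's only a sketch under assumptions: `permanence_report` is a hypothetical helper, not part of PUDL, and the ID and attribute column names in the toy data are illustrative stand-ins for whatever is actually in the pre-harvested dataframes.

```python
import pandas as pd


def permanence_report(
    df: pd.DataFrame,
    id_cols: list[str],
    attr_cols: list[str],
) -> pd.DataFrame:
    """For each attribute, report the share of entities whose value ever changes.

    Hypothetical helper. ``df`` is assumed to have one row per entity per
    reporting period, with the entity identified by ``id_cols``.
    """
    grouped = df.groupby(id_cols)
    rows = []
    for col in attr_cols:
        # nunique() ignores nulls, so unreported values don't count as a change.
        n_distinct = grouped[col].nunique()
        reported = n_distinct[n_distinct > 0]
        frac = float((reported > 1).mean()) if len(reported) else float("nan")
        rows.append(
            {"column": col, "n_entities": len(reported), "frac_changing": frac}
        )
    return pd.DataFrame(rows).set_index("column")


# Toy data (made up): generator (1, "b") flips its CHP flag between years.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1],
        "generator_id": ["a", "a", "b", "b"],
        "report_date": pd.to_datetime(
            ["2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01"]
        ),
        "associated_combined_heat_power": [True, True, True, False],
        "prime_mover_code": ["ST", "ST", "GT", "GT"],
    }
)
report = permanence_report(
    gens,
    id_cols=["plant_id_eia", "generator_id"],
    attr_cols=["associated_combined_heat_power", "prime_mover_code"],
)
```

Sorting the report by `frac_changing` would give a quick ranking of which supposedly permanent attributes deserve a closer look.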

grgmiller commented 1 year ago

I do not have any context on these two fields.

knordback commented 1 year ago

I'll hold off on this one for now.

knordback commented 1 year ago

I think this is mostly done. Based on notes above I left in code dropping some of the fields in clean_generation_fuel_eia923() and clean_fuel_receipts_costs_eia923(), but I'm not certain I'm interpreting the notes correctly. There's also implicit dropping in plants_eia923(), and I don't know if that's as desired or not.