catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Stop dropping columns that should be harvested during transform #1232

Closed by zaneselvans 3 years ago

zaneselvans commented 3 years ago

In many of our older EIA transformation functions, we pre-emptively drop columns from the tables being processed in order to produce normalized tables. However, many of these columns contain information about the entities (plants, generators, utilities) that should feed into the entity harvesting and resolution process, which happens after the transform step. For example, in pudl.transform.eia923.generation_fuel() we have:

# This needs to be a copy of what we're passed in so we can edit it.
gf_df = eia923_dfs['generation_fuel'].copy()

# Drop fields we're not inserting into the generation_fuel_eia923 table.
cols_to_drop = ['combined_heat_power',
                'plant_name_eia',
                'operator_name',
                'operator_id',
                'plant_state',
                'census_region',
                'nerc_region',
                'naics_code',
                'eia_sector',
                'sector_name',
                'fuel_unit',
                'total_fuel_consumption_quantity',
                'electric_fuel_consumption_quantity',
                'total_fuel_consumption_mmbtu',
                'elec_fuel_consumption_mmbtu',
                'net_generation_megawatthours']
gf_df.drop(cols_to_drop, axis=1, inplace=True)

To get the broadest view of which attributes are associated with the various entities, we should probably retain many of these columns and make sure they are assigned their canonical names, so they can be used as input to the harvesting process.
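As a rough sketch of what that could look like, the drop list from generation_fuel() can be split into columns that really are unwanted and columns that describe an entity and should be kept under a canonical name. Everything named here (ENTITY_COLS, CANONICAL_NAMES, plan_columns) is hypothetical and only for illustration. In particular, which columns count as entity attributes and what their canonical names are would need to be checked against the actual harvesting code.

```python
# The drop list currently used in generation_fuel(), copied from above.
COLS_TO_DROP = [
    "combined_heat_power", "plant_name_eia", "operator_name", "operator_id",
    "plant_state", "census_region", "nerc_region", "naics_code",
    "eia_sector", "sector_name", "fuel_unit",
    "total_fuel_consumption_quantity", "electric_fuel_consumption_quantity",
    "total_fuel_consumption_mmbtu", "elec_fuel_consumption_mmbtu",
    "net_generation_megawatthours",
]

# ASSUMPTION: columns that describe a plant or utility rather than a
# generation_fuel record. This selection is a guess for illustration.
ENTITY_COLS = {
    "combined_heat_power", "plant_name_eia", "operator_name", "operator_id",
    "plant_state", "census_region", "nerc_region", "naics_code",
    "eia_sector", "sector_name",
}

# ASSUMPTION: raw name -> canonical name. The target names are placeholders.
CANONICAL_NAMES = {
    "operator_name": "utility_name_eia",
    "operator_id": "utility_id_eia",
    "plant_state": "state",
}


def plan_columns(cols_to_drop, entity_cols, canonical_names):
    """Split a drop list into (columns to drop, {retained column: new name})."""
    # Columns that are not entity attributes are still dropped.
    still_drop = [c for c in cols_to_drop if c not in entity_cols]
    # Entity attributes are retained; unmapped names keep their current name.
    retain = {
        c: canonical_names.get(c, c) for c in cols_to_drop if c in entity_cols
    }
    return still_drop, retain


still_drop, retain = plan_columns(COLS_TO_DROP, ENTITY_COLS, CANONICAL_NAMES)
```

With a pandas DataFrame, the result could then be applied as gf_df.drop(columns=still_drop).rename(columns=retain), which returns a new frame instead of modifying gf_df in place.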

cmgosnell commented 3 years ago

I think this is a dupe of #509.

zaneselvans commented 3 years ago

Whoops. Totally is a dupe.