catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

investigate eia860 column maps #763

Closed: cmgosnell closed 2 years ago

cmgosnell commented 3 years ago

While running the ETL with all years (2008-2018), I'm getting these warnings that the columns in the extracted data don't match the column maps:

2020-09-25 10:09:14 [ WARNING] pudl.extract.excel:222 Columns for boiler_generator_assn are off: should be 4 but got 9
2020-09-25 10:12:30 [ WARNING] pudl.extract.excel:222 Columns for generator_existing are off: should be 76 but got 77
2020-09-25 10:15:18 [ WARNING] pudl.extract.excel:222 Columns for generator_proposed are off: should be 55 but got 56
2020-09-25 10:17:59 [ WARNING] pudl.extract.excel:222 Columns for generator_retired are off: should be 75 but got 76
2020-09-25 10:18:07 [ WARNING] pudl.extract.excel:222 Columns for ownership are off: should be 14 but got 15
2020-09-25 10:18:44 [ WARNING] pudl.extract.excel:222 Columns for plant are off: should be 46 but got 48
2020-09-25 10:18:51 [ WARNING] pudl.extract.excel:222 Columns for utility are off: should be 20 but got 21

Note: this is not necessarily a failure. I incorporated these warnings into the extract step mostly for 861, because there we are mapping columns by location instead of by string.
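The kind of check behind warnings like these can be sketched as a comparison between the number of columns actually extracted and the number the column map expects. This is only an illustration, not the code in pudl.extract.excel: the function name `check_column_count`, its parameters, and the logger name are all assumptions made up for the example.

```python
import logging
from typing import Sequence

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("column_check_sketch")


def check_column_count(page: str, extracted_columns: Sequence[str], expected_count: int) -> bool:
    """Return True if the extracted column count matches the expected count.

    On a mismatch, log a warning shaped like the ones above and return False.
    A mismatch is a signal to investigate, not necessarily a failure.
    """
    actual_count = len(extracted_columns)
    if actual_count != expected_count:
        logger.warning(
            "Columns for %s are off: should be %d but got %d",
            page,
            expected_count,
            actual_count,
        )
        return False
    return True
```

A count check like this is cheap, but it only says that something is off, not which columns are unmapped.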

I did a quick investigation of the boiler_generator_assn table. The non-mapped columns were a Steam Plant Type, which is a code from the most recent years that doesn't seem super useful on the face of it, and an old column which is basically an observed or "theoretical" boolean.

Steam plant type description from 860:
1 = Plants with combustible-fueled steam-electric generators with a sum of 100 MW or more steam-electric nameplate capacity (including combined cycle steam-electric generators with duct firing).
2 = Plants with combustible-fueled steam-electric generators with a sum of 10 MW or more but less than 100 MW steam-electric nameplate capacity (including combined cycle steam-electric generators with duct firing).
3 = Plants with nuclear fueled generators, combined cycle steam-electric generators without duct firing and solar thermal electric generators using a steam cycle with a sum of 100 MW or more steam-electric nameplate capacity.
4 = Plants with non-steam fueled electric generators (wind, PV, geothermal, fuel cell, combustion turbines, IC engines, etc.) and electric generators not meeting conditions of categories above.

zaneselvans commented 3 years ago

I think we should probably go ahead and map all the columns even if they seem not particularly useful, since we don't really understand all the possible use cases, and we're trying to provide programmatic access to the underlying dataset... whatever it was. It might be a little tedious but it shouldn't be too difficult, should it?

zaneselvans commented 2 years ago

@stevenbwinter @swinter2011 here's an issue related to the unmapped columns...

cmgosnell commented 2 years ago

This feels very, very separable from getting the ER (early release) data integrated, which is a high priority, so I'd propose we wait. Doing this in tandem would entangle the two tasks and make the higher-priority task slower to accomplish. We can and should put this on the docket right after the early release stuff gets integrated.

cmgosnell commented 2 years ago

Also, here is the new list of columns from the latest ETL readout:

2022-08-04 12:49:16 [ WARNING] pudl.extract.excel:260 Extra columns found in page boiler_generator_assn: {'generator_association', 'plant_name', 'steam_plant_type', 'utility_name'}
2022-08-04 12:49:36 [ WARNING] pudl.extract.excel:260 Extra columns found in page generator: {'fercdock', 'winter_capacity', 'summer_capacity', 'fercother', 'fercewgdoc', 'planned_derates_net_summer_cap', 'ferccogen'}
2022-08-04 12:51:04 [ WARNING] pudl.extract.excel:260 Extra columns found in page generator_existing: {'planned_energy_source_1'}
2022-08-04 12:51:10 [ WARNING] pudl.extract.excel:260 Extra columns found in page generator_proposed: {'winter_estimated_capacity', 'winter_capacity', 'summer_capacity', 'summer_estimated_capacity'}
2022-08-04 12:51:56 [ WARNING] pudl.extract.excel:260 Extra columns found in page plant: {'ferc_exempt_wholesale_generator_docket_number', 'ownertransdist'}
2022-08-04 12:52:04 [ WARNING] pudl.extract.excel:260 Extra columns found in page utility: {'areacode'}
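Warnings in this newer form name the unmapped columns instead of only counting them. A minimal sketch of that idea is a set difference between the extracted column names and the names the column map knows about. It is not the actual pudl.extract.excel implementation: `find_extra_columns` is a hypothetical helper, and the column map in the usage example is a made-up placeholder, not the real EIA 860 map.

```python
from typing import Iterable, Mapping, Set


def find_extra_columns(extracted_columns: Iterable[str], column_map: Mapping[str, str]) -> Set[str]:
    """Return the extracted column names that have no entry in the column map.

    column_map is assumed to map raw spreadsheet column names to
    standardized names; only its keys matter for this check.
    """
    return set(extracted_columns) - set(column_map)


# Usage with a placeholder map (hypothetical, for illustration only).
example_map = {
    "plant_id": "plant_id_eia",
    "boiler_id": "boiler_id",
    "generator_id": "generator_id",
}
example_extracted = ["plant_id", "boiler_id", "generator_id", "steam_plant_type", "utility_name"]

extras = find_extra_columns(example_extracted, example_map)
print(sorted(extras))  # ['steam_plant_type', 'utility_name']
```

Reporting names rather than counts makes it clear exactly which columns still need a mapping entry.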