catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

gen_eia923() outputs have many fewer records with fill_net_gen=True #861

Closed: zaneselvans closed this issue 2 years ago

zaneselvans commented 3 years ago

When we aggregate net generation by month without filling in any missing data, we get about 476k records, of which about 451k have non-null net_generation_mwh values:

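import pudl.output.pudltabl

# pudl_engine is assumed to be an existing SQLAlchemy engine connected to the PUDL database.
pudl_out_ms = pudl.output.pudltabl.PudlTabl(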
    pudl_engine=pudl_engine,
    freq='MS',
    fill_fuel_cost=False,
    roll_fuel_cost=False,
    fill_net_gen=False,
)
gen_eia923_ms = pudl_out_ms.gen_eia923()
gen_eia923_ms.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 476052 entries, 0 to 476051
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   report_date         476052 non-null  datetime64[ns]
 1   plant_id_eia        476052 non-null  Int64         
 2   plant_id_pudl       476052 non-null  Int64         
 3   plant_name_eia      476052 non-null  object        
 4   utility_id_eia      476052 non-null  Int64         
 5   utility_id_pudl     476052 non-null  Int64         
 6   utility_name_eia    476052 non-null  object        
 7   generator_id        476052 non-null  object        
 8   net_generation_mwh  451368 non-null  float64       
dtypes: Int64(4), datetime64[ns](1), float64(1), object(3)
memory usage: 38.1+ MB

However, when we aggregate net generation by month and also try to fill in generator-level net_generation_mwh using the plant/fuel/prime mover based net generation data from the generation_fuel_eia923 table, we end up with only about 270k records, of which about 226k have non-null net_generation_mwh values:

pudl_out_fill = pudl.output.pudltabl.PudlTabl(
    pudl_engine=pudl_engine,
    freq='MS',
    fill_fuel_cost=True,
    roll_fuel_cost=True,
    fill_net_gen=True,
)
gen_eia923_filled = pudl_out_fill.gen_eia923()
gen_eia923_filled.info()
RangeIndex: 270609 entries, 0 to 270608
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   plant_id_eia         270609 non-null  int64         
 1   generator_id         270609 non-null  object        
 2   report_date          270609 non-null  datetime64[ns]
 3   net_generation_mwh   226256 non-null  float64       
 4   fuel_consumed_mmbtu  226256 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 10.3+ MB
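
To put the two summaries side by side, here is a small arithmetic sketch. The four counts are copied from the `info()` output above; the variable names are only illustrative.

```python
# Counts copied from the two info() summaries above.
unfilled_total, unfilled_non_null = 476_052, 451_368  # fill_net_gen=False
filled_total, filled_non_null = 270_609, 226_256      # fill_net_gen=True

# How many generator-month records disappear when filling is turned on.
records_lost = unfilled_total - filled_total
share_lost = records_lost / unfilled_total

# Fraction of net_generation_mwh values that are null in each output.
unfilled_null_share = 1 - unfilled_non_null / unfilled_total
filled_null_share = 1 - filled_non_null / filled_total

print(f"records lost: {records_lost} ({share_lost:.1%})")
print(f"null share without fill: {unfilled_null_share:.1%}")
print(f"null share with fill:    {filled_null_share:.1%}")
```

That works out to 205,443 fewer records (about 43%), with the null share rising from about 5% to about 16%.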

So somehow we're both losing a large number of per-month, per-generator records that are apparently available, and ending up with a larger proportion of NA values among the records that remain, which seems surprising.
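
One way to start narrowing this down might be an anti-join on the generator-month key, to list which records exist in the unfilled output but not in the filled one. The sketch below is not part of PUDL: `find_dropped_keys` and `KEY_COLS` are hypothetical names, and the toy frames only stand in for `gen_eia923_ms` and `gen_eia923_filled`. It assumes (plant_id_eia, generator_id, report_date) identifies a record in both outputs.

```python
import pandas as pd

# Assumed record key: one row per generator per month.
KEY_COLS = ["plant_id_eia", "generator_id", "report_date"]


def find_dropped_keys(unfilled: pd.DataFrame, filled: pd.DataFrame) -> pd.DataFrame:
    """Return the keys that appear in `unfilled` but not in `filled`."""
    left = unfilled[KEY_COLS].drop_duplicates()
    right = filled[KEY_COLS].drop_duplicates()
    # The two outputs above use different integer dtypes (Int64 vs int64),
    # so align them before merging. This assumes the IDs contain no nulls.
    left = left.astype({"plant_id_eia": "int64"})
    right = right.astype({"plant_id_eia": "int64"})
    # indicator=True adds a _merge column that marks rows found only on the left.
    merged = left.merge(right, on=KEY_COLS, how="left", indicator=True)
    only_left = merged["_merge"] == "left_only"
    return merged.loc[only_left, KEY_COLS].reset_index(drop=True)


# Toy stand-ins for the two outputs (made-up values, not real EIA data).
unfilled = pd.DataFrame({
    "plant_id_eia": pd.array([1, 1, 2], dtype="Int64"),
    "generator_id": ["A", "B", "A"],
    "report_date": pd.to_datetime(["2020-01-01"] * 3),
    "net_generation_mwh": [10.0, None, 5.0],
})
filled = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "generator_id": ["A", "A"],
    "report_date": pd.to_datetime(["2020-01-01"] * 2),
    "net_generation_mwh": [10.0, 5.0],
})

dropped = find_dropped_keys(unfilled, filled)
print(dropped)
```

With the real outputs this would be called as `find_dropped_keys(gen_eia923_ms, gen_eia923_filled)`. Grouping the result by plant or by report year could then show whether the missing records cluster in particular plants or periods.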

cmgosnell commented 2 years ago

done