catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Investigate lack of monthly year-to-date data in out_eia923__monthly_generation_fuel_by_generator table #3634

Open zaneselvans opened 1 month ago

zaneselvans commented 1 month ago

In #3625 it seemed odd that no 2023 data showed up in the out_eia923__monthly_generation_fuel_by_generator table, even though the EIA-923 has 11 months of 2023 incremental_ytd records:

import pandas as pd

# pudl_engine is assumed to be a SQLAlchemy engine connected to the PUDL database.
gen_eia923_ms = pd.read_sql("out_eia923__monthly_generation", pudl_engine)
gen_eia923_ys = pd.read_sql("out_eia923__yearly_generation", pudl_engine)
gf_by_gen_eia923_ms = pd.read_sql("out_eia923__monthly_generation_fuel_by_generator", pudl_engine)
gf_by_gen_eia923_ys = pd.read_sql("out_eia923__yearly_generation_fuel_by_generator", pudl_engine)
frc_eia923 = pd.read_sql("out_eia923__monthly_fuel_receipts_costs", pudl_engine)

print(f"gen MS: {gen_eia923_ms.report_date.max()}")
print(f"gen YS: {gen_eia923_ys.report_date.max()}")
print(f"gen fuel by gen MS: {gf_by_gen_eia923_ms.report_date.max()}")
print(f"gen fuel by gen YS: {gf_by_gen_eia923_ys.report_date.max()}")
print(f"frc MS: {frc_eia923.report_date.max()}")

# gen MS: 2024-12-01 00:00:00
# gen YS: 2023-01-01 00:00:00
# gen fuel by gen MS: 2022-12-01 00:00:00
# gen fuel by gen YS: 2022-01-01 00:00:00
# frc MS: 2024-02-01 00:00:00

This seems a little bit fishy. We use pudl.output.eia923.drop_ytd_for_annual_tables() to avoid "annual" aggregations for years where we don't have a whole year of data, but here it seems like we're also somehow excluding monthly year-to-date records, which I don't think is intentional. And drop_ytd_for_annual_tables() doesn't get called when freq == "MS", so that shouldn't be what's removing them.
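To make the expected behavior concrete, here is a minimal sketch of a frequency-gated filter. It is not the actual body of drop_ytd_for_annual_tables(); the function name drop_ytd_sketch and the toy data are hypothetical, and the only thing taken from above is that year-to-date records are marked incremental_ytd (assumed here to live in a data_maturity column).

```python
import pandas as pd


def drop_ytd_sketch(df: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Hypothetical stand-in, NOT the PUDL implementation.

    Drops year-to-date records only when aggregating annually.
    """
    if freq == "YS":
        return df[df["data_maturity"] != "incremental_ytd"]
    # Monthly output is returned untouched, YTD records included.
    return df


# Toy records: one finalized month and two year-to-date months.
records = pd.DataFrame(
    {
        "report_date": pd.to_datetime(["2022-12-01", "2023-01-01", "2023-02-01"]),
        "data_maturity": ["final", "incremental_ytd", "incremental_ytd"],
    }
)

monthly = drop_ytd_sketch(records, freq="MS")
annual = drop_ytd_sketch(records, freq="YS")
print(len(monthly), len(annual))  # 3 1
```

If the real function is gated the same way, the monthly table should keep its 2023 rows, which suggests the truncation comes from somewhere else in the pipeline.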

Investigate why this truncation is happening, and evaluate whether that's the expected / desired behavior.

Possible explanation

The out_eia923__monthly_generation_fuel_by_generator table depends on the fuel and generation allocation process, which in turn depends on the boiler generator association (BGA) table. That table is only available from the annual EIA-860, not the monthly EIA-860M data. So it makes sense that we currently don't have the allocated generation and fuel table for periods covered only by EIA-860M data.
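A toy illustration of that dependency, assuming the allocation effectively joins monthly records to the annual associations by plant, generator, and report year. The column names and the inner join are assumptions made for the sketch, not a description of the actual allocation code.

```python
import pandas as pd

# Hypothetical monthly generation records spanning 2022 and 2023.
gen = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1],
        "generator_id": ["G1", "G1", "G1", "G1"],
        "report_date": pd.to_datetime(
            ["2022-11-01", "2022-12-01", "2023-01-01", "2023-02-01"]
        ),
        "net_generation_mwh": [10.0, 12.0, 11.0, 9.0],
    }
)

# Hypothetical annual associations: only 2022 exists, because the
# annual EIA-860 for 2023 has not been published yet.
bga = pd.DataFrame(
    {
        "plant_id_eia": [1],
        "generator_id": ["G1"],
        "boiler_id": ["B1"],
        "report_year": [2022],
    }
)

gen["report_year"] = gen["report_date"].dt.year

# An inner join keeps only the months whose year has an association.
allocated = gen.merge(
    bga, on=["plant_id_eia", "generator_id", "report_year"], how="inner"
)
print(allocated["report_date"].max())  # 2022-12-01 00:00:00
```

The 2023 months drop out of the joined result even though the monthly input contains them, which matches the pattern in the max report dates above.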

If we wanted to hack it to get some estimate of the most recent allocated data, we could just forward fill the BGA table up to the most recent year. That would be mostly right, since these associations don't really change unless there's a major overhaul to a plant. But we're not doing that now.
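A rough sketch of what that forward fill could look like, assuming the BGA table carries a report_year column. The helper ffill_bga and the column names are hypothetical, and nothing like this exists in PUDL today.

```python
import pandas as pd


def ffill_bga(bga: pd.DataFrame, latest_year: int) -> pd.DataFrame:
    """Copy the most recent year's associations forward to latest_year.

    Hypothetical helper: assumes associations are stable between years.
    """
    last_year = int(bga["report_year"].max())
    latest = bga[bga["report_year"] == last_year]
    # One copy of the latest associations per missing year.
    filled = [
        latest.assign(report_year=year)
        for year in range(last_year + 1, latest_year + 1)
    ]
    return pd.concat([bga, *filled], ignore_index=True)


# Toy associations available only through 2022.
bga = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "generator_id": ["G1", "G2"],
        "boiler_id": ["B1", "B2"],
        "report_year": [2022, 2022],
    }
)

bga_filled = ffill_bga(bga, latest_year=2023)
print(sorted(bga_filled["report_year"].unique().tolist()))  # [2022, 2023]
```

Any filled rows would be estimates rather than reported data, so it would probably be worth flagging them (for example with a boolean column) so downstream users can tell them apart.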