Implement fuel price filling using API aggregations

catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

https://catalyst.coop/pudl

MIT License

456 stars 106 forks source link

Implement fuel price filling using API aggregations #1765

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Following on #1764, re-implement the filling of missing fuel prices in the output / analysis functions using the appropriate aggregated fuel prices that have been collected from the EIA API, and which should now be available from the PUDL DB.

[x] Define and load new static association tables based on pudl.metadata.dfs.SECTOR_AGGREGATE_ASSN and pudl.metadata.dfs.FUEL_AGGREGATE_ASSN
[ ] Order the available fuel price aggregates from most to least specific, so more precise aggregations can be prioritized.
[x] Iteratively use the stored aggregates to fill in missing fuel prices based on time, location, fuel type, and industrial sector.
[x] Compare the aggregated fuel prices available from the API to our best guesses of what the aggregates should be for times/places/fuels/sectors where we have a lot of data, to verify that nothing crazy is happening.

TrentonBush commented 1 year ago

The first iteration of this is to simply reimplement the same filling method used by the EIA API process, but using cached values from the bulk data instead of fetching them from the unreliable web API. Concretely, that means reimplementing pudl.output.eia923.get_fuel_cost_avg_eiaapi() using the bulk data.

That output consists of a table of fuel costs aggregated by month, state, and fuel type. All values are for the "electric power (total)" sector (utilities + IPPs). That information is easily available in the bulk data.

The only significant difference I see is that the get_fuel_cost_avg_eiaapi fuel costs are given per unit (though immediately processed into per MMBTU in pudl.output.eia923.fuel_receipts_costs_eia923()) , whereas the bulk data is per MMBTU from the start. So a few lines of processing code in fuel_receipts_costs_eia923 must be changed to account for this difference.

TrentonBush commented 1 year ago

The fuel costs from the API imputation can be quite different compared to the costs from the bulk data (see scatter plot below. It is a log-log plot of API (x axis) vs bulk (y axis) fuel costs per mmBTU. Both axes are log10 scale). Note that the exact same rows are imputed in both cases -- the differences are in the imputed values themselves. loglog_fuel_prices

I think this is because the current API code ingests aggregate per-unit fuel costs and converts them to per-mmbtu by dividing by row-level fuel_mmbtu_per_unit. The existing bulk data method uses aggregate per-mmbtu fuel costs directly, which stays at the aggregate level. It is possible to extract the per-unit data series from the bulk data and use either process (we would need to change the bulk data ETL code).

I think the question of "which is better" belongs with the second stage of this work (not in scope for this issue) about improving the imputation model & methodology. For now I have used the direct, aggregate fuel_cost_per_mmbtu from the bulk data.

TrentonBush commented 1 year ago

Closed by #1998 with minor scope change: rather than using multiple granularities of aggregates and filling missing values from most to least granular, we use a single level. This single level mimics the behavior of the pre-existing method. Further refinement is left for further imputation model development such as issue #1766