Closed zaneselvans closed 1 year ago
The first iteration of this is to simply reimplement the same filling method used by the EIA API process, but using cached values from the bulk data instead of fetching them from the unreliable web API. Concretely, that means reimplementing pudl.output.eia923.get_fuel_cost_avg_eiaapi()
using the bulk data.
That output consists of a table of fuel costs aggregated by month, state, and fuel type. All values are for the "electric power (total)" sector (utilities + IPPs). That information is easily available in the bulk data.
The only significant difference I see is that the get_fuel_cost_avg_eiaapi
fuel costs are given per unit (though immediately processed into per MMBTU in pudl.output.eia923.fuel_receipts_costs_eia923()
) , whereas the bulk data is per MMBTU from the start. So a few lines of processing code in fuel_receipts_costs_eia923
must be changed to account for this difference.
The fuel costs from the API imputation can be quite different compared to the costs from the bulk data (see scatter plot below. It is a log-log plot of API (x axis) vs bulk (y axis) fuel costs per mmBTU. Both axes are log10 scale). Note that the exact same rows are imputed in both cases -- the differences are in the imputed values themselves.
I think this is because the current API code ingests aggregate per-unit fuel costs and converts them to per-mmbtu by dividing by row-level fuel_mmbtu_per_unit
. The existing bulk data method uses aggregate per-mmbtu fuel costs directly, which stays at the aggregate level. It is possible to extract the per-unit data series from the bulk data and use either process (we would need to change the bulk data ETL code).
I think the question of "which is better" belongs with the second stage of this work (not in scope for this issue) about improving the imputation model & methodology. For now I have used the direct, aggregate fuel_cost_per_mmbtu
from the bulk data.
Closed by #1998 with minor scope change: rather than using multiple granularities of aggregates and filling missing values from most to least granular, we use a single level. This single level mimics the behavior of the pre-existing method. Further refinement is left for further imputation model development such as issue #1766
Following on #1764, re-implement the filling of missing fuel prices in the output / analysis functions using the appropriate aggregated fuel prices that have been collected from the EIA API, and which should now be available from the PUDL DB.
pudl.metadata.dfs.SECTOR_AGGREGATE_ASSN
andpudl.metadata.dfs.FUEL_AGGREGATE_ASSN