catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Explore external validation sources for fuel imputation #1712

Closed TrentonBush closed 2 years ago

TrentonBush commented 2 years ago

Our fuel imputations, regardless of how great of a model we can make, are limited by the fact that missing data is unobserved and uncertain. We have a couple of sources of potential external validation to help bound that uncertainty and validate the imputations. But they are higher level aggregates that will take some translation to use.

zaneselvans commented 2 years ago

I poked around at the other fuel categories available in the EIA API, and IIRC it only provided:

If we attempt to reverse-engineer the calculation that's feeding into the EIA API, would we end up at one of these endpoints?

But maybe we can find the actual methodology and implement it, or ask them directly whether there's any additional information about the redacted prices aggregated in there.

I think we'll struggle to wring much useful information out of the FERC 1 here for a few reasons:

TrentonBush commented 2 years ago

What I'm hoping happens with EIA data is that we can exactly reproduce their numbers for state-months that happen to have no missing data. This confirms that we understand the methodology. Then for non-matching ones, we can subtract out the data we have and infer that the differences belong to the missing values. The "monthly energy_source_code prices at the national level" could possibly help disaggregate further.

Then we can compare our imputed values, aggregated to the same level, to the EIA values. I don't think we can train a model on metrics that coarse, but it will be helpful to check for overall bias.
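
Roughly, the subtraction step might look like this (just a sketch with illustrative column names, assuming the FRC table reports fuel quantities even where prices are redacted):

```python
import pandas as pd

def infer_redacted_price(frc_group: pd.DataFrame, eia_price: float) -> float:
    """Back out the implied average price of the redacted records in a single
    state-month-fuel group, given the EIA aggregate price for that group.

    Assumes `fuel_mmbtu` is always reported and `fuel_cost_per_mmbtu` is null
    exactly where the price was redacted (hypothetical column names).
    """
    observed = frc_group.dropna(subset=["fuel_cost_per_mmbtu"])
    redacted = frc_group[frc_group["fuel_cost_per_mmbtu"].isna()]

    total_mmbtu = frc_group["fuel_mmbtu"].sum()
    observed_cost = (observed["fuel_cost_per_mmbtu"] * observed["fuel_mmbtu"]).sum()

    # The EIA aggregate is an MMBTU-weighted average over *all* records, so the
    # total cost it implies, minus the observed cost, belongs to the redacted rows.
    implied_redacted_cost = eia_price * total_mmbtu - observed_cost
    return implied_redacted_cost / redacted["fuel_mmbtu"].sum()
```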

TrentonBush commented 2 years ago

Validity Concerns & Path Forward

IPPs Don't Reveal Their Costs

Missing cost data is essentially binary by plant - a given plant either reports everything or nothing. 47% of plants fall into each category, with only 6% having a mixture of null and not-null. The problem is that plants are not missing at random: almost no IPPs reveal their cost data. There are two problems with this:

  1. Is it valid to assume IPPs pay the same prices as utilities? We can't test this without alternative data sources such as EIA EPM.
  2. Geographic clustering of IPPs (see below) will force imputation models to extrapolate, harming accuracy.

image.png

IPPs are Clustered Geographically

After fuel type and time, location is one of the most impactful variables determining fuel costs, and prediction accuracy suffers when it can't be exploited. IPPs are clustered in certain areas: the Northeast, Texas, and California. These areas have few or even zero utility-owned plants, so there is very little cost data from nearby plants, and our imputation models will not be able to make much use of geographic proximity.

image.png

The states of Pennsylvania and Maine, for example, have zero plants that report fuel costs. ISO-NE, NYISO, PJM, ERCOT, and CAISO all have several times more IPP plants than utility plants.

image.png

Good News: EIA Includes IPPs in Aggregates

EIA redacts aggregates that are too specific, but the redacted records are still included in the higher-level aggregates. We can use this to test, at an aggregate level, our assumption that IPPs pay prices similar to utilities, and to test whether our model can extrapolate geographically.

zaneselvans commented 2 years ago

Wow, half of plants report no price data?! I had no idea. It's only about 1/3 of the total heat content. I wonder if that's explained by the IPP plants being smaller, more efficient, or having lower capacity factors?

Even if the best we can do is to demonstrate that our estimates are consistent with the aggregated EIA data which includes redacted prices, I think there's a lot of value in making complete and reasonable prices available.

Also from an advocacy point of view, I think the non-IPP plants are much more likely to be targeted, since (in theory) the IPPs already have a clear incentive not to run if they're economically uncompetitive, so having less certainty on the IPP fuel prices is probably not horrible.

Another aggregated sanity check could be to look at the monthly per-state, per-fuel prices that EIA reports, which are somewhat distinct from the purchase prices. For NG at least it's reported for the electric sector as a whole. Hopefully these prices would be available in a different part of the API. It looks like the coal price data is broken down by state / census region / coal basin, as well as market type (captive, open, and total). Though I guess the location of sale and the location of consumption could be different, so maybe this would only be accurate for fuel deliveries where the mine state and plant state (or census region, or coal basin) are the same.

zaneselvans commented 2 years ago

I made some 2-d histograms / heatmaps to compare the reported and predicted fuel prices, and was surprised at how much dispersion there is. The 1-d distributions can look very similar, even though there doesn't seem to be great correlation between the individual reported & predicted values. E.g.:

image

(I was using a scatterplot before, but it was extremely slow with hundreds of thousands of points to plot)

TrentonBush commented 2 years ago

Is the above plot basically `y_train` vs `model.predict(x_train)`?

zaneselvans commented 2 years ago

Yes that's right. Lots more (1D) plots over here #1720

TrentonBush commented 2 years ago

Reproducing EIA's Aggregates

To validate our imputations against EIA's aggregates, we first have to confirm that we understand what their aggregates are. EIA provides aggregate prices for state-month-fuel-sector groups. Errors can arise from any of those group definitions or from the aggregate metric itself or possibly from differences in data processing. So far, after filtering for groups with 100% data availability in the FRC table, I have been able to exactly reproduce their aggregates for only 72% of groups. Natural gas has a particularly large discrepancy, with a 4 cent bias (FRC < EIA API). I suspect the remaining error comes from either data processing or sector definitions.

A brief point on "exact" matches: EIA data is provided with 2 decimal places of precision (0.01). I consider values within half a unit-in-the-last-place (so 0.005) to be "exact" matches. This accounts for rounding error.

Aggregation: MMBTU-weighted average price

I am very confident in the aggregate metric because, if it was wrong, there is no way 72% of values would be exactly correct.
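
For concreteness, a minimal sketch of the aggregation and the matching criterion (the column names are illustrative, not the actual FRC schema):

```python
import numpy as np
import pandas as pd

def weighted_price(group: pd.DataFrame) -> float:
    """MMBTU-weighted average price for one state-month-fuel-sector group."""
    return np.average(group["fuel_cost_per_mmbtu"], weights=group["fuel_mmbtu"])

def is_exact_match(frc_price: float, eia_price: float, ulp: float = 0.01) -> bool:
    """EIA publishes prices to 2 decimal places, so anything within half a
    unit-in-the-last-place (0.005) is indistinguishable from a perfect match."""
    return abs(frc_price - eia_price) < ulp / 2
```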

Group Definitions

Error Metrics and Plots

The following is for the "Electricity Sector", defined as regulated utilities plus IPPs plus NaN sectors. The state-month-fuel-sector groups have been filtered to those with 100% data availability (zero imputation). This filtering effectively excludes all data with any IPP, industrial, or commercial fuel purchases. Error is defined as EIA aggregates minus FRC aggregates.

image

Error by Fuel

Recall that the numerical precision is 0.01, so errors less than 0.005 are essentially zero.

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.0093 | 0.00083 | 0.00304 | 0.0044 | 2576 | 2053 | 0.7969 | -0.0051 | -0.000513 |
| natural_gas | 0.2068 | 0.8084 | 0.0040 | 0.0432 | 1119 | 699 | 0.6246 | 0.0429 | -0.000092 |
| petroleum | 0.1111 | 0.3362 | 0.0036 | 0.0066 | 2702 | 1787 | 0.6613 | -0.0022 | 0.000000 |
| petroleum_coke | 0.0385 | 0.0158 | 0.0030 | 0.0169 | 704 | 589 | 0.8366 | -0.0352 | -0.000479 |

TrentonBush commented 2 years ago

Imputation Validation

Using the same method as above, I looked at the aggregate accuracy of modeled prices for the electricity sector (IPP + utility + NaN). The particular model used here produced large biases, particularly for gas and oil prices.
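
For reference, the baseline, pure-imputation, and mixed slices in the sections below are defined by the share of each group's price data that is actually measured. Roughly (a sketch with illustrative column names):

```python
import numpy as np
import pandas as pd

def classify_groups(frc: pd.DataFrame) -> pd.Series:
    """Label each state-month-fuel-sector group by how much of its price data
    is measured: all of it, none of it, or something in between."""
    keys = ["state", "report_month", "fuel_group_eia", "sector"]
    measured_frac = (
        frc.assign(measured=frc["fuel_cost_per_mmbtu"].notna())
        .groupby(keys)["measured"]
        .mean()
    )
    labels = np.select(
        [measured_frac == 1.0, measured_frac == 0.0],
        ["pure_measurement", "pure_imputation"],
        default="mixed",
    )
    return pd.Series(labels, index=measured_frac.index, name="group_type")
```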

Baseline: look at training data only

Again, I filter for groups with 100% data availability, but instead of aggregating measured values I aggregate modeled ones. This shows the interpolation performance of the model.

image

Error by Fuel

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.0782 | 0.0917 | 0.0426 | 0.0324 | 2576 | 188 | 0.0729 | 0.0218 | 0.0038 |
| natural_gas | 0.7442 | 2.3192 | 0.3578 | 0.1402 | 1119 | 11 | 0.0098 | 0.5523 | 0.2113 |
| petroleum | 1.1690 | 3.0009 | 0.8075 | 0.0754 | 2702 | 14 | 0.0051 | 0.2240 | -0.0095 |
| petroleum_coke | 0.2203 | 0.1090 | 0.1326 | 0.1578 | 704 | 32 | 0.0454 | -0.0533 | -0.0279 |

Pure Imputation Performance

Now I filter for pure imputation: groups with zero measured data. This shows extrapolation performance. Note that there are many state-month-fuel-sector groups that are neither pure imputation nor pure measurement -- they are not included here. This sample is regionally concentrated in IPP-heavy areas.

image

Error by Fuel

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.2642 | 0.1320 | 0.2091 | 0.0911 | 369 | 3 | 0.0081 | 0.0334 | -0.0149 |
| natural_gas | 0.9959 | 3.6346 | 0.5497 | 0.2260 | 611 | 2 | 0.0032 | 0.3462 | -0.0396 |
| petroleum | 1.5314 | 3.8381 | 1.2834 | 0.0961 | 345 | 0 | 0.0000 | -0.2761 | -0.6542 |
| petroleum_coke | 1.0799 | 1.5611 | 1.0323 | 0.6176 | 42 | 0 | 0.0000 | -1.0799 | -1.0323 |

Mixed Imputation Performance

Now I filter for mixed imputation: groups with more than zero but less than 100% measured data.

image

Error by Fuel

There is no petcoke in this sample.

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.1247 | 0.0360 | 0.0721 | 0.0597 | 976 | 44 | 0.0450 | -0.0381 | -0.0118 |
| natural_gas | 0.5100 | 2.1763 | 0.2041 | 0.0970 | 2524 | 52 | 0.0206 | 0.2068 | 0.0007 |
| petroleum | 1.3426 | 3.6684 | 0.9492 | 0.0783 | 645 | 2 | 0.0031 | 0.5835 | 0.3099 |

zaneselvans commented 2 years ago

It's not hard to switch a plant attribute from being static to annually varying. If 4% of plants changed sector between 2013 and 2020 that should definitely be in the annually varying table, not the static table. Our initial categorization of these things happened a long time ago and we've added a lot more years of data since then. We should probably go back over them and move anything questionable into the annual table. Do you think that's likely to be enough to address the mismatch between the API and our aggregated values?

Could we reasonably back/forward fill the plant sectors if they were treated as a time-varying attribute? Or when sector is missing for a plant, is it missing for all time? Or maybe the missing sectors are partly a result of values being nulled out as "inconsistent" when the year-to-year variation was actually real.

It seems like natural gas is the biggest source of variability / error.

Are you clipping the crazy fuel prices (like > $1000/mcf gas?) before doing the aggregations?

What do you think of the performance overall? A 22% MAPE for natural gas in the pure imputation case seems kind of big.

Is it possible to set some kind of aggregate constraint on the model, to require (or incentivize) it to match the aggregated values as well as the reported per-plant prices? If it was possible would we expect it to actually improve the model?

zaneselvans commented 2 years ago

Some basic facts

Some Options...

TrentonBush commented 2 years ago

Do you think [sector changes are] likely to be enough to address the mismatch between the API and our aggregated values?

I doubt that impacts 28% of records, but I don't really see any single "smoking gun" factor here.

when sector is missing for a plant, is it missing for all time?

Currently yes, because they are static attributes. I have not explored the time varying sectors beyond my comparison of 2013 to 2020, but I would expect forward/backward filling to work just fine.

Are you clipping the crazy fuel prices (like > $1000/mcf gas?) before doing the aggregations?

Yes, which would contribute to the modeled bias if EIA leaves them unchanged. But I think you mentioned there are only 200-ish such values, so I doubt that can explain 28% of aggregates.

What do you think of the performance overall? A 22% MAPE for natural gas in the pure imputation case seems kind of big.

I agree, it does seem big. That may be because the "pure imputation" cases are clustered in the Northeast (I think, need to check), which I vaguely remember hearing is a tight gas market compared to the rest of the country, at least in winter. With no training data there, I'm not sure any model is going to perform super well.

Is it possible to ... incentivize [the model] to match the aggregated values as well as the [training data]? ... would we expect it to actually improve the model?

Yes it is possible via either a custom loss function or maybe by somehow including EIA aggregates into the training data (like via imputation). It would improve the model in the sense of nudging the predictions towards the aggregates. There may be unwanted side effects like reducing real variation within the state-month-fuel-sector groups for groups with mixed reported/imputed values (which are more numerous but less impactful than pure imputations).
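
As a rough sketch of what that could look like (not how the current model is trained), an aggregate-consistency penalty could be added to the usual per-record loss, something like:

```python
import numpy as np
import pandas as pd

def aggregate_penalty(
    pred_price: np.ndarray,
    mmbtu: np.ndarray,
    group_ids: np.ndarray,
    eia_price_by_group: dict,
) -> float:
    """Mean squared gap between the MMBTU-weighted average of predicted prices
    within each state-month-fuel-sector group and the published EIA aggregate.
    Could be added, with some weight, to the per-record training loss."""
    df = pd.DataFrame({"pred": pred_price, "mmbtu": mmbtu, "group": group_ids})
    agg = (
        df.assign(cost=df["pred"] * df["mmbtu"])
        .groupby("group")[["cost", "mmbtu"]]
        .sum()
    )
    agg_price = agg["cost"] / agg["mmbtu"]
    gaps = [
        agg_price[g] - p for g, p in eia_price_by_group.items() if g in agg_price.index
    ]
    return float(np.mean(np.square(gaps))) if gaps else 0.0
```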

TrentonBush commented 2 years ago

I looked into error in more detail and found a few things:

This chart is a little busy, but the pieces are:

And the same plot as above, but filtered for states with zero imputed values. These are states where the FRC and EIA aggregates diverge regardless of the model. I think this may be because of outlier handling (clipping) or non-static sector assignments.

image

TrentonBush commented 2 years ago

Ah yeah, this is where our regional data availability is a problem 😕. The Northeast has a unique seasonality in gas prices that other regions simply don't have. Because the Northeast has little to no presence in the FRC data (our training data), I don't think it is possible for any model to reproduce this using only FRC data. We would have to use EIA aggregates somehow if we want to capture that.

image

image

zaneselvans commented 2 years ago

Hmm, and it looks like the seasonal excursions are always positive, so I guess that explains the model consistently undershooting. This does seem like important variability to capture, and anybody familiar with gas / electricity markets in the NE will notice the lack of seasonal price swings.

I mentioned to @silky that it might be useful to work on a real problem together to see if the technical adviser position would be a good fit, and this seems like a kind of problem that might be adjacent to his experience, and not require too much spelunking through our codebase to engage on.

Should I go ahead and move the sector_id_eia into the plants_eia860 table and out of the plants_entity_eia table so it can vary annually?

zaneselvans commented 2 years ago

If we have to use the API data anyway to capture this huge variability in one of the most populous parts of the US, I'm curious whether using the estimator on top of the reported aggregate prices seems like a good idea? Would we expect that many of the other predictive factors from other states / plants would also apply in the NE, on top of the background seasonal variability?

And if we want to depend on the API, how can we make that less painful? A couple of options from above:

zaneselvans commented 2 years ago

@silky let me know if you want additional context. This is one of several issues in an epic #1708 and there's also a somewhat out of date previous PR #1696. And the EIA API we're talking about (and frustrated by...) is over here

zaneselvans commented 2 years ago

Okay, @TrentonBush and I just had a chat and decided to move forward with archiving the EIA API data and adding it to the Zenodo datastore, so we can access it as a static resource just like everything else and update it annually with the final data. Then we'll need to figure out exactly how we want to integrate it -- whether it makes sense to use it in conjunction with the fuel receipts and costs data to train a model that can be aware of the seasonal variation in places like the NE that have virtually no disaggregated data, or whether we would want to go back to simply filling in the missing data with state-month aggregates (though that still left us with about 10% of all records having no fuel price).

Tasks

TrentonBush commented 2 years ago

The EIA aggregates have a hierarchical, multi-level structure: annual aggregates are composed of quarterly aggregates, which are in turn composed of monthly aggregates, etc. We could, if it's deemed worth the research effort, use this hierarchical structure to partially disaggregate the records. This could improve the precision of our estimates for data redacted from the FRC table.
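
As a toy example of what the hierarchy buys us: if an annual state aggregate is published but a single month within it is redacted, the missing month's price is pinned down exactly by the other eleven, assuming the quantities are known (a sketch with hypothetical inputs):

```python
def recover_redacted_month(annual_price, monthly_prices, monthly_mmbtu):
    """Solve for a single redacted monthly price from the annual aggregate.

    `monthly_prices` holds the published monthly prices, with None where the
    value was redacted; `monthly_mmbtu` holds the (unredacted) quantities.
    """
    total_mmbtu = sum(monthly_mmbtu)
    known_cost = sum(
        p * q for p, q in zip(monthly_prices, monthly_mmbtu) if p is not None
    )
    redacted_mmbtu = sum(
        q for p, q in zip(monthly_prices, monthly_mmbtu) if p is None
    )
    return (annual_price * total_mmbtu - known_cost) / redacted_mmbtu
```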

Our primary sourcing options are 1) the EIA web API and 2) EIA Electric Power Monthly (EPM). These sources make available slightly different resolutions of the data. Because the data formats are quite different (JSON vs. compressed Excel sheets), it may be cost-prohibitive to integrate both sources.

The EIA API gives 3 spatial resolutions (state, Census region, national), 3 temporal resolutions (monthly, quarterly, annual) and 3-ish sectoral resolutions (15 total categories including CHP/non-CHP, sector, IPP + Utility, all sectors) for a total of up to 135 hierarchically-related timeseries. A single plant could appear in as many as 32 different aggregates. In practice, most of the higher-resolution aggregates will be redacted, so a more typical number is probably something like 20.

The EPM version is similar, but with one fewer sectoral level (no CHP breakdown) and slightly different temporal levels (it replaces the quarterly aggregate with a monthly year-to-date aggregate, and also provides a 12-month rolling aggregate at the national level).

zaneselvans commented 2 years ago

It seems like the marginal benefit for including the EPM data is modest, but the addition of a whole new spreadsheet dataset would be pretty annoying.

Is there an off-the-shelf way to use the hierarchical time series to partially recover the redacted values and integrate those constraints into the estimator? Or is this something that has to be custom built?

zaneselvans commented 2 years ago

See #1767 for continuation of this work.