catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Explore external validation sources for fuel imputation #1712

Closed TrentonBush closed 2 years ago

TrentonBush commented 2 years ago

Our fuel imputations, regardless of how great of a model we can make, are limited by the fact that missing data is unobserved and uncertain. We have a couple of sources of potential external validation to help bound that uncertainty and validate the imputations. But they are higher level aggregates that will take some translation to use.

zaneselvans commented 2 years ago

I poked around at the other fuel categories available in the EIA API, and IIRC it only provided:

If we attempt to reverse-engineer the calculation that's feeding into the EIA API, would we end up at one of these endpoints?

But maybe we can find the actual methodology and implement it, or ask them directly whether there's any additional information about the redacted prices aggregated in there.

I think we'll struggle to wring much useful information out of the FERC 1 here for a few reasons:

TrentonBush commented 2 years ago

What I'm hoping happens with EIA data is that we can exactly reproduce their numbers for state-months that happen to have no missing data. This confirms that we understand the methodology. Then for non-matching ones, we can subtract out the data we have and infer that the differences belong to the missing values. The "monthly energy_source_code prices at the national level" could possibly help disaggregate further.

Then we can compare our imputed values, aggregated to the same level, to the EIA values. I don't think we can train a model on metrics that coarse, but it will be helpful to check for overall bias.
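
Roughly, the subtraction step might look like this (just a sketch with illustrative column names, assuming the FRC table reports fuel quantities even where prices are redacted):

```python
import pandas as pd

def infer_redacted_price(frc_group: pd.DataFrame, eia_price: float) -> float:
    """Back out the implied average price of the redacted records in a single
    state-month-fuel group, given the EIA aggregate price for that group.

    Assumes `fuel_mmbtu` is always reported and `fuel_cost_per_mmbtu` is null
    exactly where the price was redacted (hypothetical column names).
    """
    observed = frc_group.dropna(subset=["fuel_cost_per_mmbtu"])
    redacted = frc_group[frc_group["fuel_cost_per_mmbtu"].isna()]

    total_mmbtu = frc_group["fuel_mmbtu"].sum()
    observed_cost = (observed["fuel_cost_per_mmbtu"] * observed["fuel_mmbtu"]).sum()

    # The EIA aggregate is an MMBTU-weighted average over *all* records, so the
    # total cost it implies, minus the observed cost, belongs to the redacted rows.
    implied_redacted_cost = eia_price * total_mmbtu - observed_cost
    return implied_redacted_cost / redacted["fuel_mmbtu"].sum()
```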

TrentonBush commented 2 years ago

Validity Concerns & Path Forward

IPPs Don't Reveal Their Costs

Missing cost data is essentially binary by plant - a given plant either reports everything or nothing. 47% of plants fall into each category, with only 6% having a mixture of null and not-null. The problem is that plants are not missing at random: almost no IPPs reveal their cost data. There are two problems with this:

  1. Is it valid to assume IPPs pay the same prices as utilities? We can't test this without alternative data sources such as EIA EPM.
  2. Geographic clustering of IPPs (see below) will force imputation models to extrapolate, harming accuracy.

image.png

IPPs are Clustered Geographically

After fuel type and time, location is one of the most impactful variables determining fuel costs, and prediction accuracy suffers when it can't be exploited. IPPs are clustered in certain areas: the Northeast, Texas, and California. These areas have few or even zero utility-owned plants, so there is very little cost data from nearby plants, and our imputation models will not be able to make much use of geographic proximity.

image.png

The states of Pennsylvania and Maine, for example, have zero plants that report fuel costs. ISO-NE, NYISO, PJM, ERCOT, and CAISO all have several times more IPP plants than utility plants.

image.png

Good News: EIA Includes IPPs in Aggregates

EIA redacts aggregates that are too specific, but the redacted records are still included in the higher-level aggregates. We can use this to test, at an aggregate level, our assumption that IPPs pay prices similar to utilities, and to test whether our model can extrapolate geographically.

zaneselvans commented 2 years ago

Wow, half of plants report no price data?! I had no idea. It's only about 1/3 of the total heat content. I wonder if that's explained by the IPP plants being smaller, more efficient, or having lower capacity factors?

Even if the best we can do is to demonstrate that our estimates are consistent with the aggregated EIA data which includes redacted prices, I think there's a lot of value in making complete and reasonable prices available.

Also from an advocacy point of view, I think the non-IPP plants are much more likely to be targeted, since (in theory) the IPPs already have a clear incentive not to run if they're economically uncompetitive, so having less certainty on the IPP fuel prices is probably not horrible.

Another aggregated sanity check could be to look at the monthly per-state, per-fuel prices that EIA reports, which are somewhat distinct from the purchase prices. For NG at least it's reported for the electric sector as a whole. Hopefully these prices would be available in a different part of the API. It looks like the coal price data is broken down by state / census region / coal basin, as well as market type (captive, open, and total). Though I guess the location of sale and the location of consumption could be different, so maybe this would only be accurate for fuel deliveries where the mine state and plant state (or census region, or coal basin) are the same.

zaneselvans commented 2 years ago

I made some 2-d histograms / heatmaps to compare the reported and predicted fuel prices, and was surprised at how much dispersion there is. The 1-d distributions can look very similar, even though there doesn't seem to be great correlation between the individual reported & predicted values. E.g.:

image

(I was using a scatterplot before, but it was extremely slow with hundreds of thousands of points to plot)

TrentonBush commented 2 years ago

Is the above plot basically `y_train` vs `model.predict(x_train)`?

zaneselvans commented 2 years ago

Yes that's right. Lots more (1D) plots over here #1720

TrentonBush commented 2 years ago

Reproducing EIA's Aggregates

To validate our imputations against EIA's aggregates, we first have to confirm that we understand what their aggregates are. EIA provides aggregate prices for state-month-fuel-sector groups. Errors can arise from any of those group definitions or from the aggregate metric itself or possibly from differences in data processing. So far, after filtering for groups with 100% data availability in the FRC table, I have been able to exactly reproduce their aggregates for only 72% of groups. Natural gas has a particularly large discrepancy, with a 4 cent bias (FRC < EIA API). I suspect the remaining error comes from either data processing or sector definitions.

A brief point on "exact" matches: EIA data is provided with 2 decimal places of precision (0.01). I consider values within half a unit-in-the-last-place (so 0.005) to be "exact" matches. This accounts for rounding error.

Aggregation: MMBTU-weighted average price

I am very confident in the aggregate metric because, if it was wrong, there is no way 72% of values would be exactly correct.
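
For concreteness, a minimal sketch of the aggregation and the matching criterion (the column names are illustrative, not the actual FRC schema):

```python
import numpy as np
import pandas as pd

def weighted_price(group: pd.DataFrame) -> float:
    """MMBTU-weighted average price for one state-month-fuel-sector group."""
    return np.average(group["fuel_cost_per_mmbtu"], weights=group["fuel_mmbtu"])

def is_exact_match(frc_price: float, eia_price: float, ulp: float = 0.01) -> bool:
    """EIA publishes prices to 2 decimal places, so anything within half a
    unit-in-the-last-place (0.005) is indistinguishable from a perfect match."""
    return abs(frc_price - eia_price) < ulp / 2
```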

Group Definitions

Error Metrics and Plots

The following is for the "Electricity Sector", defined as regulated utilities plus IPPs plus NaN sectors. The state-month-fuel-sector groups have been filtered to those with 100% data availability (zero imputation). This filtering effectively excludes all data with any IPP, industrial, or commercial fuel purchases. Error is defined as EIA aggregates minus FRC aggregates.

image

Error by Fuel

Recall that the numerical precision is 0.01, so errors less than 0.005 are essentially zero.

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.0093 | 0.00083 | 0.00304 | 0.0044 | 2576 | 2053 | 0.7969 | -0.0051 | -0.000513 |
| natural_gas | 0.2068 | 0.8084 | 0.0040 | 0.0432 | 1119 | 699 | 0.6246 | 0.0429 | -0.000092 |
| petroleum | 0.1111 | 0.3362 | 0.0036 | 0.0066 | 2702 | 1787 | 0.6613 | -0.0022 | 0.000000 |
| petroleum_coke | 0.0385 | 0.0158 | 0.0030 | 0.0169 | 704 | 589 | 0.8366 | -0.0352 | -0.000479 |

TrentonBush commented 2 years ago

Imputation Validation

Using the same method as above, I looked at the aggregate accuracy of modeled prices for the electricity sector (IPP + utility + NaN). The particular model used here produced large biases, particularly for gas and oil prices.
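
For reference, the baseline, pure-imputation, and mixed slices in the sections below are defined by the share of each group's price data that is actually measured. Roughly (a sketch with illustrative column names):

```python
import numpy as np
import pandas as pd

def classify_groups(frc: pd.DataFrame) -> pd.Series:
    """Label each state-month-fuel-sector group by how much of its price data
    is measured: all of it, none of it, or something in between."""
    keys = ["state", "report_month", "fuel_group_eia", "sector"]
    measured_frac = (
        frc.assign(measured=frc["fuel_cost_per_mmbtu"].notna())
        .groupby(keys)["measured"]
        .mean()
    )
    labels = np.select(
        [measured_frac == 1.0, measured_frac == 0.0],
        ["pure_measurement", "pure_imputation"],
        default="mixed",
    )
    return pd.Series(labels, index=measured_frac.index, name="group_type")
```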

Baseline: look at training data only

Again, I filter for groups with 100% data availability, but instead of aggregating measured values I aggregate modeled ones. This shows the interpolation performance of the model.

image

Error by Fuel

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.0782 | 0.0917 | 0.0426 | 0.0324 | 2576 | 188 | 0.0729 | 0.0218 | 0.0038 |
| natural_gas | 0.7442 | 2.3192 | 0.3578 | 0.1402 | 1119 | 11 | 0.0098 | 0.5523 | 0.2113 |
| petroleum | 1.1690 | 3.0009 | 0.8075 | 0.0754 | 2702 | 14 | 0.0051 | 0.2240 | -0.0095 |
| petroleum_coke | 0.2203 | 0.1090 | 0.1326 | 0.1578 | 704 | 32 | 0.0454 | -0.0533 | -0.0279 |

Pure Imputation Performance

Now I filter for pure imputation: groups with zero measured data. This shows extrapolation performance. Note that there are many state-month-fuel-sector groups that are neither pure imputation nor pure measurement -- they are not included here. This sample is regionally concentrated in IPP-heavy areas.

image

Error by Fuel

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.2642 | 0.1320 | 0.2091 | 0.0911 | 369 | 3 | 0.0081 | 0.0334 | -0.0149 |
| natural_gas | 0.9959 | 3.6346 | 0.5497 | 0.2260 | 611 | 2 | 0.0032 | 0.3462 | -0.0396 |
| petroleum | 1.5314 | 3.8381 | 1.2834 | 0.0961 | 345 | 0 | 0.0000 | -0.2761 | -0.6542 |
| petroleum_coke | 1.0799 | 1.5611 | 1.0323 | 0.6176 | 42 | 0 | 0.0000 | -1.0799 | -1.0323 |

Mixed Imputation Performance

Now I filter for mixed imputation: groups with more than zero but less than 100% measured data.

image

Error by Fuel

There is no petcoke in this sample.

| fuel_group_eia | mae | mse | med_ae | mape | n_samples | n_exact | frac_exact | mean_bias | median_bias |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| coal | 0.1247 | 0.0360 | 0.0721 | 0.0597 | 976 | 44 | 0.0450 | -0.0381 | -0.0118 |
| natural_gas | 0.5100 | 2.1763 | 0.2041 | 0.0970 | 2524 | 52 | 0.0206 | 0.2068 | 0.0007 |
| petroleum | 1.3426 | 3.6684 | 0.9492 | 0.0783 | 645 | 2 | 0.0031 | 0.5835 | 0.3099 |

zaneselvans commented 2 years ago

It's not hard to switch a plant attribute from being static to annually varying. If 4% of plants changed sector between 2013 and 2020 that should definitely be in the annually varying table, not the static table. Our initial categorization of these things happened a long time ago and we've added a lot more years of data since then. We should probably go back over them and move anything questionable into the annual table. Do you think that's likely to be enough to address the mismatch between the API and our aggregated values?

Could we reasonably back/forward fill the plant sectors if they were treated as a time-varying attribute? Or when sector is missing for a plant, is it missing for all time? Or maybe the missing sectors are partly a result of values being nulled out as "inconsistent" when the year-to-year variation was actually real.

It seems like natural gas is the biggest source of variability / error.

Are you clipping the crazy fuel prices (like > $1000/mcf gas?) before doing the aggregations?

What do you think of the performance overall? A 22% MAPE for natural gas in the pure imputation case seems kind of big.

Is it possible to set some kind of aggregate constraint on the model, to require (or incentivize) it to match the aggregated values as well as the reported per-plant prices? If it was possible would we expect it to actually improve the model?

zaneselvans commented 2 years ago

Some basic facts

Some Options...

TrentonBush commented 2 years ago

Do you think [sector changes are] likely to be enough to address the mismatch between the API and our aggregated values?

I doubt that impacts 28% of records, but I don't really see any single "smoking gun" factor here.

when sector is missing for a plant, is it missing for all time?

Currently yes, because they are static attributes. I have not explored the time varying sectors beyond my comparison of 2013 to 2020, but I would expect forward/backward filling to work just fine.

Are you clipping the crazy fuel prices (like > $1000/mcf gas?) before doing the aggregations?

Yes, which would contribute to the modeled bias if EIA leaves them unchanged. But I think you mentioned there are only 200-ish such values, so I doubt that can explain 28% of aggregates.

What do you think of the performance overall? A 22% MAPE for natural gas in the pure imputation case seems kind of big.

I agree, it does seem big. That may be because the "pure imputation" cases are clustered in the Northeast (I think, need to check), which I vaguely remember hearing is a tight gas market compared to the rest of the country, at least in winter. With no training data there, I'm not sure any model is going to perform super well.

Is it possible to ... incentivize [the model] to match the aggregated values as well as the [training data]? ... would we expect it to actually improve the model?

Yes it is possible via either a custom loss function or maybe by somehow including EIA aggregates into the training data (like via imputation). It would improve the model in the sense of nudging the predictions towards the aggregates. There may be unwanted side effects like reducing real variation within the state-month-fuel-sector groups for groups with mixed reported/imputed values (which are more numerous but less impactful than pure imputations).
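
As a rough sketch of what that could look like (not how the current model is trained), an aggregate-consistency penalty could be added to the usual per-record loss, something like:

```python
import numpy as np
import pandas as pd

def aggregate_penalty(
    pred_price: np.ndarray,
    mmbtu: np.ndarray,
    group_ids: np.ndarray,
    eia_price_by_group: dict,
) -> float:
    """Mean squared gap between the MMBTU-weighted average of predicted prices
    within each state-month-fuel-sector group and the published EIA aggregate.
    Could be added, with some weight, to the per-record training loss."""
    df = pd.DataFrame({"pred": pred_price, "mmbtu": mmbtu, "group": group_ids})
    agg = (
        df.assign(cost=df["pred"] * df["mmbtu"])
        .groupby("group")[["cost", "mmbtu"]]
        .sum()
    )
    agg_price = agg["cost"] / agg["mmbtu"]
    gaps = [
        agg_price[g] - p for g, p in eia_price_by_group.items() if g in agg_price.index
    ]
    return float(np.mean(np.square(gaps))) if gaps else 0.0
```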

TrentonBush commented 2 years ago

I looked into error in more detail and found a few things:

This chart is a little busy, but the pieces are:

And the same plot as above, but filtered for states with zero imputed values. These are states where the FRC and EIA aggregates diverge regardless of the model. I think this may be because of outlier handling (clipping) or non-static sector assignments.

image

TrentonBush commented 2 years ago

Ah yeah, this is where our regional data availability is a problem 😕. The Northeast has a unique seasonality in gas prices that other regions simply don't have. Because the Northeast has little to no presence in the FRC data (our training data), I don't think it is possible for any model to reproduce this using only FRC data. We would have to use EIA aggregates somehow if we want to capture that.

image

image

zaneselvans commented 2 years ago

Hmm, and it looks like the seasonal excursions are always positive, so I guess that explains the model consistently undershooting. This does seem like important variability to capture, and anybody familiar with gas / electricity markets in the NE will notice the lack of seasonal price swings.

I mentioned to @silky that it might be useful to work on a real problem together to see if the technical adviser position would be a good fit, and this seems like a kind of problem that might be adjacent to his experience, and not require too much spelunking through our codebase to engage on.

Should I go ahead and move the sector_id_eia into the plants_eia860 table and out of the plants_entity_eia table so it can vary annually?

zaneselvans commented 2 years ago

If we have to use the API data anyway to capture this huge variability in one of the most populous parts of the US, I'm curious whether using the estimator on top of the reported aggregate prices seems like a good idea? Would we expect that many of the other predictive factors from other states / plants would also apply in the NE, on top of the background seasonal variability?

And if we want to depend on the API, how can we make that less painful? A couple of options from above:

zaneselvans commented 2 years ago

@silky let me know if you want additional context. This is one of several issues in an epic #1708 and there's also a somewhat out of date previous PR #1696. And the EIA API we're talking about (and frustrated by...) is over here

zaneselvans commented 2 years ago

Okay, @TrentonBush and I just had a chat and decided to move forward with archiving the EIA API data and adding it to the Zenodo datastore, so we can access it as a static resource just like everything else and update it annually with the final data. Then we'll need to figure out exactly how we want to integrate it -- whether it makes sense to use it in conjunction with the fuel receipts and costs data to train a model that can be aware of the seasonal variation in places like the NE that have virtually no disaggregated data, or whether we would want to go back to simply filling in the missing data with state-month aggregates (though that still left us with about 10% of all records having no fuel price).

Tasks

TrentonBush commented 2 years ago

The EIA aggregates have a hierarchical, multi-level structure: annual aggregates are composed of quarterly aggregates, which are in turn composed of monthly aggregates, etc. We could, if it's deemed worth the research effort, use this hierarchical structure to partially disaggregate the records. This could improve the precision of our estimates for data redacted from the FRC table.
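
As a toy example of what the hierarchy buys us: if an annual state aggregate is published but a single month within it is redacted, the missing month's price is pinned down exactly by the other eleven, assuming the quantities are known (a sketch with hypothetical inputs):

```python
def recover_redacted_month(annual_price, monthly_prices, monthly_mmbtu):
    """Solve for a single redacted monthly price from the annual aggregate.

    `monthly_prices` holds the published monthly prices, with None where the
    value was redacted; `monthly_mmbtu` holds the (unredacted) quantities.
    """
    total_mmbtu = sum(monthly_mmbtu)
    known_cost = sum(
        p * q for p, q in zip(monthly_prices, monthly_mmbtu) if p is not None
    )
    redacted_mmbtu = sum(
        q for p, q in zip(monthly_prices, monthly_mmbtu) if p is None
    )
    return (annual_price * total_mmbtu - known_cost) / redacted_mmbtu
```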

Our primary sourcing options are 1) the EIA web API and 2) EIA Electric Power Monthly (EPM). These sources make available slightly different resolutions of the data. Because the data formats are quite different (JSON vs. compressed Excel sheets), it may be cost-prohibitive to integrate both sources.

The EIA API gives 3 spatial resolutions (state, Census region, national), 3 temporal resolutions (monthly, quarterly, annual) and 3-ish sectoral resolutions (15 total categories including CHP/non-CHP, sector, IPP + Utility, all sectors) for a total of up to 135 hierarchically-related timeseries. A single plant could appear in as many as 32 different aggregates. In practice, most of the higher-resolution aggregates will be redacted, so a more typical number is probably something like 20.

The EPM version is similar, but with one fewer sectoral level (no CHP breakdown) and slightly different temporal levels (it replaces the quarterly aggregate with a monthly year-to-date aggregate, and also provides a 12-month rolling aggregate at the national level).

zaneselvans commented 2 years ago

It seems like the marginal benefit for including the EPM data is modest, but the addition of a whole new spreadsheet dataset would be pretty annoying.

Is there an off-the-shelf way to use the hierarchical time series to partially recover the redacted values and integrate those constraints into the estimator? Or is this something that has to be custom built?

zaneselvans commented 2 years ago

See #1767 for continuation of this work.