catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Estimate monthly plant-level fuel prices w/o using the EIA API #1343

Closed zaneselvans closed 2 years ago

zaneselvans commented 3 years ago

Instead of using the EIA API to pull monthly average fuel costs by state and fuel when individual fuel deliveries have their costs redacted in the fuel_receipts_costs_eia923 table, calculate it for ourselves.

Motivation

This change will address several issues:

Approach

Choosing Aggregations

Intuitive Aggregation Priorities

  1. Most precise: ["state", "energy_source_code", "report_date"]
  2. Annual aggregation (coal): ["state", "energy_source_code", "report_year"]
  3. Regional aggregation (petroleum): ["census_region", "energy_source_code", "report_date"]
  4. Fuel type aggregation: ["state", "fuel_group_code", "report_date"]
  5. Both regional and fuel type aggregation: ["census_region", "fuel_group_code", "report_date"]
  6. Annual, regional, and fuel type aggregations: ["census_region", "fuel_group_code", "report_year"]

Questions:

Other Potential Refinements

Remaining tasks:

zaneselvans commented 2 years ago

Hey @katherinelamb I'd like to talk to you about how we might use some ML / regressions to fill in these missing fuel costs if that's an issue you might like to take on.

katie-lamb commented 2 years ago

This definitely sounds interesting and doable.

Potential data for features

It's probably best to start by trying to reduce the problem to either the spatial or the temporal dimension and then try some regressions and common modeling frameworks. It seems harder, but still doable, to model both spatially and temporally.
A novel framework for spatio-temporal prediction

zaneselvans commented 2 years ago

While we do have the lat/lon location for all the plants, for the "spatial" dimension I was imagining something like looking at the prices reported in adjacent states when they're available, since I imagine there are generally regional price trends. I think this is what Aranya / Greg did. E.g. if a given state-month-fuel combination is missing a price, fill it in with the average of the reported prices for adjacent states for that fuel and month. That filled-in state-month-fuel table could then be used to estimate the cost per unit of fuel for any records in the fuel_receipts_costs_eia923 table that have redacted values. But maybe that's more vague and complex than it needs to be?
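
A rough sketch of that adjacent-state fill, assuming an frc dataframe with the fuel_receipts_costs_eia923 columns and a hypothetical NEIGHBORS adjacency map (only partially spelled out here):

import pandas as pd

# Hypothetical state adjacency map -- only a couple of entries shown for illustration.
NEIGHBORS = {
    "CO": ["WY", "NE", "KS", "OK", "NM", "AZ", "UT"],
    "WY": ["MT", "SD", "NE", "CO", "UT", "ID"],
    # ... remaining states would need to be filled in
}

def fill_from_neighbors(frc: pd.DataFrame) -> pd.DataFrame:
    """Fill missing state-month-fuel prices with the mean of adjacent states' prices."""
    prices = (
        frc.groupby(["report_date", "energy_source_code", "state"], as_index=False)
        ["fuel_cost_per_mmbtu"].mean()
    )
    # One column per state, indexed by (report_date, energy_source_code)
    wide = prices.pivot_table(
        index=["report_date", "energy_source_code"],
        columns="state",
        values="fuel_cost_per_mmbtu",
    )
    filled = wide.copy()
    for state, neighbors in NEIGHBORS.items():
        if state not in wide.columns:
            continue
        available = [s for s in neighbors if s in wide.columns]
        filled[state] = wide[state].fillna(wide[available].mean(axis=1))
    return filled.stack().rename("fuel_cost_per_mmbtu").reset_index()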

I guess in an ideal world we'd be able to make predictions / estimations for every record with a missing value, given all of the information available in a very generic way, but that seems like asking a lot. There would be a lot of categorical variables (state, buyer, seller, fuel type, spot market vs. contract purchase) and numerical values (quantity of fuel purchased, heat content per unit of fuel, cost of fuel for that month in all adjacent states, or the region, or the country as a whole). And really you could bring in any number of utility, plant, and generator attributes if they were helpful.

zaneselvans commented 2 years ago

I looked into this today and found some things.

API vs. frc_eia923

Here's how the fuel API prices compare to an aggregation from the reported data using the (slightly different) high level categories. Each point corresponds to a particular (report_date, state, fuel_type) combo.

Coal

image

Oil

image

Gas

(note the different X and Y scales because of some outliers) image

What to do?

zaneselvans commented 2 years ago

I'm curious what @cmgosnell thinks about the simpler options above as an interim change.

cmgosnell commented 2 years ago

For context on the original choice to use the larger categories: iirc we went with fuel_type_code_pudl so that we'd get higher coverage. iirc in the original way we were doing it (i.e. not what is currently in dev) there was some (unnecessary) limiting factor, so we went with the broad fuel types to get broader coverage.

I agree generally that we can and probably should use the more granular energy_source_code. I assume that whatever we do won't align perfectly with EIA (in part because I could never find their methodology for how they generated these state-level values). I would personally not work with option 2 above (reform fuel_type_code_pudl). The desire imo is something defensible, not something that perfectly matches EIA.

zaneselvans commented 2 years ago

Do you think it makes sense to swap in our own average fuel prices based on the fuel_type_code_pudl aggregations for the time being, if it gives us coverage that's similar to what we're getting from the EIA API (which IIRC, wasn't great)?

Another thought: does it make more sense to try and estimate the price per unit or the price per MMBTU? (where unit is ton, barrel, or mcf). I would think that within these big categories there would be less dispersion on a $/MMBTU basis, since what you're really buying is heat content, not mass (i.e. BIT costs much more than LIG per ton, but they're closer to each other per MMBTU). I think we have the option of doing either one.

cmgosnell commented 2 years ago

I definitely think using fuel_type_code_pudl for the time being makes sense, but I assume it's similar math to try it with energy_source_code, so I'd personally check the coverage of the latter before defaulting to fuel_type_code_pudl.

I am less familiar with the unit vs btu distinction, but I think you are on to something! If all the info is there (which I believe it is in the frc table), then yeah go for it... but again I'd check energy_source_code first. iirc the original coverage of the API fuel source was artificially limited because of something I did that was silly and was later fixed. My gut tells me that we'd get pretty good overall coverage using just energy_source_code with our own data.

I would almost use state-level energy_source_code prices and then try nearby states or just a national average.

zaneselvans commented 2 years ago

Okay, I'll try the simplistic (report_date, state, energy_source_code) and (report_date, state, fuel_type_code_pudl) options and see how the coverage compares to what we're currently doing.

Beyond that, rather than immediately jumping to our own bespoke estimator based on adjacent states, I'd like to try a few generic regression options from scikit-learn that can make use of a lot more features in the FRC table, and that can be trained and validated on the records in the table that do have fuel prices (which is most of them).

I guess we can also easily check whether the price per physical unit or the price per MMBTU is more consistent just by plotting the distributions of observed prices in the dataset per date / state / fuel. It seems like whichever variable has less dispersion would be the easier one to estimate accurately. But curious what @katie-lamb and @TrentonBush think with their stats background.
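
A quick sketch of that dispersion check, assuming fuel_cost_per_mmbtu and fuel_mmbtu_per_unit are both available in the frc dataframe, using a robust relative spread (IQR / median) within each date / state / fuel group:

import scipy.stats

frc["fuel_cost_per_unit"] = frc.fuel_cost_per_mmbtu * frc.fuel_mmbtu_per_unit
group_cols = ["report_date", "state", "energy_source_code"]

def relative_iqr(s):
    """Interquartile range scaled by the median, so the two price variables are comparable."""
    return scipy.stats.iqr(s, nan_policy="omit") / s.median()

dispersion = (
    frc.groupby(group_cols)[["fuel_cost_per_mmbtu", "fuel_cost_per_unit"]]
    .agg(relative_iqr)
)
# Typical relative spread of each price variable across all groups:
print(dispersion.median())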

zaneselvans commented 2 years ago

Maybe @priyald17 or @joshdr83 would have suggestions on how best to approach this too.

zaneselvans commented 2 years ago

For context here's the fuel_receipts_costs_eia923 table on Datasette and the field we're trying to estimate is fuel_cost_per_mmbtu. There are a bunch of additional attributes associated with both the plant_id_eia and the mine_id_pudl including the lat/lon of the plants and the county FIPS code associated with many of the coal mines, that could also be used to construct features.

TrentonBush commented 2 years ago

To clarify the problem:

Is that right?

I see two separate parts to this:

  1. How well can we model known prices given all the other info we have in PUDL?
  2. Given the EIA API has information we don't have, how can we use that to validate our imputations of unknown prices?

Relatedly, the first few scoping questions I'd ask are:

Ideally we test the validity, but remember we do have the option to just "assume and caveat" if the stakes are low. Plus, if EIA has done their job, there won't be a good way to test it.

With respect to modelling, I'd let the "what are the stakes" question guide the degree of complexity (simple averages vs fancier models). I'm not familiar with how redaction works (are the same plants always redacted? Does it come and go? etc), but in principle I think we have everything we need to construct a well formulated ML solution, if that is warranted.

katie-lamb commented 2 years ago

I wrote down these questions that are pretty similar to what Trenton asks above:

If the redacted data is kind of "one off" and there is still enough representative data, then doing a regression to estimate prices for each date, state, and fuel type could definitely work. But I also agree that we have all the tools and data needed to do something more comprehensive if it seems that the missing data can't be represented that simplistically.

Although I think I got a little lost at some point about the construction of fuel_type_code_pudl vs. energy_source_code and price per MMBTU vs. price per unit, I think it makes sense to do some feature selection with the attributes of the fuel receipts cost table + energy_source_code and attributes from related tables, for each of the potential price attributes we might try to estimate, instead of assuming that nearest states or something is the best estimator. The scores from this feature selection could also reveal something about which price attribute is best to estimate. Maybe the k best features are similar for each estimated variable, or maybe there's one variable that stands out as more easily modeled.

zaneselvans commented 2 years ago

The distribution of missing values is skewed towards the early years (2008-2011) before filling with the API:

image

After filling with the API it's pretty uniform across all the years:

image

zaneselvans commented 2 years ago

What do you think would be a good way to evaluate the dispersion of fuel prices? It seems like there are at least 3 dimensions to consider in sampling and estimating fuel prices:

Using the more granular options (plants, months, energy_source_code) seems like it should give us lower dispersion results, but it'll also reduce the sample sizes pretty dramatically, reducing confidence in the number we get out. For example, here are some state-level distributions for a single year and a single energy source code. I chose a mix of big states (CA, FL, NY, TX), medium states (CO, WY), and small states (ID, ME). NG and SUB are the most common fuel types.

Price distribution by state for NG (2020)

image

Price distribution by state for SUB (2020)

image

zaneselvans commented 2 years ago

Rather than just counting individual fuel deliveries up, would it make more sense to weight them by the total heat content of the fuel delivered?
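
E.g. something like this heat-content-weighted average, assuming fuel_received_units and fuel_mmbtu_per_unit can be used to compute each delivery's total MMBTU (NaN handling omitted for brevity):

def heat_weighted_mean_price(grp):
    """Average price weighted by the total heat content of each delivery."""
    mmbtu = grp["fuel_received_units"] * grp["fuel_mmbtu_per_unit"]
    return (grp["fuel_cost_per_mmbtu"] * mmbtu).sum() / mmbtu.sum()

weighted_prices = (
    frc.groupby(["state", "energy_source_code", "report_date"])
    .apply(heat_weighted_mean_price)
    .rename("wtd_fuel_cost_per_mmbtu")
)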

TrentonBush commented 2 years ago

The following is ONLY for gas prices (filtering not shown)

I'm sure we could tighten that much further with a better model. I'm just not sure what our goals should be here without an application in mind. Maybe a reasonable next step would be to throw a low-effort, high-power model like XGBoost at it and see what kind of accuracy is possible.

# I didn't bother with any train-test split here so these error values should be considered best-case indicators
# (frc is the fuel_receipts_costs_eia923 dataframe)
import numpy as np

frc['global_median_fuel_price'] = frc['fuel_cost_per_mmbtu'].median()
frc['month_median_fuel_price'] = frc.groupby('report_date')['fuel_cost_per_mmbtu'].transform(np.median)
frc['state_month_median_fuel_price'] = frc.groupby(['state', 'report_date'])['fuel_cost_per_mmbtu'].transform(np.median)

error = frc.loc[:, [
                           'global_median_fuel_price',
                           'month_median_fuel_price',
                           'state_month_median_fuel_price',
                        ]
               ].sub(frc['fuel_cost_per_mmbtu'], axis=0)

pct_error = error.div(frc['fuel_cost_per_mmbtu'], axis=0)

image image image

katie-lamb commented 2 years ago

To first address the goal of removing the EIA API calls from the CI asap:

Do you have an accuracy metric on how well the most granular option (plants, months, energy_source_code) performs against the API results? Or how swapping in state level instead of plant affects accuracy? I think for now you could just sub in the median of whatever combo of the three dimensions gets the highest API accuracy. It seems like it might be state, month, energy code. However, @zaneselvans mentioned there are states with no data (I think you actually said a state-level average is not available), so does this necessitate multi-state aggregations?

After this quick fix, I think looking at what other features we could bring into a more powerful model would be useful.

TrentonBush commented 2 years ago

Are the API results our target or are the redacted values our target? I thought the API was basically just an outsourced imputation model. If it includes redacted data in its aggregates then it has value for validation. Is there anything more to the EIA API?

katie-lamb commented 2 years ago

I believe you're right and the API is just modeling these values. After talking to Zane, I think to get a quick fix for today/tomorrow the API results are our target since people currently using the table don't want the values to change significantly. But to develop a more accurate and comprehensive solution in the less short term, the redacted values are indeed the target.

TrentonBush commented 2 years ago

Oh I didn't know there was a timeline on this at all, much less tomorrow! Do what you gotta do

katie-lamb commented 2 years ago

LOL ya I don't think this discussion is really framed around a timeline and is in fact more long term ideating of modeling tactics. My comment was really just spurred by annoyance that the CI is unreliable and wanting something to stand in there while a better solution is developed.

joshdr83 commented 2 years ago

Hey sorry, just tuning in! I did a spatial interpolation for fuel prices to average down to the county level for new build estimates in this paper. Is it mostly the non ISO regions that are short of data?

zaneselvans commented 2 years ago

from collections import OrderedDict
import logging

import numpy as np
import pandas as pd
import scipy.stats

logger = logging.getLogger(__name__)

# pudl_out is assumed to be an existing PudlTabl output object that provides frc_eia923()

cols = [
    "report_date",
    "plant_id_eia",
    "state",
    "energy_source_code",
    "fuel_cost_per_mmbtu",
    "fuel_group_code",
    "fuel_consumed_mmbtu", # required for weighting
]

CENSUS_REGIONS = {
    "ENC": ["WI", "MI", "IL", "IN", "OH"],
    "ESC": ["KY", "TN", "MS", "AL"],
    "MAT": ["NY", "PA", "NJ"],
    "MTN": ["MT", "ID", "WY", "NV", "UT", "CO", "AZ", "NM"],
    "NEW": ["ME", "VT", "NH", "MA", "CT", "RI"],
    "PCC": ["CA", "OR", "WA"],
    "PCN": ["AK", "HI"],
    "SAT": ["WV", "MD", "DE", "DC", "VA", "NC", "SC", "GA", "FL"],
    "WNC": ["ND", "SD", "MN", "IA", "NE", "KS", "MO"],
    "WSC": ["OK", "AR", "TX", "LA"],
}

CENSUS_REGION_MAPPING = {state: region for region, states in CENSUS_REGIONS.items() for state in states}

aggs = OrderedDict({
    # The most precise estimator we have right now
    "state_esc_month": {
        "agg_cols": ["state", "energy_source_code", "report_date"],
        "fuel_group_code": None,
    },
    # Good for coal, since price varies much more with location than time
    "state_esc_year": {
        "agg_cols": ["state", "energy_source_code", "report_year"],
        "fuel_group_code": "coal"
    },
    # Good for oil products, because prices are consistent geographically
    "region_esc_month": {
        "agg_cols": ["census_region", "energy_source_code", "report_date"],
        "fuel_group_code": "petroleum",
    },
    # Less fuel specificity, but still precise date and location
    "state_fgc_month": {
        "agg_cols": ["state", "fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    # Less location and fuel specificity
    "region_fgc_month": {
        "agg_cols": ["census_region", "fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    "region_fgc_year": {
        "agg_cols": ["census_region", "fuel_group_code", "report_year"],
        "fuel_group_code": None,
    },

    "national_esc_month":{
        "agg_cols":  ["energy_source_code", "report_date"],
        "fuel_group_code": None,
    },
    "national_fgc_month": {
        "agg_cols": ["fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    "national_fgc_year": {
        "agg_cols": ["fuel_group_code", "report_year"],
        "fuel_group_code": None,
    },
})

frc = (
    pudl_out.frc_eia923()
    .loc[:, cols]
    .assign(
        report_year=lambda x: x.report_date.dt.year,
        census_region=lambda x: x.state.map(CENSUS_REGION_MAPPING),
        fuel_cost_per_mmbtu=lambda x: x.fuel_cost_per_mmbtu.replace(0.0, np.nan),
        filled=lambda x: x.fuel_cost_per_mmbtu,
    )
)

frc["filled_by"] = pd.NA
frc.loc[frc.fuel_cost_per_mmbtu.notna(), ["filled_by"]] = "original"

for agg in aggs:
    agg_cols = aggs[agg]["agg_cols"]
    fgp = aggs[agg]["fuel_group_code"]

    logger.info(f"Aggregating fuel prices by {agg}")
    frc[agg] = (
        frc.groupby(agg_cols)["fuel_cost_per_mmbtu"]
        .transform("median") # Use weighted median to avoid right-skew
    )

    logger.info(f"Calculating error for {agg}")
    frc[agg+"_err"] = (frc[agg] - frc.fuel_cost_per_mmbtu) / frc.fuel_cost_per_mmbtu

    logger.info(f"Filling fgp with {agg}")
    if fgp:
        logger.info(f"Filling only {fgp}")
    mask = (
        (frc.filled.isna())
        & (frc[agg].notna())
        & (frc.fuel_group_code == fgp if fgp else True)
    )
    logger.info(f"Mask selects {sum(mask)} values.")
    frc.loc[mask, "filled_by"] = agg
    frc.loc[mask, "filled"] = frc.loc[mask, agg]
frc.filled_by.value_counts()
original              373841
state_esc_month       127336
region_fgc_month       30420
national_esc_month     10091
region_esc_month        9823
state_fgc_month         6836
national_fgc_month      1079
region_fgc_year          702
state_esc_year           232
Name: filled_by, dtype: int64

Error distributions by fuel group

Generally these lined up with my intuition about which kinds of aggregation would introduce more error (as represented by the interquartile range of the error distribution). The ordering differs between the fuel types, though, so we could automatically apply all of the aggregations on a per-fuel basis, in order of increasing IQR (see the sketch after the plots below). The impact would be pretty marginal, but it would also do the right thing if future data looked different. We could also break out each year separately and apply that process, so that if the nature of the fuel price correlations changes over time, it would capture that and do the right thing each year (e.g. if lots of gas pipelines get built, maybe regional variation in gas prices would go away, and in those later years averaging across large areas would have much less impact than it does now...).

import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams["figure.figsize"] = (10, 9)
for fgc in ["coal", "petroleum", "natural_gas"]:
    fig, axes = plt.subplots(ncols=3, nrows=3, sharex=True, sharey=True)
    for agg, ax in zip(aggs, axes.flatten()):
        logger.info(f"Plotting {agg} error")
        error = frc.loc[frc.fuel_group_code==fgc, agg+"_err"]
        ax.hist(error[(error != 0.0)], range=(-1,1), bins=200)
        iqr = scipy.stats.iqr(error, nan_policy="omit")
        ax.set_title(f"{agg} (iqr={iqr:0.3})")
        ax.axvline(x=0, lw=1, ls="--", color="black")

    fig.suptitle(f"Fuel Group: {fgc}", fontsize=16)
    fig.tight_layout()
    plt.show();

image image image
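
A rough sketch of what that automated per-fuel ordering could look like, reusing the frc dataframe and aggs dict from the block above (illustrative only):

fill_order = {}
for fgc in ["coal", "petroleum", "natural_gas"]:
    iqrs = {
        agg: scipy.stats.iqr(frc.loc[frc.fuel_group_code == fgc, agg + "_err"], nan_policy="omit")
        for agg in aggs
    }
    # Candidate aggregations ordered from lowest to highest error IQR for this fuel group
    fill_order[fgc] = sorted(iqrs, key=iqrs.get)

# fill_order[fgc] could then drive the filling loop for each fuel group,
# instead of relying on the single hand-ordered aggs dict.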

Filled vs. Reported Distributions

To see how different the filled values are from the reported values, and whether there's an obvious bias being introduced, I made the plots below. It looks like maybe there's a sliiiiight rightward skew in the filled values for coal and natural gas, but not sure it's significant.

fig, axes = plt.subplots(ncols=3)#, sharex=True, sharey=True)
fuel_group_codes = ["coal", "petroleum", "natural_gas"]
matplotlib.rcParams["figure.figsize"] = (10, 4)
for fgc, ax in zip(fuel_group_codes, axes):
    ax.hist(
        frc.loc[frc.fuel_group_code == fgc, "fuel_cost_per_mmbtu"],
        label="reported",
        alpha=0.5,
        range=(0,40),
        bins=80,
        density=True,
    )
    ax.hist(
        frc.loc[(frc.fuel_group_code == fgc) & (frc.fuel_cost_per_mmbtu.isna()), "filled"],
        label="filled",
        alpha=0.5,
        range=(0,40),
        bins=40,
        density=True,
    )
    ax.legend(loc="upper right")
    ax.set_title(f"Fuel Group: {fgc}")
    ax.set_xlabel("Fuel Price [$/MMBTU]")
fig.tight_layout()
plt.show()

image

zaneselvans commented 2 years ago

To compare the filled in values from the API vs. our aggregations for each fuel group and aggregation method, I made this plot.

sns.relplot(
    data=frc[
        (frc.fuel_group_code.isin(["coal", "petroleum", "natural_gas"]))
        & (frc.fuel_cost_from_eiaapi)
        & (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
    ],
    x="fuel_cost_per_mmbtu",
    y="filled",
    row="fuel_group_code",
    col="filled_by",
    facet_kws=dict(xlim=(0,40), ylim=(0,40)),
    linewidth=0.0,
    alpha=0.1,
    s=5,
)

Here the x-axes are the API-based fuel prices, and the y-axes are the aggregation-based fuel prices. Columns are the aggregation methods, and rows are the fuel groups. In the aggregation method names, fgc = fuel group code and esc = energy source code (the more specific fuels).

image

For the petroleum and natural_gas fuel groups, there's a good 1:1 correlation between the API and aggregation based prices, but it's harder to see what's going on with coal because the points are all bunched together, so I zoomed in on those prices only:

sns.relplot(
    data=frc[
        (frc.fuel_group_code.isin(["coal"]))
        & (frc.fuel_cost_from_eiaapi)
        & (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
    ],
    x="fuel_cost_per_mmbtu",
    y="filled",
    col="filled_by",
    col_wrap=3,
    facet_kws=dict(xlim=(0,8), ylim=(0,8)),
    linewidth=0.0,
    alpha=0.1,
    s=5,
)

For the state-level aggregations, there's some scatter, but generally a correlation between the API and aggregation based fuel prices. However, for the regional and national averages, it seems like there's a more substantial difference between the two methods. Some of the variability that's present in the API results is getting flattened by these larger spatial averages.

image

This isn't to say that the new method is necessarily worse, given the fuel grouping differences between PUDL and EIA, but it does suggest that there's some benefit to be had by refining the imputation further, especially since one of our main applications of this data is to identify unusually expensive coal plants for advocates. We also have the most additional information about the coal deliveries, since we usually know the mine name, what state it's in, maybe a unique ID for the mine, the supplier name, the level of ash and sulfur in the coal, what kind of contract it was delivered on, etc.

Here are the proportions of coal records that are being filled in with the different aggregations:

frc[frc.fuel_group_code=="coal"].filled_by.value_counts() / sum(frc.fuel_group_code=="coal")
original              0.713344
state_esc_month       0.144066
region_fgc_month      0.067036
national_esc_month    0.042347
state_fgc_month       0.026690
national_fgc_month    0.004286
state_esc_year        0.001138
region_fgc_year       0.001093

About 11% of all coal records end up getting regional or national average prices assigned to them, and that's a little more than 1/3 of the total coal records which are being filled in.

I'm inclined to say that this is good enough for now, but that because these are probably some of the records we care most about, it will be worth coming back to this issue and trying to use a more generic model like XGBoost or a random forest that can utilize the other information we have.

What do y'all think?

TrentonBush commented 2 years ago

I was ready to say "yes this is good enough", especially considering the results were so similar to our previous method of EIA API values. But now I'm confused!

Here the x-axes are the API based fuel price

I think I'm not understanding exactly what these plots are -- I thought your scatter plots near the top of this issue showed that API vs state-month average were extremely linear. Why are the state-fgc-month results so much different here? I'd be surprised if changing the agg func from mean to median made that much difference. If it did, that's interesting to know!

zaneselvans commented 2 years ago

@TrentonBush The earlier scatter plots are comparing all the reported data -- so the ones where there actually was data in the FRC table, and they're only being aggregated by [state, month, fuel_group]

The more recent scatter plots are only looking at data points that were not present in the FRC table, and comparing the values which were filled in by our new method (breaking it out into all the different kinds of aggregation used) vs. the API values. So it's not surprising that the correlation is worse in general.

TrentonBush commented 2 years ago

Oh ok, now if I squint I can see how they are related -- basically remove the line and just look at the dispersion, plus add some x-coordinate noise to account for the switch from mean to median. Ya this model makes sense to me. I think using the median is justified given the handful of big outliers I ran into. We definitely want to use the mean for the purpose of disaggregating the API results in the future though.

one of our main applications of this data is to identify unusually expensive coal plants for advocates

That's an awesome use case I'd love to learn more about! There is certainly more accuracy to wring out of this data, and that kind of anomaly detection would probably require it. Maybe we find a day or two next sprint to try other models

zaneselvans commented 2 years ago

Yeah, there are some wild outliers. I want to figure out how to remove them and fill them in with reasonable values. Some of them are out of bounds by factors of like 1000x -- natural gas is particularly messy. The mean values were quite skewed.

Given that the values aren't really normally distributed, would it still make sense to replace anything more than 3-4$\sigma$ away from the mean? I could drop the top and bottom 0.1% of the distribution too, but there aren't always big outliers, and sometimes more than 0.1% of the points are bad. Not sure what the right thing to do is.
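
For comparison, a quick sketch of the two trimming rules being weighed here (a sigma-based rule around the mean vs. symmetric quantile clipping), applied to the raw price column:

prices = frc["fuel_cost_per_mmbtu"]

sigma_outliers = (prices - prices.mean()).abs() > 4 * prices.std()
quantile_outliers = (prices < prices.quantile(0.001)) | (prices > prices.quantile(0.999))

print(f"sigma rule flags {sigma_outliers.mean():.3%} of records, "
      f"quantile rule flags {quantile_outliers.mean():.3%}")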

zaneselvans commented 2 years ago

I'm doing some additional normalization to the table and moving the fuel_group_code to the energy_sources_eia table, since the value is entirely determined by energy_source_code with this mapping:

assert (frc.groupby("energy_source_code").fuel_group_code.nunique() == 1).all()
dict(frc.groupby("energy_source_code").fuel_group_code.first())
{'BFG': 'other_gas',
 'BIT': 'coal',
 'DFO': 'petroleum',
 'JF': 'petroleum',
 'KER': 'petroleum',
 'LIG': 'coal',
 'NG': 'natural_gas',
 'OG': 'other_gas',
 'PC': 'petroleum_coke',
 'PG': 'other_gas',
 'RFO': 'petroleum',
 'SC': 'coal',
 'SGP': 'other_gas',
 'SUB': 'coal',
 'WC': 'coal',
 'WO': 'petroleum'}

I'm also renaming the column to fuel_group_eiaepm since it's the fuel grouping that's used for reporting in the EIA Electric Power Monthly.

zaneselvans commented 2 years ago

That's an awesome use case I'd love to learn more about!

This is basically the RMI ReFi project! The fuel is by far the biggest component of the OpEx for coal plants, and this data has been used to identify which plants are vulnerable to getting refinanced out of existence.

zaneselvans commented 2 years ago

Okay, definitely some room for improvement. E.g. look at how coal prices in NY drop dramatically (in the filled in data) starting in 2012 here:

from typing import Literal

import pandas as pd
import seaborn as sns


def facet_fuel(
    frc: pd.DataFrame,
    states: list[str],
    fuel_facet: Literal["energy_source_code", "fuel_group_eiaepm", "fuel_type_code_pudl"],
    by: Literal["fuel", "state"] = "state",
    fuels: list[str] | None = None,
    max_price: float = 100.0,
    facet_kws: dict | None = None,
) -> None:
    mask = (
        (frc.state.isin(states))
        & (frc.fuel_cost_per_mmbtu <= max_price)
        & (frc[fuel_facet].isin(fuels) if fuels else True)
    )
    facet = fuel_facet if by == "state" else "state"
    sns.relplot(
        data=frc[mask],
        x="report_date",
        y="fuel_cost_per_mmbtu",
        hue=facet,
        style=facet,
        row=fuel_facet if by == "fuel" else "state",
        kind="line",
        height=4,
        aspect=2.5,
        facet_kws=facet_kws,
    );

facet_fuel(
    frc,
    fuel_facet="fuel_group_eiaepm",
    states=["CA", "NY", "FL", "TX"],
    fuels=["coal", "natural_gas", "petroleum"],
    by="fuel",
    facet_kws={"sharey": False},
)

image

zaneselvans commented 2 years ago

I ran tox -e nuke to generate a new database with the slight schema changes and validate its contents, and I ended up getting the capacity_mw and row count errors that @katie-lamb was encountering after the date_merge updates. I merged the changes from her current PR (#1695) and re-ran the validations, but I'm still getting a bunch of row count errors, so maybe they really are due to something that changed here?

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 214669 rows, expected 214613. Off by 0.026%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 213382 rows, expected 213329. Off by 0.025%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 1964 rows, expected 1961. Off by 0.153%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 2492441 rows, expected 2491704. Off by 0.030%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 23498 rows, expected 23445. Off by 0.226%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%

It's not clear off the top of my head why the changes in this PR would result in small changes to the number of rows in these other tables though. Does that make sense to anyone else?

zaneselvans commented 2 years ago

Comparing old expectations, new expectations, and the rows we've got here...

| table | observed rows | old rows | date_merge rows | matches |
|---|---|---|---|---|
| gens_eia860_raw | 491,469 | 491,469 | 490,824 | old |
| gens_eia860_annual | 491,469 | 491,469 | 490,824 | old |
| gens_eia860_monthly | 491,469 | 491,469 | 490,824 | old |
| gf_eia923_raw | 2,507,830 | 2,507,870 | 2,507,040 | neither |
| gf_eia923_annual | 214,669 | 214,709 | 214,613 | neither |
| gf_eia923_monthly | 2,507,830 | 2,507,870 | 2,507,040 | neither |
| gf_nonuclear_eia923_annual | 213,382 | 213,422 | 213,329 | neither |
| gf_nonuclear_eia923_monthly | 2,492,441 | 2,492,481 | 2,491,704 | neither |
| gf_nuclear_eia923_annual | 1,964 | 1,964 | 1,961 | old |
| gf_nuclear_eia923_monthly | 23,498 | 23,498 | 23,445 | old |
| plants_eia860_raw | 171,570 | 171,570 | 171,148 | old |
| plants_eia860_annual | 171,570 | 171,570 | 171,148 | old |
| plants_eia860_monthly | 171,570 | 171,570 | 171,148 | old |
| pu_eia860_raw | 170,792 | 170,792 | 170,370 | old |
| pu_eia860_annual | 170,792 | 170,792 | 170,370 | old |
| pu_eia860_monthly | 170,792 | 170,792 | 170,370 | old |
| utils_eia860_raw | 113,464 | 113,464 | 113,357 | old |
| utils_eia860_annual | 113,464 | 113,464 | 113,357 | old |
| utils_eia860_monthly | 113,464 | 113,464 | 113,357 | old |

zaneselvans commented 2 years ago

More weirdness.

The number of FRC records filled with each of the different aggregations seems a little different now than it was when I was originally developing the process in a notebook.

The lack of states is surprising, but seems like that's just how it is. The loss of a bunch of records in the date_merge() process is confusing. @katie-lamb and @cmgosnell could we talk about what might be going on there? I'm wondering if it might be due to the fact that the FRC table isn't a "time series" in that the report_date field isn't part of a primary key -- it's just a bunch of transactions, and there are some valid duplicate records. We use an autogenerated pseudo-key in the database. I'll compile a dataframe of the records that are getting lost.

katie-lamb commented 2 years ago

After running the ETL, I updated the row counts and merged that PR into dev so that should fix some of your issues @zaneselvans

zaneselvans commented 2 years ago

Ah okay so some of the row counts in your branch weren't actually the expected ones?

zaneselvans commented 2 years ago

@katie-lamb any intuition as to why date_merge() would be causing ~11,000 rows to get dropped from the FRC table?

zaneselvans commented 2 years ago

Comparing the records that exist before and after date_merge() it looks like all of the records for plants that are missing a state field (10,511) are getting dropped for some reason, plus 528 additional records that do have states associated with them. Note that the plant IDs that don't have states are also overwhelmingly NULL in the entity tables, so if there's any additional merging that's happening on other columns in there maybe that would mess things up?

Are these valid duplicate records that are getting treated strangely? If I list all of the duplicate records in the FRC table that I've pulled directly from the DB, it gives me 3510 of them, and all but one of them has a state, so it doesn't seem to be that.

In the same method chain where date_merge() is happening there's also a dropna(subset=["utility_id_eia"]) -- I bet that's where the problem is. The plants that don't have states are also missing almost everything else, so probably they're missing utility IDs. And then a smattering of the other plants just happen to be missing Utility IDs for some other reason.

Counting the rows in the dataframe before / after that dropna() specifically, that's indeed where the 11,000 records drop out. Whoops.

For the purposes of the FRC table is it absolutely necessary that we have the utility ID? The data is really about plants, fuels, and dates, so it seems like this isn't strictly required. If I remove it what happens...

In that case, we keep all the FRC records, but all of the fields which are brought in by merging with the results of pudl.output.eia860.plants_utils_eia860() are NA (plant name, PUDL ID, etc.), which may cause other unexpected problems downstream, so this probably isn't ideal. The real issue is with how plants_utils_eia860() behaves when there are missing ID fields, so I think I should make an issue to deal with this... #1700

zaneselvans commented 2 years ago

Looking for ways to identify outliers in not-so-normal distributions I came across the modified z-score, which is analogous to the z-score in normal distributions, but instead of looking at how many standard deviations away from the mean a data point is, it looks at how many Median Absolute Deviations (MADs) away from the median it is:

mod_zscore = abs(0.6745 * (data - data.median()) / data.mad())

This seems like the kind of measure we want.

In messing around with this I also ended up looking at the median (aka delivery weighted median) vs. (mmbtu) weighted median values of fuel prices in the FRC table, and they're pretty different. I think this is because crazy expensive fuel deliveries tend to be very small -- so they make up a disproportionate number of deliveries (records) compared to how much heat content they represent. The delivery weighted median is $3.25 / MMBTU but the MMBTU weighted median is $2.09 / MMBTU.

A weighted median function that can be used with df.groupby().apply():

def weighted_median(df: pd.DataFrame, data: str, weights: str) -> float:
    df = df.dropna(subset=[data, weights])
    s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))
    midpoint = 0.5 * np.sum(s_weights)
    if any(df[weights] > midpoint):
        w_median = df[data][df[weights] == np.max(df[weights])].array[0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx:idx+2])
        else:
            w_median = s_data[idx+1]
    return w_median

Grouping by report_year and fuel_group_eiaepm and calculating the modified z-score within each group, we get a pretty clean normal-ish distribution, with some long straggling tails that I think we can probably cut off.

DATA_COLS = [
    "plant_id_eia",
    "report_date",
    "energy_source_code",
    "fuel_cost_per_mmbtu",
    "fuel_received_units",
    "fuel_mmbtu_per_unit",
]

GB_COLS = ["report_year", "fuel_group_eiaepm"]

MAX_MOD_ZSCORE = 5.0

logger.info("Query PUDL DB")
frc_db = pd.read_sql("fuel_receipts_costs_eia923", pudl_engine, columns=DATA_COLS)
plant_states = pd.read_sql("SELECT plant_id_eia, state FROM plants_entity_eia", pudl_engine)
fuel_group_eiaepm = pd.read_sql("SELECT code AS energy_source_code, fuel_group_eiaepm FROM energy_sources_eia", pudl_engine)

logger.info("Join tables and calculate derived values")
frc_db = (
    frc_db
    .merge(plant_states, on="plant_id_eia", how="left", validate="many_to_one")
    .merge(fuel_group_eiaepm, on="energy_source_code", how="left", validate="many_to_one")
    .assign(
        report_year=lambda x: x.report_date.dt.year,
        fuel_mmbtu_total=lambda x: x.fuel_received_units * x.fuel_mmbtu_per_unit,
    )
    .pipe(apply_pudl_dtypes, group="eia")
)

logger.info("Calculate weighted median fuel price")
weighted_median_fuel_price = (
    frc_db
    .groupby(GB_COLS)
    .apply(weighted_median, data="fuel_cost_per_mmbtu", weights="fuel_mmbtu_total")
)
weighted_median_fuel_price.name = "weighted_median_fuel_price"
weighted_median_fuel_price = weighted_median_fuel_price.to_frame().reset_index()
frc_db = frc_db.merge(weighted_median_fuel_price, how="left", on=GB_COLS, validate="many_to_one")

logger.info("Calculate weighted fuel price MAD")
frc_db["delta"] = abs(frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price)
frc_db["fuel_price_mad"] = frc_db.groupby(GB_COLS)["delta"].transform("median")

logger.info("Calculate modified z-score")
frc_db["fuel_price_mod_zscore"] = (0.6745 * (frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price) / frc_db.fuel_price_mad).abs()

fraction_outliers = sum(frc_db.fuel_price_mod_zscore > MAX_MOD_ZSCORE) / len(frc_db)
logger.info(f"Modified z-score {MAX_MOD_ZSCORE} labels {fraction_outliers:0.2%} of all prices as outliers")

All fuels:

plt.hist(frc_db.fuel_price_mod_zscore, bins=80, range=(0,8), density=True)
plt.title("Modified z-score when grouped by fuel and year");

image

Fuel groups

sns.displot(
    data=frc_db[frc_db.fuel_group_eiaepm.isin(["coal", "natural_gas", "petroleum"])],
    kind="hist",
    x="fuel_price_mod_zscore",
    row="fuel_group_eiaepm",
    bins=80,
    binrange=(0,8),
    height=2.5,
    aspect=3,
    stat="density",
    linewidth=0,
);

image

zaneselvans commented 2 years ago

I think I have weighted, modified z-score outlier detection working, but am wondering how to apply it appropriately.

The latter sounds more correct, but some of the groupings won't have a large number of records in them.

Also: what's the right wmod_zscore value to use as the cutoff for identifying outliers? At wmod_zscore == 5.0 it identifies about 1.5% of all records as outliers, mostly natural gas. Those outlier records account for only 0.2% of all reported MMBTU.

zaneselvans commented 2 years ago

Outliers identified using modified z-score of aggregations, with max_mod_zscore == 5.0:

TrentonBush commented 2 years ago

"Robust" outlier detection metrics are great! Their downsides are that 1) they can be too aggressive, especially on skewed data and 2) as you've noticed, they are slow because of all the O(n*logn) sorting.

I'm of two minds here. On one hand, it'd be fun and beneficial to tune this model for both accuracy and performance. On the other, we're considering replacing all this with a better model anyway, so maybe we should wait on any tuning until we make that determination. This outlier detection problem probably fits in better with the "improve the imputation model" tasks because both imputation and outlier detection might use the same model. Also, outlier detection is a scope increase from the original "remove the EIA API" goals. The very linear plots of state-month averages vs EIA API imply that EIA isn't doing much outlier removal, which means our users have been using data with outliers averaged in the whole time. I think it makes sense to change things once and well rather than possibly twice.

Accuracy improvement ideas

I'd start by seeing if we can un-skew the price distributions before (and maybe instead of) applying the robust metrics. A variance-stabilizing transform like log or Box-Cox might make the data "normal" enough to use regular z-scores.
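
A minimal sketch of that idea -- ordinary z-scores on log-transformed prices within each report_year / fuel group bin, reusing frc_db and GB_COLS from the earlier block:

import numpy as np

log_price = np.log(frc_db["fuel_cost_per_mmbtu"].replace(0.0, np.nan))
grouped = log_price.groupby([frc_db[col] for col in GB_COLS])
log_zscore = (log_price - grouped.transform("mean")) / grouped.transform("std")
log_outliers = log_zscore.abs() > 4.0  # threshold would still need tuning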

Second, have we checked what the minimum sample sizes are for some of these bins? As far as I can tell, the existing method calculates a bunch of types of aggregates and iteratively fills in NaN values in order of model precision. Sometimes a key combination like ["Rhode Island", "2008-01-01", "Coal"] might be empty and produce np.nan, which is later filled by a coarser aggregate, but other times it might only have a handful of plants and produce weird metrics as a result. Now that we're using this model for both prediction and outlier detection, it could cause problems in both.
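
A sketch of one way to enforce a minimum bin size in the existing groupby approach, so that thin bins fall through to the coarser aggregations (MIN_N is an arbitrary placeholder):

MIN_N = 5  # placeholder for the minimum acceptable sample size

grp = frc.groupby(["state", "energy_source_code", "report_date"])["fuel_cost_per_mmbtu"]
frc["state_esc_month"] = (
    grp.transform("median")
    # Mask aggregates built from fewer than MIN_N reported prices, leaving NaN
    # for the coarser aggregations to fill in later.
    .where(grp.transform("count") >= MIN_N)
)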

Finally, the model we've built here is basically a decision tree, just manually constructed instead of automated. Problems like choosing the right grain of aggregation or enforcing a minimum sample size per bin are taken care of automatically by those algorithms!

Performance improvement ideas

Medians are always going to be slower, but the line s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights])))) jumped out at me because of the use of python iterators on pandas/numpy objects. I think we could probably get the same result faster with something involving df.loc[:, [data, weights]].sort_values(data), df.loc[:, weights].cumsum(), and np.searchsorted.
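
For example, a sketch of that vectorized version (it returns the lower weighted median rather than averaging at exact midpoint ties, which should be fine here):

import numpy as np
import pandas as pd

def weighted_median_fast(df: pd.DataFrame, data: str, weights: str) -> float:
    """Weighted median via sort_values / cumsum / searchsorted, with no python-level loops."""
    df = df.dropna(subset=[data, weights]).sort_values(data)
    if df.empty:
        return np.nan
    cum_weights = df[weights].cumsum().to_numpy()
    midpoint = 0.5 * cum_weights[-1]
    # Index of the first row whose cumulative weight reaches the halfway point
    idx = np.searchsorted(cum_weights, midpoint)
    return df[data].iloc[idx]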

zaneselvans commented 2 years ago

I will admit that

 s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))

was just something I adapted from an example which was operating on lists or numpy arrays, and I didn't completely parse what it is doing to try and re-write it intelligently... like what is going on with the zip inside a zip and unpacking the sorted list. Is it obvious to you what it's doing? I'm not even sure what it means to sort a list of tuples necessarily.

zaneselvans commented 2 years ago

The minimum sample size in some of these bins is definitely small, but it'll use the small sample if it can (even if that's a bad idea). Requiring a bigger minimum sample size would result in more frequently resorting to the less specific estimations, but with many more samples.

Should we have just gone straight to using one of the Swiss-army-knife regression models instead, and just started with a modest set of input columns as features, to be expanded / engineered more later? I haven't really worked with them so I don't know how they're finicky or what the pitfalls are.

TrentonBush commented 2 years ago

I will admit that was just something I adapted from an example

Let he who has not copied from StackOverflow cast the first stone 😂

I had no clue what sorting tuples did, but I just tried it and it sorts them by their first value (falling back to later elements to break ties). I think that pattern is designed for python objects: when given iterables representing columns, the inner zip links the values by row so that sorted doesn't break the connection between them, then the outer zip puts them back into columnar layout. Mapping np.array then turns each column into a separate array. But pandas can accomplish the same thing with df.sort_values(<my_column>) and some indexing.
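
A toy example of the pattern, for the record:

import numpy as np

data = [3, 1, 2]
weights = [30, 10, 20]
pairs = sorted(zip(data, weights))              # [(1, 10), (2, 20), (3, 30)] -- sorted by the data value
s_data, s_weights = map(np.array, zip(*pairs))  # back to columns: array([1, 2, 3]), array([10, 20, 30])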

Also I just noticed the python any() function, which I think will also resort to python iteration. Using df.any() or np.any() will be much faster (relatively, probably not even 100ms difference at n=500k).

Should we have just gone straight to using one of the Swiss-army-knife regression models instead

No, starting with groupby is much simpler, plus it matched the algorithm we were trying to replace. But once we start adding complexity like conditionally switching between groupbys and checking sample sizes, I think that effort is better spent on a model with a higher ceiling.

zaneselvans commented 2 years ago

Okay here's a pandas-ified version

def weighted_median(df: pd.DataFrame, data: str, weights: str, dropna=True) -> float:
    if dropna:
        df = df.dropna(subset=[data, weights])
    if df.empty | df[[data, weights]].isna().any(axis=None):
        return np.nan
    df = df.loc[:, [data, weights]].sort_values(data)
    midpoint = 0.5 * df[weights].sum()
    if (df[weights] > midpoint).any():
        w_median = df.loc[df[weights].idxmax(), data]
    else:
        cs_weights = df[weights].cumsum()
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights.iloc[idx] == midpoint:
            w_median = df[data].iloc[idx : idx + 2].mean()
        else:
            w_median = df[data].iloc[idx + 1]
    return w_median

Old version: 4min 26s New version: 5min 33s 🤣

But they do produce exactly the same output!

TrentonBush commented 2 years ago

Moving some bits from Slack to here for posterity.

I tried (notebook) a quick implementation of an XGBoost equivalent (lightGBM). Using the same state, month, fuel features as the groupby model, it gave very similar accuracy. After adding a few new features, it cut relative error in about half. Training/inference time was around 10 seconds total. Cross validation shows that the groupby model overfits and performs a bit worse on unseen data. Also, note the difference in counts below. I checked whether the GBDT accuracy was different on the records that the groupby model could predict, but the only significant difference was that other_gas results (now only 9 of them) were much better.

Test set relative error for groupby(['state', 'report_date', 'fuel_group_code']) medians image

Test set results for GBDT. Features: ['state', 'report_date', 'fuel_group_code', 'contract_type_code', 'primary_transportation_mode_code', 'energy_source_code', 'fuel_received_units', 'latitude', 'longitude',] image
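
For reference, a minimal sketch of that kind of GBDT setup using the LightGBM sklearn API -- the feature list follows the one above, the hyperparameters and train/test split are illustrative rather than what the notebook actually used, and latitude / longitude are assumed to have been merged in from the plant attributes:

import lightgbm as lgb
from sklearn.model_selection import train_test_split

df = frc.dropna(subset=["fuel_cost_per_mmbtu"]).copy()
df["report_year"] = df["report_date"].dt.year    # numeric stand-ins for the report_date feature
df["report_month"] = df["report_date"].dt.month

features = [
    "state", "report_year", "report_month", "fuel_group_code",
    "contract_type_code", "primary_transportation_mode_code",
    "energy_source_code", "fuel_received_units", "latitude", "longitude",
]
# LightGBM handles pandas categorical columns natively
for col in df[features].select_dtypes(exclude="number").columns:
    df[col] = df[col].astype("category")

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["fuel_cost_per_mmbtu"], random_state=0
)
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
relative_error = (model.predict(X_test) - y_test) / y_test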

zaneselvans commented 2 years ago

I'm closing this because it's effectively metastasized into #1708