catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

https://catalyst.coop/pudl

MIT License

489 stars 115 forks source link

Estimate monthly plant-level fuel prices w/o using the EIA API #1343

Closed zaneselvans closed 2 years ago

zaneselvans commented 3 years ago

Instead of using the EIA API to pull monthly average fuel costs by state and fuel when individual fuel deliveries have their costs redacted in the fuel_receipts_costs_eia923 table, calculate it for ourselves.

Motivation

This change will address several issues:

The EIA API is missing a fair amount of the data anyway. Sometimes whole state-months are missing. It also only contains data for coarse fuel categories (coal, petroleum, natural gas) rather than the specific fuel types.
Relying on the API means asking users to register for an API key and manage environment variables. This is a barrier for many of our less technical users.
Whenever something goes wrong with the API, our CI tests fail, and we can't work with this data locally. Over time this has been happening more frequently. HTML gets returned instead of JSON, or the network is down.
EIA is discontinuing the v1 API in November, 2022, so our current setup will stop working anyway.
There's a lot of information in the fuel_receipts_costs_eia923 table, and related to the plants and mines and suppliers involved. It should be possible to do a fairly good estimation of the fuel prices from scratch given all that context.

Approach

Estimate fuel prices using a variety of aggregations and use them to fill missing values.
Start with the most granular / accurate and progressively apply less specific estimates until everything is filled in.
Tag each record indicating which estimation was used to fill it in.
Pre-calculate all of the aggregations so that we can look at how they compare with actual values first.
Add each of these aggregations to the original FRC dataframe for plotting.
We should also include the EIA API values for comparison / constraint based on the redacted values.
Looking at the EIA API, only PEL, PC, COW, and NG really have values for \$/MMBTU at census region and state level.
Seems like the very granular fuel types only have prices for US Total, at least at the monthly level.
Use median values of the fuel prices in \$/MMBTU
Maybe calculate a weighted median? Want typical MMBTU, not typical delivery.
@gschivley and Neha suggested using both spatial and temporal interpolation -- averaging prices from the adjacent states, and filling in gaps in the monthly time series when possible.
We could also use a low-effort, but powerful estimator like XGBoost or a random forest to try and incorporate much more information, without designing something bespoke from scratch.
We should be able to benchmark these calculations against the data from the API or the specific information reported in the FRC table by doing some random knockouts to see how well we can recreate the reported values.

Choosing Aggregations

How do we decide how to prioritize aggregations?
Coal prices don't vary much month to month, aggregating annually would have little impact.
Gas & Petroluem prices can vary dramatically month to month, so aggregating across time is bad.
Petroleum fuel prices are highly correlated nationwide, so aggregating geographically has little impact.

Intuitive Aggregation Priorities

Most precise: ["state", "energy_source_code", "report_date"]
Annual aggregation (coal): ["state", "energy_source_code", "report_year"]
Regional aggregation (petroleum): ["census_region", "energy_source_code", "report_date"]
Fuel type aggregation: ["state", "fuel_group_code", "report_date"]
Both regional and fuel type aggregation: ["census_region", "fuel_group_code", "report_date"]
Annual, regional, and fuel type aggregations: ["census_region", "fuel_group_code", "report_year"]

Questions:

Should we use a MMBTU weighted median rather than delivery weighted median?
How should we identify outlier values in the fuel prices which should be replaced? Some are totally whacked.

Other Potential Refinements

Automatically fill using aggregations in order of increasing dispersion of the error distribution (e.g. IQR) rather than hard-coding the order based on intuition and eyeballing it.
Calculate the dispersion of the error distribution on an annual basis, rather than across the entire timeline, in case the temporal, fuel type & spatial correlations change over time.

Remaining tasks:

[x] Always plant_state into the fuel_receipts_costs_eia923 output table all the time.
[x] Add the census regions to state mappings into the metadata enums / constants.
[x] Replace the existing roll & fill method in the fuel_receipts_costs_eia923 output routine.
[x] Update tests to work with the new version of frc_eia923
[x] Remove API_KEY_EIA infrastructure from everywhere in the code, so we aren't unknowingly relying on it.
[x] Make filling in missing fuel prices the default behavior
[x] Fix the filled_by labeling, which is now showing all filled values having national_fgc_year which is the last aggregation.
[x] Remove fuel_group_code from the fuel_receipts_costs_eia923 table and add it to the energy_sources_eia coding table, and add it back into the output function.
[x] Understand why these changes are apparently affecting ouput row counts
[x] Pull the fuel price filling out into its own separate function
[x] Understand why merge_date() is removing ~10k frc_eia923 records.
[x] Implement weighted median function to use in filling & identifying outliers
[x] Add weighted median unit tests
[x] Identify outlying fuel prices using modified z-score with MMBTU weighted median
[ ] Have @cmgosnell look for weirdness in the results of a new MCOE calculation in an RMI context.
[ ] Update release notes
[ ] After merging into main remove API_KEY_EIA from the GitHub secrets.

zaneselvans commented 2 years ago

Hey @katherinelamb I'd like to talk to you about how we might use some ML / regressions to fill in these missing fuel costs if that's an issue you might like to take on.

katie-lamb commented 2 years ago

This definitely sounds interesting and doable.

Potential data for features

who is selling the fuel, who is buying it, what kind of fuel, how much, what kind of contract, prices in neighboring states, historical timeseries

It's probably best to start by trying to reduce the problem to just be either spatial or temporal and then try some regressions and common modeling frameworks. Seems like it's harder but still doable to model both spatially and temporally.
A novel framework for spatio-temporal prediction

zaneselvans commented 2 years ago

While we do have the lat/lon location for all the plants, for the "spatial" dimension I was imagining something like looking at the prices reported in adjacent states if they're available, since I imagine that there are generally regional price trends. I think this is what Aranya / Greg did. E.g. if a given state-month-fuel combination is missing a price, fill it in with the average of the reported price for adjacent states for that fuel and month, and then that filled-in state-month-fuel table could be used to estimate the cost per unit of fuel for any records in the fuel_receipts_costs_eia923 table that have redacted values. But maybe that's more vague and complex than it needs to be?

I guess in an ideal world we'd be able to make predictions / estimations for every record with a missing value, given all of the information available in a very generic way, but that seems like asking a lot. There would be a lot of categorical variables (state, buyer, seller, fuel type, spot market vs. contract purchase) numerical values (quantity of fuel purchased, heat content per unit of fuel, cost of fuel for that month in all adjacent states, or the region, or the country as a whole). And really you could bring in any number of utility, plant, and generator attributes if they were helpful.

zaneselvans commented 2 years ago

I looked into this today and found some things.

The API information we're using to fill in fuel prices right now is coarse in terms of fuel type. It only uses 3 high level categories, one each for all coal (COW), all petroleum derived fuels (PEL), and all natural gas (NG).
However, the FRC table has the much more detailed energy_source_code column that we could use for missing price estimation if we wanted to.
We map these three high level fuel categories to our own fuel_type_code_pudl (coal, oil, gas)... but we lump together different energy_source_code values in our categories than EIA does, which has to be introducing some misalignment between the filled in prices we're using, and what we'll get if we calculate monthly per-state, per-fuel costs.
Even so, looking just at these imperfectly mapped high level categories, and dropping all the NA before calculating the average fuel prices, the correlation is very high between the average coal/oil/gas price per unit as calculated from the FRC, and the value we get from the API.
I suspect we chose to use these high level categories because it was more likely to give us a value to fill in when it was redacted from the FRC table.
However we didn't understand the hierarchies of categories that the EIA was using at the time. Our categorization of coal, oil, and gas is more like solid, liquid, gas, while theirs is more indicative of the original resource the fuel was derived from (e.g. petcoke is solid so we called it coal, but it's derived from oil, so they call it petroleum).
Now that we have these different fuel attributes explicitly broken out in the energy_sources_eia coding table, we could choose to organize things differently if we wanted to, or potentially just use their categorizations instead of rolling our own.

API vs. `frc_eia923`

Here's how the fuel API prices compare to an aggregation from the reported data using the (slightly different) high level categories. Each point corresponds to a particular (report_date, state, fuel_type) combo.

Coal

Oil

Gas

(note the different X and Y scales because of some outliers)

What to do?

Right now we could fill in missing values using the coal/oil/gas aggregated state prices. This would yield slightly different results than we're currently getting because those categories correspond to slightly different sets of records in the EIA and PUDL universes right now. But I don't think the different results would obviously be less correct than the current ones.
We could reform the fuel_type_code_pudl so that it does correspond to the EIA categorizations, and that ought to bring the two calculations closer together.
I think the "right" thing to do is probably to use a regression (random forest?) to try and estimate prices for each combination of (report_date, state, energy_source_code) rather than fuel_type_code_pudl, using whatever information is available in the FRC table. Or I guess we don't necessarily have to restrict ourselves to just looking at state-level estimates. We could try and estimate individual plant level prices and see how it does.

zaneselvans commented 2 years ago

I'm curious what @cmgosnell thinks about the simpler options above as an interim change.

cmgosnell commented 2 years ago

For context on the original choice to use the larger categories, iirc we went this the fuel_type_code_pudl so that we'd get higher coverage. iirc the original way we were doing it (i.e. not what is currently in dev) there was some (unnecessary) limiting factor, so we went with the broad fuel types to get broader coverage.

I agree generally that we can and probably should use the more granular energy_source_code. I assume that whatever we do won't align perfectly with EIA (in part because I could never find their methodology for how they generated these state-level values). I would personally not work with option 2 above (reform fuel_type_code_pudl). The desire imo is something defensible, not something that perfectly matches EIA.

zaneselvans commented 2 years ago

Do you think it makes sense to swap in our own average fuel prices based on the fuel_type_code_pudl aggregations for the time being, if it gives us coverage that's similar to what we're getting from the EIA API (which IIRC, wasn't great)?

Another thought: does it make more sense to try and estimate the price per unit or the price per MMBTU? (where unit is ton, barrel, or mcf). I would think that within these big categories there would be less dispersion on a $/MMBTU basis, since what you're really buying is heat content, not mass (i.e. BIT costs much more than LIG per ton, but they're closer to each other per MMBTU). I think we have the option of doing either one.

cmgosnell commented 2 years ago

I definitely think using fuel_type_code_pudl for the time being makes sense, but i assume its similar math to try it with energy_source_code, so I'd personally check the coverage of the latter before defaulting to fuel_type_code_pudl.

I am less familiar with the unit vs btu distinction, but I think you are on to something! If all the info is there (which I believe it is in the frc table), then yeah go for it... but again i'd check the energy_source_code first. iirc the original coverage of the API fuel source was artificially limited because of something I did that was silly and was later fixed. My gut tells me that we'd get pretty good overall coverage using just energy_source_code with our own data.

I would almost use state-level energy_source_code prices and then try nearby states or just a national average.

zaneselvans commented 2 years ago

Okay, I'll try the simplistic (report_date, state, energy_source_code) and (report_date, state, fuel_type_code_pudl) options and see how the coverage compares to what we're currently doing.

Beyond that, rather than immediately jumping to our own bespoke estimator based on adjacent states, I'd like to try a few generic regression options from scikit-learn that can make use of a lot more features in the FRC table, and that can be trained and validated on the records in the table that do have fuel prices (which is most of them).

I guess we can also easily check whether the price per physical unit or the price per MMBTU is more consistent just by plotting the distributions of observed prices in the dataset per date / state / fuel. It seems like whichever variable has less dispersion would be the easier one to estimate accurately. But curious what @katie-lamb and @TrentonBush think with their stats background.

zaneselvans commented 2 years ago

Maybe @priyald17 or @joshdr83 would have suggestions on how best to approach this too.

zaneselvans commented 2 years ago

For context here's the fuel_receipts_costs_eia923 table on Datasette and the field we're trying to estimate is fuel_cost_per_mmbtu. There are a bunch of additional attributes associated with both the plant_id_eia and the mine_id_pudl including the lat/lon of the plants and the county FIPS code associated with many of the coal mines, that could also be used to construct features.

TrentonBush commented 2 years ago

To clarify the problem:

we want to impute redacted fuel costs
we have previously filled them with one of the values from the EIA API, but that is a pain for various operational reasons
those EIA values are mysterious (to us) calculations that we think are something like a weighted average price that includes:
- all the data we have
- AND the redacted data we don't have

Is that right?

I see two separate parts to this:

How well can we model known prices given all the other info we have in PUDL?
Given the EIA API has information we don't have, how can we use that to validate our imputations of unknown prices?

Relatedly, the first few scoping questions I'd ask are:

how much variation is there in fuel price at a given point in time and/or region? (what are the stakes)
how much data is redacted? Are those particularly important for any reason? (what are the stakes)
do we have reason to think redacted costs are systematically different from known costs? (is this method even valid)

Ideally we test the validity, but remember we do have the option to just "assume and caveat" if the stakes are low. Plus, if EIA has done their job, there won't be a good way to test it.

With respect to modelling, I'd let the "what are the stakes" question guide the degree of complexity (simple averages vs fancier models). I'm not familiar with how redaction works (are the same plants always redacted? Does it come and go? etc), but in principle I think we have everything we need to construct a well formulated ML solution, if that is warranted.

katie-lamb commented 2 years ago

I wrote down these questions that are pretty similar to what Trenton asks above:

Are there particular states that are missing more data than other states or is it a somewhat random dispersion of missing data?
- If Nevada was missing a lot of data then there might not be enough recorded Nevada data to provide an accurate estimate based on state average. It also might not necessarily be a good idea to use California as a proxy for fuel costs just because they are geographically close.
Similarly, are there particular time series that are missing data? Are we missing more data from some fuel types than others? Is there a reason for this?

If the redacted data is kind of "one off" and there is still enough representative data, then doing a regression to estimate prices for each date, state, and fuel type could definitely work. But I also agree that we have all the tools and data needed to do something more comprehensive if it seems that the missing data can't be represented that simplistically.

Although I think I got a little lost at some point about the construction of fuel_type_code_pudl vs.energy_source_code and price per MMBTU vs. price per unit, I think it makes sense to do some feature selection with the attributes of the fuel receipts cost table + energy_source_code and attributes from related tables for each of these potential price attributes that we may try and estimate, instead of assuming that nearest states or something is the best estimator. The scores from this feature selection could also reveal something about what price attribute best to estimate. Maybe the k best features are similar for each estimated variable or maybe there's one variable that stands out as more easily modeled.

zaneselvans commented 2 years ago

The fuel_receipts_costs_eia923 table has about 560,000 records in it, and about 67% of them have reported fuel prices.
I'm not sure the monthly statewide average fuel costs do account for the redacted values, though that would be helpful.
There are cases in which no state-level average is available, and I'm not sure why.
Using the statewide averages and the coarse fuel types from the API, we get 91% coverage. using rolling averages only gets us up to 93% so the time gaps are pretty big.

The distribution of missing values is skewed towards the early years (2008-2011) before filling with the API:

After filling with the API it's pretty uniform across all the years:

zaneselvans commented 2 years ago

What do you think would be a good way to evaluate the dispersion of fuel prices? It seems like there are at least 3 dimensions to consider in sampling and estimating fuel prices:

Spatial: multi-state regions, single states, or individual plants.
Temporal: the entire price history, a single year, a single month, or some seasonally correlated periods.
Fuel Type: coarse fuel groups (coal, oil, gas) or individual energy_source_code values.

Using the more granular options (plants, months, energy_source_code) seems like it should give us lower dispersion results, but it'll also reduce the sample sizes pretty dramatically, reducing confidence in the number we get out. For example, here are some state-level distributions for a single year, and a single energy source code. I chose a mix of big states (CA, FL, NY, TX) medium states (CO, WY) and small states (ID, ME). NG and SUB are the most common fuel types.

Price distribution by state for `NG` (2020)

Price distribution by state for `SUB` (2020)

zaneselvans commented 2 years ago

Rather than just counting individual fuel deliveries up, would it make more sense to weight them by the total heat content of the fuel delivered?

TrentonBush commented 2 years ago

The following is ONLY for gas prices (filtering not shown)

monthly median gets you within about 15% of the true value
a state-month median gets you to within about 10% (+- $0.30ish) of the true value

I'm sure we could tighten that much further with a better model. I'm just not sure what our goals should be here without an application in mind. Maybe a reasonable next step would be to throw a low-effort, high-power model like XGBoost at it and see what kind of accuracy is possible.

# I didn't bother with any train-test split here so these error values should be considered best-case indicators
frc['global_median_fuel_price'] = frc['fuel_cost_per_mmbtu'].median()
frc['month_median_fuel_price'] = frc.groupby('report_date')['fuel_cost_per_mmbtu'].transform(np.median)
frc['state_month_median_fuel_price'] = frc.groupby(['state', 'report_date'])['fuel_cost_per_mmbtu'].transform(np.median)

error = frc.loc[:, [
                           'global_median_fuel_price',
                           'month_median_fuel_price',
                           'state_month_median_fuel_price',
                        ]
               ].sub(frc['fuel_cost_per_mmbtu'], axis=0)

pct_error = error.div(frc['fuel_cost_per_mmbtu'], axis=0)

katie-lamb commented 2 years ago

To first address the goal of removing the EIA API calls from the CI asap:

Do you have an accuracy metric on how well the most granular option (plants, months, energy_source_code) performs against the API results? Or how swapping in state level instead of plant affects accuracy? I think for now you could just sub in the median of whatever combo of the three dimensions gets the highest API accuracy. It seems like it might be state, month, energy code. Although @zaneselvans mentioned there are states with no data (I think you actually said a state-level average is not available), so does this necessitate multi-state aggregations?

After this quick fix, I think looking at what other features we could bring into a more powerful model would be useful.

TrentonBush commented 2 years ago

Are the API results our target or are the redacted values our target? I thought the API was basically just an outsourced imputation model. If it includes redacted data in its aggregates then it has value for validation. Is there anything more to the EIA API?

katie-lamb commented 2 years ago

I believe you're right and the API is just modeling these values. After talking to Zane, I think to get a quick fix for today/tomorrow the API results are our target since people currently using the table don't want the values to change significantly. But to develop a more accurate and comprehensive solution in the less short term, the redacted values are indeed the target.

TrentonBush commented 2 years ago

Oh I didn't know there was a timeline on this at all, much less tomorrow! Do what you gotta do

katie-lamb commented 2 years ago

LOL ya I don't think this discussion is really framed around a timeline and is in fact more long term ideating of modeling tactics. My comment was really just spurred by annoyance that the CI is unreliable and wanting something to stand in there while a better solution is developed.

joshdr83 commented 2 years ago

Hey sorry, just tuning in! I did a spatial interpolation for fuel prices to average down to the county level for new build estimates in this paper. Is it mostly the non ISO regions that are short of data?

zaneselvans commented 2 years ago

I used the space, time, and fuel category aggregations I mentioned above, and calculated the median value for each of them for every row in the dataframe, and added those columns to the dataframe for easy comparison.
I calculated the relative error (reported price - estimate prices) / reported price for all of them, to get a sense of how far off they were from the reported values they were built out of.
Petroleum and natural gas based fuels vary a lot month to month, meaning aggregating across time introduces errors.
Petroleum fuel prices are highly correlated regionally and even nationwide, so aggregating them across space is okay.
Coal prices vary by state and region and type of coal, but not much over time, so aggregating up to annual prices isn't so bad if we have to.
I filled in the missing fuel prices iteratively, stating with the most specific estimates, and working my way down to the least specific estimates, sometimes only filling in for a given fuel group (annual prices for coal, regional prices for petroleum).
It also adds a filled_by column indicating what method was used to fill it in if it was missing.
I used national averages to fill in the last ~2% of the records that regional estimates still couldn't get.
Overall this process only takes about 5 seconds on my laptop and gets us a fuel price estimate for every record.

from collections import OrderedDict
import scipy.stats

cols = [
    "report_date",
    "plant_id_eia",
    "state",
    "energy_source_code",
    "fuel_cost_per_mmbtu",
    "fuel_group_code",
    "fuel_consumed_mmbtu", # required for weighting
]

CENSUS_REGIONS = {
    "ENC": ["WI", "MI", "IL", "IN", "OH"],
    "ESC": ["KY", "TN", "MS", "AL"],
    "MAT": ["NY", "PA", "NJ"],
    "MTN": ["MT", "ID", "WY", "NV", "UT", "CO", "AZ", "NM"],
    "NEW": ["ME", "VT", "NH", "MA", "CT", "RI"],
    "PCC": ["CA", "OR", "WA"],
    "PCN": ["AK", "HI"],
    "SAT": ["WV", "MD", "DE", "DC", "VA", "NC", "SC", "GA", "FL"],
    "WNC": ["ND", "SD", "MN", "IA", "NE", "KS", "MO"],
    "WSC": ["OK", "AR", "TX", "LA"],
}

CENSUS_REGION_MAPPING = {state: region for region, states in CENSUS_REGIONS.items() for state in states}

aggs = OrderedDict({
    # The most precise estimator we have right now
    "state_esc_month": {
        "agg_cols": ["state", "energy_source_code", "report_date"],
        "fuel_group_code": None,
    },
    # Good for coal, since price varies much more with location than time
    "state_esc_year": {
        "agg_cols": ["state", "energy_source_code", "report_year"],
        "fuel_group_code": "coal"
    },
    # Good for oil products, because prices are consistent geographically
    "region_esc_month": {
        "agg_cols": ["census_region", "energy_source_code", "report_date"],
        "fuel_group_code": "petroleum",
    },
    # Less fuel specificity, but still precise date and location
    "state_fgc_month": {
        "agg_cols": ["state", "fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    # Less location and fuel specificity
    "region_fgc_month": {
        "agg_cols": ["census_region", "fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    "region_fgc_year": {
        "agg_cols": ["census_region", "fuel_group_code", "report_year"],
        "fuel_group_code": None,
    },

    "national_esc_month":{
        "agg_cols":  ["energy_source_code", "report_date"],
        "fuel_group_code": None,
    },
    "national_fgc_month": {
        "agg_cols": ["fuel_group_code", "report_date"],
        "fuel_group_code": None,
    },
    "national_fgc_year": {
        "agg_cols": ["fuel_group_code", "report_year"],
        "fuel_group_code": None,
    },
})

frc = (
    pudl_out.frc_eia923()
    .loc[:, cols]
    .assign(
        report_year=lambda x: x.report_date.dt.year,
        census_region=lambda x: x.state.map(CENSUS_REGION_MAPPING),
        fuel_cost_per_mmbtu=lambda x: x.fuel_cost_per_mmbtu.replace(0.0, np.nan),
        filled=lambda x: x.fuel_cost_per_mmbtu,
    )
)

frc["filled_by"] = pd.NA
frc.loc[frc.fuel_cost_per_mmbtu.notna(), ["filled_by"]] = "original"

for agg in aggs:
    agg_cols = aggs[agg]["agg_cols"]
    fgp = aggs[agg]["fuel_group_code"]

    logger.info(f"Aggregating fuel prices by {agg}")
    frc[agg] = (
        frc.groupby(agg_cols)["fuel_cost_per_mmbtu"]
        .transform("median") # Use weighted median to avoid right-skew
    )

    logger.info(f"Calculating error for {agg}")
    frc[agg+"_err"] = (frc[agg] - frc.fuel_cost_per_mmbtu) / frc.fuel_cost_per_mmbtu

    logger.info(f"Filling fgp with {agg}")
    if fgp:
        logger.info(f"Filling only {fgp}")
    mask = (
        (frc.filled.isna())
        & (frc[agg].notna())
        & (frc.fuel_group_code == fgp if fgp else True)
    )
    logger.info(f"Mask selects {sum(mask)} values.")
    frc.loc[mask, "filled_by"] = agg
    frc.loc[mask, "filled"] = frc.loc[mask, agg]

frc.filled_by.value_counts()

original              373841
state_esc_month       127336
region_fgc_month       30420
national_esc_month     10091
region_esc_month        9823
state_fgc_month         6836
national_fgc_month      1079
region_fgc_year          702
state_esc_year           232
Name: filled_by, dtype: int64

Error distributions by fuel group

Generally these lined up with my intuition about which kinds of aggregation would introduce more error (as represented by the interquartile range of the error distribution). The ordering is different for the different fuel types generally though, and we could just automatically apply all of them on a per-fuel basis, in order of increasing IQR. The impact would be pretty marginal, but it would also do the right thing if future data looked different. We could also break out each year separately and apply that process, so that if the nature of the fuel price correlations changes over time, it would capture and and do the right thing each year (e.g. if lots of gas pipelines get built, maybe regional variation in gas prices would go away, and in those later years averaging across large areas would have much less impact than it does now...).

matplotlib.rcParams["figure.figsize"] = (10, 9)
for fgc in ["coal", "petroleum", "natural_gas"]:
    fig, axes = plt.subplots(ncols=3, nrows=3, sharex=True, sharey=True)
    for agg, ax in zip(aggs, axes.flatten()):
        logger.info(f"Plotting {agg} error")
        error = frc.loc[frc.fuel_group_code==fgc, agg+"_err"]
        ax.hist(error[(error != 0.0)], range=(-1,1), bins=200)
        iqr = scipy.stats.iqr(error, nan_policy="omit")
        ax.set_title(f"{agg} (iqr={iqr:0.3})")
        ax.axvline(x=0, lw=1, ls="--", color="black")

    fig.suptitle(f"Fuel Group: {fgc}", fontsize=16)
    fig.tight_layout()
    plt.show();

Filled vs. Reported Distributions

To see how different the filled values are from the reported values, and whether there's an obvious bias being introduced, I made the plots below. It looks like maybe there's a sliiiiight rightward skew in the filled values for coal and natural gas, but not sure it's significant.

fig, axes = plt.subplots(ncols=3)#, sharex=True, sharey=True)
fuel_group_codes = ["coal", "petroleum", "natural_gas"]
matplotlib.rcParams["figure.figsize"] = (10, 4)
for fgc, ax in zip(fuel_group_codes, axes):
    ax.hist(
        frc.loc[frc.fuel_group_code == fgc, "fuel_cost_per_mmbtu"],
        label="reported",
        alpha=0.5,
        range=(0,40),
        bins=80,
        density=True,
    )
    ax.hist(
        frc.loc[(frc.fuel_group_code == fgc) & (frc.fuel_cost_per_mmbtu.isna()), "filled"],
        label="filled",
        alpha=0.5,
        range=(0,40),
        bins=40,
        density=True,
    )
    ax.legend(loc="upper right")
    ax.set_title(f"Fuel Group: {fgc}")
    ax.set_xlabel("Fuel Price [$/MMBTU]")
fig.tight_layout()
plt.show()

zaneselvans commented 2 years ago

To compare the filled in values from the API vs. our aggregations for each fuel group and aggregation method, I made this plot.

sns.relplot(
    data=frc[
        (frc.fuel_group_code.isin(["coal", "petroleum", "natural_gas"]))
        & (frc.fuel_cost_from_eiaapi)
        & (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
    ],
    x="fuel_cost_per_mmbtu",
    y="filled",
    row="fuel_group_code",
    col="filled_by",
    facet_kws=dict(xlim=(0,40), ylim=(0,40)),
    linewidth=0.0,
    alpha=0.1,
    s=5,
)

Here the x-axes are the API based fuel price, and the y-axes are the aggregation based fuel prices. Columns are the aggregation methods, and rows are the fuel groups. In the aggregation methods fgc = fuel group code, and esc = energy source code (the more specific fuels)

For the petroleum and natural_gas fuel groups, there's a good 1:1 correlation between the API and aggregation based prices, but it's harder to see what's going on with coal because the points are all bunched together so I zoomed in on those prices only:

sns.relplot(
    data=frc[
        (frc.fuel_group_code.isin(["coal"]))
        & (frc.fuel_cost_from_eiaapi)
        & (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
    ],
    x="fuel_cost_per_mmbtu",
    y="filled",
    col="filled_by",
    col_wrap=3,
    facet_kws=dict(xlim=(0,8), ylim=(0,8)),
    linewidth=0.0,
    alpha=0.1,
    s=5,
)

For the state-level aggregations, there's some scatter, but generally a correlation between the API and aggregation based fuel prices. However, for the regional and national averages, it seems like there's a more substantial difference between the two methods. Some of the variability that's present in the API results is getting flattened by these larger spatial averages.

This isn't to say that the new method is necessarily worse, since there were fuel type categorizations are worse, given the fuel grouping differences between PUDL and EIA, but it does suggest that there's some benefit to be had by refining the imputation further, especially since one of our main applications of this data is to identify unusually expensive coal plants for advocates. We also have the most additional information about the coal deliveries, since we usually know the mine name, what state it's in, maybe a unique ID for the mine, the supplier name, level of ash and sulfur in the coal, what kind of contract it was delivered on, etc.

Here are the proportions of coal records that are being filled in with the different aggregations:

frc[frc.fuel_group_code=="coal"].filled_by.value_counts() / sum(frc.fuel_group_code=="coal")

original              0.713344
state_esc_month       0.144066
region_fgc_month      0.067036
national_esc_month    0.042347
state_fgc_month       0.026690
national_fgc_month    0.004286
state_esc_year        0.001138
region_fgc_year       0.001093

About 11% of all coal records end up getting regional or national average prices assigned to them, and that's a little more than 1/3 of the total coal records which are being filled in.

I'm inclined to say that this is good enough for now, but that because these are probably some of the records we care most about, it will be worth coming back to this issue and trying to use a more generic model like XGBoost or a random forest that can utilize the other information we have.

What do y'all think?

TrentonBush commented 2 years ago

I was ready to say "yes this is good enough", especially considering the results were so similar to our previous method of EIA API values. But now I'm confused!

Here the x-axes are the API based fuel price

I think I'm not understanding exactly what these plots are -- I thought your scatter plots near the top of this issue showed that API vs state-month average were extremely linear. Why are the state-fgc-month results so much different here? I'd be surprised if changing the agg func from mean to median made that much difference. If it did, that's interesting to know!

zaneselvans commented 2 years ago

@TrentonBush The earlier scatter plots are comparing all the reported data -- so the ones where there actually was data in the FRC table, and they're only being aggregated by [state, month, fuel_group]

The more scatter recent plots are only looking at data points that were not present in the FRC table, and comparing the values which were filled in by our new method (breaking it out into all the different kinds of aggregation used) vs. the API values. So it's not surprising that the correlation is worse in general.

TrentonBush commented 2 years ago

Oh ok, now if I squint I can see how they are related -- basically remove the line and just look at the dispersion, plus add some x-coordinate noise to account for mean to median. Ya this model makes sense to me. I think using the median is justified given the handful of big outliers I ran into. We definitely want to use the mean for the purpose of disaggregating the API results in the future though.

one of our main applications of this data is to identify unusually expensive coal plants for advocates

That's an awesome use case I'd love to learn more about! There is certainly more accuracy to wring out of this data, and that kind of anomaly detection would probably require it. Maybe we find a day or two next sprint to try other models

zaneselvans commented 2 years ago

Yeah, there are some wild outliers. I want to figure out how to remove them and fill them in with reasonable values. Some of them are out of bounds by factors of like 1000x -- natural gas is particularly messy. The mean values were quite skewed.

Given that the values aren't really normally distributed, would it still make sense to replace anything more than 3-4$\sigma$ away from the mean? I could drop the top and bottom 0.1% of the distribution too, but there aren't always big outliers, and sometimes more than 0.1% of the points are bad. Not sure what the right thing to do is.

zaneselvans commented 2 years ago

I'm doing some additional normalization to the table and moving the fuel_group_code to the energy_sources_eia table, since the value is entirely determined by energy_source_code with this mapping:

assert (frc.groupby("energy_source_code").fuel_group_code.nunique() == 1).all()
dict(frc.groupby("energy_source_code").fuel_group_code.first())

{'BFG': 'other_gas',
 'BIT': 'coal',
 'DFO': 'petroleum',
 'JF': 'petroleum',
 'KER': 'petroleum',
 'LIG': 'coal',
 'NG': 'natural_gas',
 'OG': 'other_gas',
 'PC': 'petroleum_coke',
 'PG': 'other_gas',
 'RFO': 'petroleum',
 'SC': 'coal',
 'SGP': 'other_gas',
 'SUB': 'coal',
 'WC': 'coal',
 'WO': 'petroleum'}

I'm also renaming the column to energy_group_eiaepm since it's the fuel grouping that's used for reporting in the EIA Electric Power Monthly.

zaneselvans commented 2 years ago

That's an awesome use case I'd love to learn more about!

This is basically the RMI ReFi project! The fuel is by far the biggest component of the OpEx for coal plants, and this data has been used to identify which plants are vulnerable to getting refinanced out of existence.

zaneselvans commented 2 years ago

Okay, definitely some room for improvement. E.g. look at how how coal prices in NY drop dramatically (in the filled in data) starting in 2012 here:

def facet_fuel(
    frc: pd.DataFrame,
    states: list[str],
    fuel_facet: Literal["energy_source_code", "fuel_group_eiaepm", "fuel_type_code_pudl"],
    by: Literal["fuel", "state"] = "state",
    fuels: list[str] | None = None,
    max_price: float = 100.0,
    facet_kws: dict | None = None,
) -> None:
    mask = (
        (frc.state.isin(states))
        & (frc.fuel_cost_per_mmbtu <= max_price)
        & (frc[fuel_facet].isin(fuels) if fuels else True)
    )
    facet = fuel_facet if by == "state" else "state"
    sns.relplot(
        data=frc[mask],
        x="report_date",
        y="fuel_cost_per_mmbtu",
        hue=facet,
        style=facet,
        row=fuel_facet if by == "fuel" else "state",
        kind="line",
        height=4,
        aspect=2.5,
        facet_kws=facet_kws,
    );

facet_fuel(
    frc,
    fuel_facet="fuel_group_eiaepm",
    states=["CA", "NY", "FL", "TX"],
    fuels=["coal", "natural_gas", "petroleum"],
    by="fuel",
    facet_kws={"sharey": False},
)

zaneselvans commented 2 years ago

I ran tox -e nuke to generate a new database with the slight schema changes, and validate its contents, and ended up getting the capacity_mw and row count errors that @katie-lamb was encountering after the date_merge updates, so I merged the changes from her current PR (#1695) and re-ran the validations, but I'm still getting a bunch of row count errors, so maybe they really are due to something that changed here?

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 214669 rows, expected 214613. Off by 0.026%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 213382 rows, expected 213329. Off by 0.025%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 1964 rows, expected 1961. Off by 0.153%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 2492441 rows, expected 2491704. Off by 0.030%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 23498 rows, expected 23445. Off by 0.226%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%

It's not clear off the top of my head why the changes in this PR would result in small changes to the number of rows in these other tables though. Does that make sense to anyone else?

zaneselvans commented 2 years ago

Comparing old expectations, new expectations, and the rows we've got here...

table	observed rows	old rows	date_merge rows	matches
gens_eia860_raw	491,469	491,469	490,824	old
gens_eia860_annual	491,469	491,469	490,824	old
gens_eia860_monthly	491,469	491,469	490,824	old
gf_eia923_raw	2,507,830	2,507,870	2,507,040	neither
gf_eia923_annual	214,669	214,709	214_613	neither
gf_eia923_monthly	2,507,830	2,507,870	2,507,040	neither
gf_nonuclear_eia923_annual	213,382	213,422	213,329	neither
gf_nonuclear_eia923_monthly	2,492,441	2,492,481	2,491,704	neither
gf_nuclear_eia923_annual	1,964	1,964	1,961	old
gf_nuclear_eia923_monthly	23,498	23,498	23,445	old
plants_eia860_raw	171,570	171,570	171,148	old
plants_eia860_annual	171,570	171,570	171,148	old
plants_eia860_monthly	171,570	171,570	171,148	old
pu_eia860_raw	170,792	170,792	170,370	old
pu_eia860_annual	170,792	170,792	170,370	old
pu_eia860_monthly	170,792	170,792	170,370	old
utils_eia860_raw	113,464	113,464	113,357	old
utils_eia860_annual	113,464	113,464	113,357	old
utils_eia860_monthly	113,464	113,464	113,357	old

zaneselvans commented 2 years ago

More weirdness.

The number of FRC records filled with each of the different aggregations seems a little different now than it was when I was originally developing the process in a notebook.

After inserting it into the midst of the pudl.output.fuel_receipts_costs_eia923() function only one of the national aggregations has any records associated with it.
We lose 11,039 records when date_merge() is run on the FRC output dataframe near the end of the output function. Need to figure out what those records are and why.
It also turns out there are about 10,500 records in the FRC table whose plant_id_eia values don't have states associated with them in the plants_entity_eia table, which seems pretty weird. They all start with 88, and their plant records are almost entirely null.
The stateless FRC records are relatively flatly distributed across years. They're 7236 coal, 1605 natural gas, 1504 petroleum, and 166 petcoke.

The lack of states is surprising, but seems like that's just how it is. The loss of a bunch of records in the date_merge() process is confusing. @katie-lamb and @cmgosnell could we talk about what might be going on there? I'm wondering if it might be due to the fact that the FRC table isn't a "time series" in that the report_date field isn't part of a primary key -- it's just a bunch of transactions, and there are some valid duplicate records. We use an autogenerated pseudo-key in the database. I'll compile a dataframe of the records that are getting lost.

katie-lamb commented 2 years ago

After running the ETL, I updated the row counts and merged that PR into dev so that should fix some of your issues @zaneselvans

zaneselvans commented 2 years ago

Ah okay so some of the row counts in your branch weren't actually the expected ones?

zaneselvans commented 2 years ago

@katie-lamb any intuition as to why merge_date() would be causing ~11,000 rows to get dropped from the FRC table?

zaneselvans commented 2 years ago

Comparing the records that exist before and after date_merge() it looks like all of the records for plants that are missing a state field (10,511) are getting dropped for some reason, plus 528 additional records that do have states associated with them. Note that the plant IDs that don't have states are also overwhelmingly NULL in the entity tables, so if there's any additional merging that's happening on other columns in there maybe that would mess things up?

Are these valid duplicate records that are getting treated strangely? If I list all of the duplicate records in the FRC table that I've pulled directly from the DB, it gives me 3510 of them, and all but one of them has a state, so it doesn't seem to be that.

In the same method chain where date_merge() is happening there's also a dropna(subset=["utility_id_eia"]) -- I bet that's where the problem is. The plants that don't have states are also missing almost everything else, so probably they're missing utility IDs. And then a smattering of the other plants just happen to be missing Utility IDs for some other reason.

Counting the rows in the dataframe before / after that dropna() specifically, that's indeed where the 11,000 records drop out. Whoops.

For the purposes of the FRC table is it absolutely necessary that we have the utility ID? The data is really about plants, fuels, and dates, so it seems like this isn't strictly required. If I remove it what happens...

In that case, we keep all the FRC records, but all of the fields which are brought in by merging with the results of pudl.output.eia860.plants_utils_eia860() are NA (plant name, PUDL ID, etc.). which may cause other unexpected problems downstream, so this probably isn't ideal, and the real issue is with how plants_utils_eia860() is working in cases where there are some missing ID fields, so I think I should make an issue to deal with this... #1700

zaneselvans commented 2 years ago

Looking for ways to identify outliers in not-so-normal distributions I came across the modified z-score, which is analogous to the z-score in normal distributions, but instead of looking at how many standard deviations away from the mean a data point is, it looks at how many Median Absolute Deviations (MADs) away from the median it is:

mod_zscore = abs(0.6745 * (data - data.median()) / data.mad())

This seems like the kind of measure we want.

In messing around with this I also ended up looking at the median (aka delivery weighted median) vs. (mmbtu) weighted median values of fuel prices in the FRC table, and they're pretty different. I think this is because crazy expensive fuel deliveries tend to be very small -- so they make up a disproportionate number of deliveries (records) compared to how much heat content they represent. The delivery weighted median is $3.25 / MMBTU but the MMBTU weighted median is $2.09 / MMBTU.

A weighted median function that can be used with df.groupby().apply():

def weighted_median(df: pd.DataFrame, data: str, weights: str) -> float:
    df = df.dropna(subset=[data, weights])
    s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))
    midpoint = 0.5 * np.sum(s_weights)
    if any(df[weights] > midpoint):
        w_median = df[data][weights == np.max(df[weights])].array[0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx:idx+2])
        else:
            w_median = s_data[idx+1]
    return w_median

Grouping by report_year and fuel_group_eiaepm and calculating the modified z-score within each group, we get a pretty clean normal-ish distribution, with some long straggling tails that I think we can probably cut off.

DATA_COLS = [
    "plant_id_eia",
    "report_date",
    "energy_source_code",
    "fuel_cost_per_mmbtu",
    "fuel_received_units",
    "fuel_mmbtu_per_unit",
]

GB_COLS = ["report_year", "fuel_group_eiaepm"]

MAX_MOD_ZSCORE = 5.0

logger.info("Query PUDL DB")
frc_db = pd.read_sql("fuel_receipts_costs_eia923", pudl_engine, columns=DATA_COLS)
plant_states = pd.read_sql("SELECT plant_id_eia, state FROM plants_entity_eia", pudl_engine)
fuel_group_eiaepm = pd.read_sql("SELECT code AS energy_source_code, fuel_group_eiaepm FROM energy_sources_eia", pudl_engine)

logger.info("Join tables and calculate derived values")
frc_db = (
    frc_db
    .merge(plant_states, on="plant_id_eia", how="left", validate="many_to_one")
    .merge(fuel_group_eiaepm, on="energy_source_code", how="left", validate="many_to_one")
    .assign(
        report_year=lambda x: x.report_date.dt.year,
        fuel_mmbtu_total=lambda x: x.fuel_received_units * x.fuel_mmbtu_per_unit,
    )
    .pipe(apply_pudl_dtypes, group="eia")
)

logger.info("Calculate weighted median fuel price")
weighted_median_fuel_price = (
    frc_db
    .groupby(GB_COLS)
    .apply(weighted_median, data="fuel_cost_per_mmbtu", weights="fuel_mmbtu_total")
)
weighted_median_fuel_price.name = "weighted_median_fuel_price"
weighted_median_fuel_price = weighted_median_fuel_price.to_frame().reset_index()
frc_db = frc_db.merge(weighted_median_fuel_price, how="left", on=GB_COLS, validate="many_to_one")

logger.info("Calculate weighted fuel price MAD")
frc_db["delta"] = abs(frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price)
frc_db["fuel_price_mad"] = frc_db.groupby(GB_COLS)["delta"].transform("median")

logger.info("Calculate modified z-score")
frc_db["fuel_price_mod_zscore"] = (0.6745 * (frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price) / frc_db.fuel_price_mad).abs()

fraction_outliers = sum(frc_db.fuel_price_mod_zscore > MAX_MOD_ZSCORE) / len(frc_db)
logger.info(f"Modified z-score {MAX_MOD_ZSCORE} labels {fraction_outliers:0.2%} of all prices as outliers")

All fuels:

plt.hist(frc_db.fuel_price_mod_zscore, bins=80, range=(0,8), density=True)
plt.title("Modified z-score when grouped by fuel and year");

Fuel groups

sns.displot(
    data=frc_db[frc_db.fuel_group_eiaepm.isin(["coal", "natural_gas", "petroleum"])],
    kind="hist",
    x="fuel_price_mod_zscore",
    row="fuel_group_eiaepm",
    bins=80,
    binrange=(0,8),
    height=2.5,
    aspect=3,
    stat="density",
    linewidth=0,
);

zaneselvans commented 2 years ago

I think have weighted, modified z-score outlier detection working, but am wondering how to apply it appropriately.

Calculate the wmod_zscore for high-level aggregations (like year, fuel group) at the beginning and then let each of the subsequent aggregations fill in the NAs.
Calculate the wmod_zscore for each individual aggregation as we go, so that "outliers" are being identified relative to the distribution whose median value would be getting used to fill in missing / outlying values.

The latter sounds more correct, but some of the groupings won't have a large number of records in them.

Also: what's the right wmod_zscore value to use as the cutoff for identifying outliers? At wmod_zscore == 5.0 it identifies about 1.5% of all records as outliers, mostly natural gas. Those outlier records account for only 0.2% of all reported MMBTU.

zaneselvans commented 2 years ago

Outliers identified using modified z-score of aggregations, with max_mod_zscore == 5.0:

agg_mod_zscore=["report_year", "fuel_group_eiaepm"]: 1.5% of records (0.2% of mmbtu). No empty groups.
agg_mod_zscore=["report_year", "fuel_group_eiaepm", "census_region"]: 1.73% of records (0.37% of mmbtu). A few empty groups.
agg_mod_zscore=["report_year", "fuel_group_eiaepm", "state"]: 1.89% of records (0.74% of mmbtu). Lots of empty groups.
agg_mod_zscore=["report_date", "fuel_group_eiaepm", "census_region"]: 2.18% of records (0.89% of mmbtu). Lots of empty groups.

TrentonBush commented 2 years ago

"Robust" outlier detection metrics are great! Their downsides are that 1) they can be too aggressive, especially on skewed data and 2) as you've noticed, they are slow because of all the O(n*logn) sorting.

I'm of two minds here. On one hand, it'd be fun and beneficial to tune this model for both accuracy and performance. On the other, we're considering replacing all this with a better model anyway, so maybe we should wait on any tuning until we make that determination. This outlier detection problem probably fits in better the with "improve the imputation model" tasks because both imputation and outlier detection might use the same model. Also, outlier detection is a scope increase from the original "remove the EIA API" goals. The very linear plots of state-month averages vs EIA API imply that EIA isn't doing much outlier removal, which means our users have been using data with outliers averaged in the whole time. I think it makes sense to change things once and well rather than possibly twice.

Accuracy improvement ideas

I'd start by seeing if we can un-skew the price distributions before (and maybe instead of) applying the robust metrics. A variance-stabilizing transform like log or Box-Cox might make the data "normal" enough to use regular z-scores.

Second, have we checked what the minimum sample sizes are for some of these bins? As far as I can tell, the existing method calculates a bunch of types of aggregates and iteratively fills in NaN values in order of model precision. Sometimes a key combination like ["Rhode Island", "2008-01-01", "Coal"] might be empty and produce np.nan, which is later filled by a coarser aggregate, but other times it might only have a handful of plants and produce weird metrics as a result. Now that we're using this model for both prediction and outlier detection, it could cause problems in both.

Finally, the model we've built here is basically a decision tree, just manually constructed instead of automated. Problems like choosing the right grain of aggregation or enforcing a minimum sample size per bin are taken care of automatically by those algorithms!

Performance improvement ideas

Medians are always going to be slower, but the line s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights])))) jumped out at me because of the use of python iterators on pandas/numpy objects. I think we could probably get the same result faster with something involving df.loc[:, [data, weights]].sort_values(data), df.loc[:, weights].cumsum(), and np.searchsorted

zaneselvans commented 2 years ago

I will admit that

 s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))

was just something I adapted from an example which was operating on lists or numpy arrays, and I didn't completely parse what it is doing to try and re-write it intelligently... like what is going on with the zip inside a zip and unpacking the sorted list. Is it obvious to you what it's doing? I'm not even sure what it means to sort a list of tuples necessarily.

zaneselvans commented 2 years ago

The minimum sample size in some of these bins is definitely small, but it'll use the small sample if it can (even if that's a bad idea). Requiring a bigger minimum sample size would result in more frequently resorting to the less specific estimations, but with many more samples.

Should we have just gone straight to using one of the Swiss-army-knife regression models instead, and just started with a modest set of input columns as features, to be expanded / engineered more later? I haven't really worked with them so I don't know how they're finicky or what the pitfalls are.

TrentonBush commented 2 years ago

I will admit that was just something I adapted from an example

Let he who has not copied from StackOverflow cast the first stone 😂

I had no clue what sorting tuples did, but I just tried it and it sorts them by their first value. I think that pattern is designed for python objects: when given iterables representing columns, the inner zip links the values by row so that sorted doesn't break the connection between them, then the outer zip puts them back into columnar layout. Mapping np.array then turns each column into a separate array. But pandas can accomplish the same thing with df.sort_values(<my_column>) and some indexing.

Also I just noticed the python any() function, which I think will also resort to python iteration. Using df.any() or np.any() will be much faster (relatively, probably not even 100ms difference at n=500k).

Should we have just gone straight to using one of the Swiss-army-knife regression models instead

No, starting with groupby is much simpler, plus it matched the algorithm we were trying to replace. But once we start adding complexity like conditionally switching between groupbys and checking sample sizes, I think that effort is better spent on a model with a higher ceiling.

zaneselvans commented 2 years ago

Okay here's a pandas-ified version

def weighted_median(df: pd.DataFrame, data: str, weights: str, dropna=True) -> float:
    if dropna:
        df = df.dropna(subset=[data, weights])
    if df.empty | df[[data, weights]].isna().any(axis=None):
        return np.nan
    df = df.loc[:, [data, weights]].sort_values(data)
    midpoint = 0.5 * df[weights].sum()
    if (df[weights] > midpoint).any():
        w_median = df.loc[df[weights].idxmax(), data]
    else:
        cs_weights = df[weights].cumsum()
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights.iloc[idx] == midpoint:
            w_median = df[data].iloc[idx : idx + 2].mean()
        else:
            w_median = df[data].iloc[idx + 1]
    return w_median

Old version: 4min 26s New version: 5min 33s 🤣

But they do produce exactly the same output!

TrentonBush commented 2 years ago

Moving some bits from Slack to here for posterity.

I tried (notebook) a quick implementation of an XGBoost equivalent (lightGBM). Using the same state, month, fuel features as the groupby model, it gave very similar accuracy. After adding a few new features, it cut relative error in about half. Training/inference time were around 10 seconds total Cross validation shows that the groupby model overfits and performs a bit worse on unseen data. Also, note the difference in counts below. I checked whether the GBDT accuracy was different on the records that the groupby model could predict, but the only significant difference was that other_gas results (now only 9 of them) were much better.

Test set relative error for groupby(['state', 'report_date', 'fuel_group_code']) medians

Test set results for GBDT. Features: ['state', 'report_date', 'fuel_group_code', 'contract_type_code', 'primary_transportation_mode_code', 'energy_source_code', 'fuel_received_units', 'latitude', 'longitude',]

zaneselvans commented 2 years ago

I'm closing this because it's effectively metastasized into #1708

catalyst-cooperative / pudl

Estimate monthly plant-level fuel prices w/o using the EIA API #1343

Motivation

Approach

Choosing Aggregations

Intuitive Aggregation Priorities

Questions:

Other Potential Refinements

Remaining tasks:

API vs. frc_eia923

Coal

Oil

Gas

What to do?

Price distribution by state for NG (2020)

Price distribution by state for SUB (2020)

Error distributions by fuel group

Filled vs. Reported Distributions

All fuels:

Fuel groups

Accuracy improvement ideas

Performance improvement ideas

API vs. `frc_eia923`

Price distribution by state for `NG` (2020)

Price distribution by state for `SUB` (2020)