Hey @katherinelamb I'd like to talk to you about how we might use some ML / regressions to fill in these missing fuel costs if that's an issue you might like to take on.
This definitely sounds interesting and doable.
Potential data for features
It's probably best to start by trying to reduce the problem to just be either spatial or temporal and then try some regressions and common modeling frameworks. Seems like it's harder but still doable to model both spatially and temporally.
A novel framework for spatio-temporal prediction
While we do have the lat/lon location for all the plants, for the "spatial" dimension I was imagining something like looking at the prices reported in adjacent states if they're available, since I imagine that there are generally regional price trends. I think this is what Aranya / Greg did. E.g. if a given state-month-fuel combination is missing a price, fill it in with the average of the reported price for adjacent states for that fuel and month, and then that filled-in state-month-fuel table could be used to estimate the cost per unit of fuel for any records in the `fuel_receipts_costs_eia923` table that have redacted values. But maybe that's more vague and complex than it needs to be?
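Here's a minimal sketch of what that adjacent-state filling might look like, just to make the idea concrete. The `ADJACENT_STATES` map below only covers a few states for illustration, and `prices` is assumed to be a state/month/fuel table of average prices using the column names that come up elsewhere in this thread.

```python
import pandas as pd

ADJACENT_STATES = {  # illustrative subset, not a complete adjacency map
    "CO": ["WY", "NE", "KS", "OK", "NM", "UT"],
    "WY": ["MT", "SD", "NE", "CO", "UT", "ID"],
    "UT": ["ID", "WY", "CO", "NM", "AZ", "NV"],
}

def fill_from_neighbors(prices: pd.DataFrame) -> pd.DataFrame:
    """Fill missing state/month/fuel prices with the mean of adjacent states."""
    out = prices.copy()
    missing = out[out.fuel_cost_per_mmbtu.isna()]
    for idx, row in missing.iterrows():
        neighbors = ADJACENT_STATES.get(row.state, [])
        neighbor_prices = prices.loc[
            (prices.state.isin(neighbors))
            & (prices.report_date == row.report_date)
            & (prices.fuel_type_code_pudl == row.fuel_type_code_pudl),
            "fuel_cost_per_mmbtu",
        ]
        # Stays NaN if no adjacent state reported a price for that month and fuel
        out.loc[idx, "fuel_cost_per_mmbtu"] = neighbor_prices.mean()
    return out
```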
I guess in an ideal world we'd be able to make predictions / estimations for every record with a missing value, given all of the information available in a very generic way, but that seems like asking a lot. There would be a lot of categorical variables (state, buyer, seller, fuel type, spot market vs. contract purchase) and numerical values (quantity of fuel purchased, heat content per unit of fuel, cost of fuel for that month in all adjacent states, or the region, or the country as a whole). And really you could bring in any number of utility, plant, and generator attributes if they were helpful.
I looked into this today and found some things.
- The EIA API only provides fuel prices for a few broad categories: all coal (`COW`), all petroleum derived fuels (`PEL`), and all natural gas (`NG`).
- The `frc_eia923` table has a more granular `energy_source_code` column that we could use for missing price estimation if we wanted to.
- We have our own high-level `fuel_type_code_pudl` (`coal`, `oil`, `gas`)... but we lump together different `energy_source_code` values in our categories than EIA does, which has to be introducing some misalignment between the filled in prices we're using, and what we'll get if we calculate monthly per-state, per-fuel costs.
- Since these groupings are defined in our `energy_sources_eia` coding table, we could choose to organize things differently if we wanted to, or potentially just use their categorizations instead of rolling our own.

Here's how the fuel API prices compare to an aggregation of the reported data in `frc_eia923` using the (slightly different) high level categories. Each point corresponds to a particular (`report_date`, `state`, `fuel_type`) combo. (Note the different X and Y scales because of some outliers.)
A couple of simpler options come to mind:

- We could redefine `fuel_type_code_pudl` so that it does correspond to the EIA categorizations, and that ought to bring the two calculations closer together.
- We could estimate prices per (`report_date`, `state`, `energy_source_code`) rather than `fuel_type_code_pudl`, using whatever information is available in the FRC table. Or I guess we don't necessarily have to restrict ourselves to just looking at state-level estimates. We could try and estimate individual plant level prices and see how it does.

I'm curious what @cmgosnell thinks about the simpler options above as an interim change.
For context on the original choice to use the larger categories: iirc we went with the `fuel_type_code_pudl` so that we'd get higher coverage. iirc the original way we were doing it (i.e. not what is currently in `dev`) there was some (unnecessary) limiting factor, so we went with the broad fuel types to get broader coverage.
I agree generally that we can and probably should use the more granular `energy_source_code`. I assume that whatever we do won't align perfectly with EIA (in part because I could never find their methodology for how they generated these state-level values). I would personally not work with option 2 above (reform `fuel_type_code_pudl`). The desire imo is something defensible, not something that perfectly matches EIA.
Do you think it makes sense to swap in our own average fuel prices based on the `fuel_type_code_pudl` aggregations for the time being, if it gives us coverage that's similar to what we're getting from the EIA API (which IIRC, wasn't great)?
Another thought: does it make more sense to try and estimate the price per unit or the price per MMBTU? (where unit is ton, barrel, or mcf). I would think that within these big categories there would be less dispersion on a $/MMBTU basis, since what you're really buying is heat content, not mass (i.e. BIT costs much more than LIG per ton, but they're closer to each other per MMBTU). I think we have the option of doing either one.
I definitely think using `fuel_type_code_pudl` for the time being makes sense, but I assume it's similar math to try it with `energy_source_code`, so I'd personally check the coverage of the latter before defaulting to `fuel_type_code_pudl`.

I am less familiar with the unit vs btu distinction, but I think you are on to something! If all the info is there (which I believe it is in the frc table), then yeah go for it... but again I'd check the `energy_source_code` first. iirc the original coverage of the API fuel source was artificially limited because of something I did that was silly and was later fixed. My gut tells me that we'd get pretty good overall coverage using just `energy_source_code` with our own data.
I would almost use state-level `energy_source_code` prices and then try nearby states or just a national average.
Okay, I'll try the simplistic (`report_date`, `state`, `energy_source_code`) and (`report_date`, `state`, `fuel_type_code_pudl`) options and see how the coverage compares to what we're currently doing.
Beyond that, rather than immediately jumping to our own bespoke estimator based on adjacent states, I'd like to try a few generic regression options from `scikit-learn` that can make use of a lot more features in the FRC table, and that can be trained and validated on the records in the table that do have fuel prices (which is most of them).
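For illustration, here's one rough, untested sketch of that kind of generic scikit-learn setup, assuming a `pudl_out` output object like the one used later in this thread; the particular feature columns and model are just placeholders, not a recommendation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

frc = pudl_out.frc_eia923()
categorical = ["state", "energy_source_code", "contract_type_code"]
numeric = ["fuel_received_units", "fuel_mmbtu_per_unit"]

# Train and validate only on deliveries that actually report a price
known = frc.dropna(subset=categorical + numeric + ["fuel_cost_per_mmbtu"])
X = known[categorical + numeric].copy()
X["report_year"] = known.report_date.dt.year
y = known["fuel_cost_per_mmbtu"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Pipeline([
    ("encode", ColumnTransformer(
        [("cats", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )),
    ("reg", RandomForestRegressor(n_estimators=50, n_jobs=-1)),
])
model.fit(X_train, y_train)
print(f"R^2 on held-out deliveries: {model.score(X_test, y_test):0.3f}")
```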
I guess we can also easily check whether the price per physical unit or the price per MMBTU is more consistent just by plotting the distributions of observed prices in the dataset per date / state / fuel. It seems like whichever variable has less dispersion would be the easier one to estimate accurately. But curious what @katie-lamb and @TrentonBush think with their stats background.
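One low-effort way to make that comparison without plotting everything, assuming `frc` is the `frc_eia923` output dataframe: derive a price per physical unit from the columns we already have and compare a simple dispersion measure (here the coefficient of variation) within each date/state/fuel group. This is just a sketch of the idea, not something that's been run.

```python
# Price per physical unit (ton, barrel, mcf) derived from existing columns
frc["fuel_cost_per_unit"] = frc["fuel_cost_per_mmbtu"] * frc["fuel_mmbtu_per_unit"]

grouped = frc.groupby(["report_date", "state", "fuel_type_code_pudl"])
# Coefficient of variation (std / mean) is unit-free, so the two measures are comparable
cv = grouped[["fuel_cost_per_unit", "fuel_cost_per_mmbtu"]].agg(lambda s: s.std() / s.mean())
print(cv.median())  # typical within-group dispersion for each price measure
```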
Maybe @priyald17 or @joshdr83 would have suggestions on how best to approach this too.
For context here's the fuel_receipts_costs_eia923 table on Datasette and the field we're trying to estimate is `fuel_cost_per_mmbtu`. There are a bunch of additional attributes associated with both the `plant_id_eia` and the `mine_id_pudl`, including the lat/lon of the plants and the county FIPS code associated with many of the coal mines, that could also be used to construct features.
To clarify the problem:
Is that right?
I see two separate parts to this:
Relatedly, the first few scoping questions I'd ask are:
Ideally we test the validity, but remember we do have the option to just "assume and caveat" if the stakes are low. Plus, if EIA has done their job, there won't be a good way to test it.
With respect to modelling, I'd let the "what are the stakes" question guide the degree of complexity (simple averages vs fancier models). I'm not familiar with how redaction works (are the same plants always redacted? Does it come and go? etc), but in principle I think we have everything we need to construct a well formulated ML solution, if that is warranted.
I wrote down these questions that are pretty similar to what Trenton asks above:
If the redacted data is kind of "one off" and there is still enough representative data, then doing a regression to estimate prices for each date, state, and fuel type could definitely work. But I also agree that we have all the tools and data needed to do something more comprehensive if it seems that the missing data can't be represented that simplistically.
Although I think I got a little lost at some point about the construction of `fuel_type_code_pudl` vs. `energy_source_code` and price per MMBTU vs. price per unit, I think it makes sense to do some feature selection with the attributes of the fuel receipts cost table + `energy_source_code` and attributes from related tables for each of the potential price attributes that we may try and estimate, instead of assuming that nearest states or something is the best estimator. The scores from this feature selection could also reveal something about which price attribute is best to estimate. Maybe the k best features are similar for each estimated variable, or maybe there's one variable that stands out as more easily modeled.
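A rough sketch of that kind of feature scoring, assuming `frc` is the `frc_eia923` output dataframe; the candidate feature list and the ordinal encoding of the categorical columns are illustrative choices only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OrdinalEncoder

categorical = ["state", "energy_source_code", "contract_type_code"]
numeric = ["fuel_received_units", "fuel_mmbtu_per_unit"]

known = frc.dropna(subset=categorical + numeric + ["fuel_cost_per_mmbtu"])
X = np.hstack([
    OrdinalEncoder().fit_transform(known[categorical]),
    known[numeric].to_numpy(),
])
y = known["fuel_cost_per_mmbtu"]

# Mutual information scores: higher means more informative about the fuel price
scores = mutual_info_regression(
    X, y, discrete_features=[True, True, True, False, False], random_state=0
)
print(dict(zip(categorical + numeric, scores.round(3))))
```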
The `fuel_receipts_costs_eia923` table has about 560,000 records in it, and about 67% of them have reported fuel prices.

The distribution of missing values is skewed towards the early years (2008-2011) before filling with the API:

After filling with the API it's pretty uniform across all the years:
What do you think would be a good way to evaluate the dispersion of fuel prices? It seems like there are at least 3 dimensions to consider in sampling and estimating fuel prices:

- space: individual plants vs. states vs. larger regions,
- time: months vs. years,
- fuel specificity: broad fuel groups vs. individual `energy_source_code` values.

Using the more granular options (plants, months, `energy_source_code`) seems like it should give us lower dispersion results, but it'll also reduce the sample sizes pretty dramatically, reducing confidence in the number we get out. For example, here are some state-level distributions for a single year and a single energy source code. I chose a mix of big states (CA, FL, NY, TX), medium states (CO, WY), and small states (ID, ME). `NG` and `SUB` are the most common fuel types.

NG (2020):

SUB (2020):

Rather than just counting individual fuel deliveries up, would it make more sense to weight them by the total heat content of the fuel delivered?
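To make the weighting question concrete, here's a small sketch (again assuming `frc` is the `frc_eia923` output dataframe) comparing a plain per-delivery average with an average weighted by the total heat content of each delivery:

```python
frc["fuel_mmbtu_total"] = frc["fuel_received_units"] * frc["fuel_mmbtu_per_unit"]
frc["fuel_cost_total"] = frc["fuel_cost_per_mmbtu"] * frc["fuel_mmbtu_total"]

# Only use deliveries that actually report a price and a heat content
reported = frc.dropna(subset=["fuel_cost_per_mmbtu", "fuel_mmbtu_total"])
grouped = reported.groupby(["state", "report_date", "fuel_type_code_pudl"])
per_delivery = grouped["fuel_cost_per_mmbtu"].mean()
# MMBTU-weighted: total dollars spent divided by total heat content delivered
mmbtu_weighted = grouped["fuel_cost_total"].sum() / grouped["fuel_mmbtu_total"].sum()
print((per_delivery - mmbtu_weighted).abs().describe())
```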
The following is ONLY for gas prices (filtering not shown)
I'm sure we could tighten that much further with a better model. I'm just not sure what our goals should be here without an application in mind. Maybe a reasonable next step would be to throw a low-effort, high-power model like XGBoost at it and see what kind of accuracy is possible.
# I didn't bother with any train-test split here so these error values should be considered best-case indicators
frc['global_median_fuel_price'] = frc['fuel_cost_per_mmbtu'].median()
frc['month_median_fuel_price'] = frc.groupby('report_date')['fuel_cost_per_mmbtu'].transform(np.median)
frc['state_month_median_fuel_price'] = frc.groupby(['state', 'report_date'])['fuel_cost_per_mmbtu'].transform(np.median)
error = frc.loc[:, [
'global_median_fuel_price',
'month_median_fuel_price',
'state_month_median_fuel_price',
]
].sub(frc['fuel_cost_per_mmbtu'], axis=0)
pct_error = error.div(frc['fuel_cost_per_mmbtu'], axis=0)
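For reference, one way to summarize those baselines is the distribution of absolute relative error for each aggregation (still an in-sample, best-case number, as noted in the comment at the top of the block):

```python
summary = pct_error.abs().describe(percentiles=[0.5, 0.9])
print(summary.loc[["50%", "90%"]])
```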
To first address the goal of removing the EIA API calls from the CI asap:
Do you have an accuracy metric on how well the most granular option (plants, months, energy_source_code) performs against the API results? Or how swapping in state level instead of plant affects accuracy? I think for now you could just sub in the median of whatever combo of the three dimensions gets the highest API accuracy. It seems like it might be state, month, energy code. Although @zaneselvans mentioned there are states with no data (I think you actually said a state-level average is not available), so does this necessitate multi-state aggregations?
After this quick fix, I think looking at what other features we could bring into a more powerful model would be useful.
Are the API results our target or are the redacted values our target? I thought the API was basically just an outsourced imputation model. If it includes redacted data in its aggregates then it has value for validation. Is there anything more to the EIA API?
I believe you're right and the API is just modeling these values. After talking to Zane, I think to get a quick fix for today/tomorrow the API results are our target since people currently using the table don't want the values to change significantly. But to develop a more accurate and comprehensive solution in the less short term, the redacted values are indeed the target.
Oh I didn't know there was a timeline on this at all, much less tomorrow! Do what you gotta do
LOL ya I don't think this discussion is really framed around a timeline and is in fact more long term ideating of modeling tactics. My comment was really just spurred by annoyance that the CI is unreliable and wanting something to stand in there while a better solution is developed.
Hey sorry, just tuning in! I did a spatial interpolation for fuel prices to average down to the county level for new build estimates in this paper. Is it mostly the non-ISO regions that are short of data?
The output gets a `filled_by` column indicating what method was used to fill it in if it was missing.

from collections import OrderedDict
import scipy.stats
cols = [
"report_date",
"plant_id_eia",
"state",
"energy_source_code",
"fuel_cost_per_mmbtu",
"fuel_group_code",
"fuel_consumed_mmbtu", # required for weighting
]
CENSUS_REGIONS = {
"ENC": ["WI", "MI", "IL", "IN", "OH"],
"ESC": ["KY", "TN", "MS", "AL"],
"MAT": ["NY", "PA", "NJ"],
"MTN": ["MT", "ID", "WY", "NV", "UT", "CO", "AZ", "NM"],
"NEW": ["ME", "VT", "NH", "MA", "CT", "RI"],
"PCC": ["CA", "OR", "WA"],
"PCN": ["AK", "HI"],
"SAT": ["WV", "MD", "DE", "DC", "VA", "NC", "SC", "GA", "FL"],
"WNC": ["ND", "SD", "MN", "IA", "NE", "KS", "MO"],
"WSC": ["OK", "AR", "TX", "LA"],
}
CENSUS_REGION_MAPPING = {state: region for region, states in CENSUS_REGIONS.items() for state in states}
aggs = OrderedDict({
# The most precise estimator we have right now
"state_esc_month": {
"agg_cols": ["state", "energy_source_code", "report_date"],
"fuel_group_code": None,
},
# Good for coal, since price varies much more with location than time
"state_esc_year": {
"agg_cols": ["state", "energy_source_code", "report_year"],
"fuel_group_code": "coal"
},
# Good for oil products, because prices are consistent geographically
"region_esc_month": {
"agg_cols": ["census_region", "energy_source_code", "report_date"],
"fuel_group_code": "petroleum",
},
# Less fuel specificity, but still precise date and location
"state_fgc_month": {
"agg_cols": ["state", "fuel_group_code", "report_date"],
"fuel_group_code": None,
},
# Less location and fuel specificity
"region_fgc_month": {
"agg_cols": ["census_region", "fuel_group_code", "report_date"],
"fuel_group_code": None,
},
"region_fgc_year": {
"agg_cols": ["census_region", "fuel_group_code", "report_year"],
"fuel_group_code": None,
},
"national_esc_month":{
"agg_cols": ["energy_source_code", "report_date"],
"fuel_group_code": None,
},
"national_fgc_month": {
"agg_cols": ["fuel_group_code", "report_date"],
"fuel_group_code": None,
},
"national_fgc_year": {
"agg_cols": ["fuel_group_code", "report_year"],
"fuel_group_code": None,
},
})
frc = (
pudl_out.frc_eia923()
.loc[:, cols]
.assign(
report_year=lambda x: x.report_date.dt.year,
census_region=lambda x: x.state.map(CENSUS_REGION_MAPPING),
fuel_cost_per_mmbtu=lambda x: x.fuel_cost_per_mmbtu.replace(0.0, np.nan),
filled=lambda x: x.fuel_cost_per_mmbtu,
)
)
frc["filled_by"] = pd.NA
frc.loc[frc.fuel_cost_per_mmbtu.notna(), ["filled_by"]] = "original"
for agg in aggs:
    agg_cols = aggs[agg]["agg_cols"]
    fgp = aggs[agg]["fuel_group_code"]
    logger.info(f"Aggregating fuel prices by {agg}")
    frc[agg] = (
        frc.groupby(agg_cols)["fuel_cost_per_mmbtu"]
        .transform("median")  # Median rather than mean, to limit the impact of right-skewed outliers
    )
    logger.info(f"Calculating error for {agg}")
    frc[agg + "_err"] = (frc[agg] - frc.fuel_cost_per_mmbtu) / frc.fuel_cost_per_mmbtu
    logger.info(f"Filling fuel prices with {agg}")
    if fgp:
        logger.info(f"Filling only {fgp}")
    mask = (
        (frc.filled.isna())
        & (frc[agg].notna())
        & (frc.fuel_group_code == fgp if fgp else True)
    )
    logger.info(f"Mask selects {sum(mask)} values.")
    frc.loc[mask, "filled_by"] = agg
    frc.loc[mask, "filled"] = frc.loc[mask, agg]
frc.filled_by.value_counts()
original 373841
state_esc_month 127336
region_fgc_month 30420
national_esc_month 10091
region_esc_month 9823
state_fgc_month 6836
national_fgc_month 1079
region_fgc_year 702
state_esc_year 232
Name: filled_by, dtype: int64
Generally these lined up with my intuition about which kinds of aggregation would introduce more error (as represented by the interquartile range of the error distribution). The ordering is different for the different fuel types generally though, and we could just automatically apply all of them on a per-fuel basis, in order of increasing IQR. The impact would be pretty marginal, but it would also do the right thing if future data looked different. We could also break out each year separately and apply that process, so that if the nature of the fuel price correlations changes over time, it would capture that and do the right thing each year (e.g. if lots of gas pipelines get built, maybe regional variation in gas prices would go away, and in those later years averaging across large areas would have much less impact than it does now...).
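A sketch of what that automatic per-fuel ordering might look like, reusing the `*_err` columns and the `aggs` dict from the code above (this isn't implemented, just the idea):

```python
import scipy.stats

iqr_by_fuel = {
    fgc: {
        agg: scipy.stats.iqr(
            frc.loc[frc.fuel_group_code == fgc, agg + "_err"], nan_policy="omit"
        )
        for agg in aggs
    }
    for fgc in ["coal", "petroleum", "natural_gas"]
}
# Aggregations ordered from least to most error-prone within each fuel group,
# which is the order we'd apply them in when filling missing prices.
fill_order = {fgc: sorted(errs, key=errs.get) for fgc, errs in iqr_by_fuel.items()}
```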
matplotlib.rcParams["figure.figsize"] = (10, 9)
for fgc in ["coal", "petroleum", "natural_gas"]:
    fig, axes = plt.subplots(ncols=3, nrows=3, sharex=True, sharey=True)
    for agg, ax in zip(aggs, axes.flatten()):
        logger.info(f"Plotting {agg} error")
        error = frc.loc[frc.fuel_group_code == fgc, agg + "_err"]
        ax.hist(error[(error != 0.0)], range=(-1, 1), bins=200)
        iqr = scipy.stats.iqr(error, nan_policy="omit")
        ax.set_title(f"{agg} (iqr={iqr:0.3})")
        ax.axvline(x=0, lw=1, ls="--", color="black")
    fig.suptitle(f"Fuel Group: {fgc}", fontsize=16)
    fig.tight_layout()
    plt.show();
To see how different the filled values are from the reported values, and whether there's an obvious bias being introduced, I made the plots below. It looks like maybe there's a sliiiiight rightward skew in the filled values for coal and natural gas, but not sure it's significant.
matplotlib.rcParams["figure.figsize"] = (10, 4)
fig, axes = plt.subplots(ncols=3)  # , sharex=True, sharey=True)
fuel_group_codes = ["coal", "petroleum", "natural_gas"]
for fgc, ax in zip(fuel_group_codes, axes):
    ax.hist(
        frc.loc[frc.fuel_group_code == fgc, "fuel_cost_per_mmbtu"],
        label="reported",
        alpha=0.5,
        range=(0, 40),
        bins=80,
        density=True,
    )
    ax.hist(
        frc.loc[(frc.fuel_group_code == fgc) & (frc.fuel_cost_per_mmbtu.isna()), "filled"],
        label="filled",
        alpha=0.5,
        range=(0, 40),
        bins=40,
        density=True,
    )
    ax.legend(loc="upper right")
    ax.set_title(f"Fuel Group: {fgc}")
    ax.set_xlabel("Fuel Price [$/MMBTU]")
fig.tight_layout()
plt.show()
To compare the filled in values from the API vs. our aggregations for each fuel group and aggregation method, I made this plot.
sns.relplot(
data=frc[
(frc.fuel_group_code.isin(["coal", "petroleum", "natural_gas"]))
& (frc.fuel_cost_from_eiaapi)
& (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
],
x="fuel_cost_per_mmbtu",
y="filled",
row="fuel_group_code",
col="filled_by",
facet_kws=dict(xlim=(0,40), ylim=(0,40)),
linewidth=0.0,
alpha=0.1,
s=5,
)
Here the x-axes are the API based fuel price, and the y-axes are the aggregation based fuel prices. Columns are the aggregation methods, and rows are the fuel groups. In the aggregation methods fgc = fuel group code, and esc = energy source code (the more specific fuels)
For the `petroleum` and `natural_gas` fuel groups, there's a good 1:1 correlation between the API and aggregation based prices, but it's harder to see what's going on with coal because the points are all bunched together, so I zoomed in on those prices only:
sns.relplot(
data=frc[
(frc.fuel_group_code.isin(["coal"]))
& (frc.fuel_cost_from_eiaapi)
& (~frc.filled_by.isin(["state_esc_year", "region_fgc_year"]))
],
x="fuel_cost_per_mmbtu",
y="filled",
col="filled_by",
col_wrap=3,
facet_kws=dict(xlim=(0,8), ylim=(0,8)),
linewidth=0.0,
alpha=0.1,
s=5,
)
For the state-level aggregations, there's some scatter, but generally a correlation between the API and aggregation based fuel prices. However, for the regional and national averages, it seems like there's a more substantial difference between the two methods. Some of the variability that's present in the API results is getting flattened by these larger spatial averages.
This isn't to say that the new method is necessarily worse, since the two approaches use somewhat different fuel type categorizations, given the fuel grouping differences between PUDL and EIA, but it does suggest that there's some benefit to be had by refining the imputation further, especially since one of our main applications of this data is to identify unusually expensive coal plants for advocates. We also have the most additional information about the coal deliveries, since we usually know the mine name, what state it's in, maybe a unique ID for the mine, the supplier name, the level of ash and sulfur in the coal, what kind of contract it was delivered on, etc.
Here are the proportions of coal records that are being filled in with the different aggregations:
frc[frc.fuel_group_code=="coal"].filled_by.value_counts() / sum(frc.fuel_group_code=="coal")
original 0.713344
state_esc_month 0.144066
region_fgc_month 0.067036
national_esc_month 0.042347
state_fgc_month 0.026690
national_fgc_month 0.004286
state_esc_year 0.001138
region_fgc_year 0.001093
About 11% of all coal records end up getting regional or national average prices assigned to them, and that's a little more than 1/3 of the total coal records which are being filled in.
I'm inclined to say that this is good enough for now, but that because these are probably some of the records we care most about, it will be worth coming back to this issue and trying to use a more generic model like XGBoost or a random forest that can utilize the other information we have.
What do y'all think?
I was ready to say "yes this is good enough", especially considering the results were so similar to our previous method of EIA API values. But now I'm confused!
> Here the x-axes are the API based fuel price
I think I'm not understanding exactly what these plots are -- I thought your scatter plots near the top of this issue showed that API vs state-month average were extremely linear. Why are the `state_fgc_month` results so much different here? I'd be surprised if changing the agg func from mean to median made that much difference. If it did, that's interesting to know!
@TrentonBush The earlier scatter plots are comparing all the reported data -- so the ones where there actually was data in the FRC table, and they're only being aggregated by [state, month, fuel_group].

The more recent scatter plots are only looking at data points that were not present in the FRC table, and comparing the values which were filled in by our new method (breaking it out into all the different kinds of aggregation used) vs. the API values. So it's not surprising that the correlation is worse in general.
Oh ok, now if I squint I can see how they are related -- basically remove the line and just look at the dispersion, plus add some x-coordinate noise to account for mean to median. Ya this model makes sense to me. I think using the median is justified given the handful of big outliers I ran into. We definitely want to use the mean for the purpose of disaggregating the API results in the future though.
> one of our main applications of this data is to identify unusually expensive coal plants for advocates
That's an awesome use case I'd love to learn more about! There is certainly more accuracy to wring out of this data, and that kind of anomaly detection would probably require it. Maybe we find a day or two next sprint to try other models
Yeah, there are some wild outliers. I want to figure out how to remove them and fill them in with reasonable values. Some of them are out of bounds by factors of like 1000x -- natural gas is particularly messy. The mean values were quite skewed.
Given that the values aren't really normally distributed, would it still make sense to replace anything more than 3-4$\sigma$ away from the mean? I could drop the top and bottom 0.1% of the distribution too, but there aren't always big outliers, and sometimes more than 0.1% of the points are bad. Not sure what the right thing to do is.
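For what it's worth, the top-and-bottom 0.1% option is easy to prototype. A sketch, assuming `frc` is the `frc_eia923` output dataframe and using the 0.1% thresholds purely as placeholders:

```python
grouped = frc.groupby("fuel_group_code")["fuel_cost_per_mmbtu"]
lo = grouped.transform(lambda s: s.quantile(0.001))
hi = grouped.transform(lambda s: s.quantile(0.999))
# Keep in-range prices; leave missing prices alone so they can still be filled
keep = ((frc.fuel_cost_per_mmbtu >= lo) & (frc.fuel_cost_per_mmbtu <= hi)) | frc.fuel_cost_per_mmbtu.isna()
trimmed = frc[keep]
```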
I'm doing some additional normalization to the table and moving the `fuel_group_code` to the `energy_sources_eia` table, since the value is entirely determined by `energy_source_code`, with this mapping:
assert (frc.groupby("energy_source_code").fuel_group_code.nunique() == 1).all()
dict(frc.groupby("energy_source_code").fuel_group_code.first())
{'BFG': 'other_gas',
'BIT': 'coal',
'DFO': 'petroleum',
'JF': 'petroleum',
'KER': 'petroleum',
'LIG': 'coal',
'NG': 'natural_gas',
'OG': 'other_gas',
'PC': 'petroleum_coke',
'PG': 'other_gas',
'RFO': 'petroleum',
'SC': 'coal',
'SGP': 'other_gas',
'SUB': 'coal',
'WC': 'coal',
'WO': 'petroleum'}
I'm also renaming the column to `fuel_group_eiaepm` since it's the fuel grouping that's used for reporting in the EIA Electric Power Monthly.
> That's an awesome use case I'd love to learn more about!
This is basically the RMI ReFi project! The fuel is by far the biggest component of the OpEx for coal plants, and this data has been used to identify which plants are vulnerable to getting refinanced out of existence.
Okay, definitely some room for improvement. E.g. look at how coal prices in NY drop dramatically (in the filled in data) starting in 2012 here:
def facet_fuel(
    frc: pd.DataFrame,
    states: list[str],
    fuel_facet: Literal["energy_source_code", "fuel_group_eiaepm", "fuel_type_code_pudl"],
    by: Literal["fuel", "state"] = "state",
    fuels: list[str] | None = None,
    max_price: float = 100.0,
    facet_kws: dict | None = None,
) -> None:
    mask = (
        (frc.state.isin(states))
        & (frc.fuel_cost_per_mmbtu <= max_price)
        & (frc[fuel_facet].isin(fuels) if fuels else True)
    )
    facet = fuel_facet if by == "state" else "state"
    sns.relplot(
        data=frc[mask],
        x="report_date",
        y="fuel_cost_per_mmbtu",
        hue=facet,
        style=facet,
        row=fuel_facet if by == "fuel" else "state",
        kind="line",
        height=4,
        aspect=2.5,
        facet_kws=facet_kws,
    );

facet_fuel(
    frc,
    fuel_facet="fuel_group_eiaepm",
    states=["CA", "NY", "FL", "TX"],
    fuels=["coal", "natural_gas", "petroleum"],
    by="fuel",
    facet_kws={"sharey": False},
)
I ran `tox -e nuke` to generate a new database with the slight schema changes and validate its contents, and ended up getting the `capacity_mw` and row count errors that @katie-lamb was encountering after the `date_merge` updates, so I merged the changes from her current PR (#1695) and re-ran the validations, but I'm still getting a bunch of row count errors, so maybe they really are due to something that changed here?
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 214669 rows, expected 214613. Off by 0.026%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 213382 rows, expected 213329. Off by 0.025%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 1964 rows, expected 1961. Off by 0.153%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 2492441 rows, expected 2491704. Off by 0.030%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 23498 rows, expected 23445. Off by 0.226%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
It's not clear off the top of my head why the changes in this PR would result in small changes to the number of rows in these other tables though. Does that make sense to anyone else?
Comparing old expectations, new expectations, and the rows we've got here...
table | observed rows | old rows | date_merge rows | matches |
---|---|---|---|---|
gens_eia860_raw | 491,469 | 491,469 | 490,824 | old |
gens_eia860_annual | 491,469 | 491,469 | 490,824 | old |
gens_eia860_monthly | 491,469 | 491,469 | 490,824 | old |
gf_eia923_raw | 2,507,830 | 2,507,870 | 2,507,040 | neither |
gf_eia923_annual | 214,669 | 214,709 | 214,613 | neither |
gf_eia923_monthly | 2,507,830 | 2,507,870 | 2,507,040 | neither |
gf_nonuclear_eia923_annual | 213,382 | 213,422 | 213,329 | neither |
gf_nonuclear_eia923_monthly | 2,492,441 | 2,492,481 | 2,491,704 | neither |
gf_nuclear_eia923_annual | 1,964 | 1,964 | 1,961 | old |
gf_nuclear_eia923_monthly | 23,498 | 23,498 | 23,445 | old |
plants_eia860_raw | 171,570 | 171,570 | 171,148 | old |
plants_eia860_annual | 171,570 | 171,570 | 171,148 | old |
plants_eia860_monthly | 171,570 | 171,570 | 171,148 | old |
pu_eia860_raw | 170,792 | 170,792 | 170,370 | old |
pu_eia860_annual | 170,792 | 170,792 | 170,370 | old |
pu_eia860_monthly | 170,792 | 170,792 | 170,370 | old |
utils_eia860_raw | 113,464 | 113,464 | 113,357 | old |
utils_eia860_annual | 113,464 | 113,464 | 113,357 | old |
utils_eia860_monthly | 113,464 | 113,464 | 113,357 | old |
More weirdness.
- The number of FRC records filled with each of the different aggregations seems a little different now than it was when I was originally developing the process in a notebook.
- In the `pudl.output.fuel_receipts_costs_eia923()` function, only one of the national aggregations has any records associated with it.
- A bunch of records disappear when `date_merge()` is run on the FRC output dataframe near the end of the output function. Need to figure out what those records are and why.
- Some `plant_id_eia` values don't have states associated with them in the `plants_entity_eia` table, which seems pretty weird. They all start with 88, and their plant records are almost entirely null.

The lack of states is surprising, but seems like that's just how it is. The loss of a bunch of records in the `date_merge()` process is confusing. @katie-lamb and @cmgosnell could we talk about what might be going on there? I'm wondering if it might be due to the fact that the FRC table isn't a "time series" in that the `report_date` field isn't part of a primary key -- it's just a bunch of transactions, and there are some valid duplicate records. We use an autogenerated pseudo-key in the database. I'll compile a dataframe of the records that are getting lost.
After running the ETL, I updated the row counts and merged that PR into dev so that should fix some of your issues @zaneselvans
Ah okay so some of the row counts in your branch weren't actually the expected ones?
@katie-lamb any intuition as to why `merge_date()` would be causing ~11,000 rows to get dropped from the FRC table?
Comparing the records that exist before and after `date_merge()`, it looks like all of the records for plants that are missing a `state` field (10,511) are getting dropped for some reason, plus 528 additional records that do have states associated with them. Note that the plant IDs that don't have states are also overwhelmingly NULL in the entity tables, so if there's any additional merging that's happening on other columns in there maybe that would mess things up?

Are these valid duplicate records that are getting treated strangely? If I list all of the duplicate records in the FRC table that I've pulled directly from the DB, it gives me 3510 of them, and all but one of them has a state, so it doesn't seem to be that.
In the same method chain where `date_merge()` is happening there's also a `dropna(subset=["utility_id_eia"])` -- I bet that's where the problem is. The plants that don't have states are also missing almost everything else, so probably they're missing utility IDs. And then a smattering of the other plants just happen to be missing utility IDs for some other reason.

Counting the rows in the dataframe before / after that `dropna()` specifically, that's indeed where the 11,000 records drop out. Whoops.
For the purposes of the FRC table is it absolutely necessary that we have the utility ID? The data is really about plants, fuels, and dates, so it seems like this isn't strictly required. If I remove it what happens...
In that case, we keep all the FRC records, but all of the fields which are brought in by merging with the results of `pudl.output.eia860.plants_utils_eia860()` are NA (plant name, PUDL ID, etc.), which may cause other unexpected problems downstream, so this probably isn't ideal. The real issue is with how `plants_utils_eia860()` is working in cases where there are some missing ID fields, so I think I should make an issue to deal with this... #1700
Looking for ways to identify outliers in not-so-normal distributions I came across the modified z-score, which is analogous to the z-score in normal distributions, but instead of looking at how many standard deviations away from the mean a data point is, it looks at how many Median Absolute Deviations (MADs) away from the median it is:
mod_zscore = abs(0.6745 * (data - data.median()) / data.mad())
This seems like the kind of measure we want.
In messing around with this I also ended up looking at the median (aka delivery weighted median) vs. (mmbtu) weighted median values of fuel prices in the FRC table, and they're pretty different. I think this is because crazy expensive fuel deliveries tend to be very small -- so they make up a disproportionate number of deliveries (records) compared to how much heat content they represent. The delivery weighted median is $3.25 / MMBTU but the MMBTU weighted median is $2.09 / MMBTU.
A weighted median function that can be used with `df.groupby().apply()`:
def weighted_median(df: pd.DataFrame, data: str, weights: str) -> float:
    df = df.dropna(subset=[data, weights])
    s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))
    midpoint = 0.5 * np.sum(s_weights)
    if any(df[weights] > midpoint):
        # A single record outweighs all the others combined, so its value is the weighted median
        w_median = df[data][df[weights] == df[weights].max()].array[0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx:idx + 2])
        else:
            w_median = s_data[idx + 1]
    return w_median
Grouping by `report_year` and `fuel_group_eiaepm` and calculating the modified z-score within each group, we get a pretty clean normal-ish distribution, with some long straggling tails that I think we can probably cut off.
DATA_COLS = [
"plant_id_eia",
"report_date",
"energy_source_code",
"fuel_cost_per_mmbtu",
"fuel_received_units",
"fuel_mmbtu_per_unit",
]
GB_COLS = ["report_year", "fuel_group_eiaepm"]
MAX_MOD_ZSCORE = 5.0
logger.info("Query PUDL DB")
frc_db = pd.read_sql("fuel_receipts_costs_eia923", pudl_engine, columns=DATA_COLS)
plant_states = pd.read_sql("SELECT plant_id_eia, state FROM plants_entity_eia", pudl_engine)
fuel_group_eiaepm = pd.read_sql("SELECT code AS energy_source_code, fuel_group_eiaepm FROM energy_sources_eia", pudl_engine)
logger.info("Join tables and calculate derived values")
frc_db = (
frc_db
.merge(plant_states, on="plant_id_eia", how="left", validate="many_to_one")
.merge(fuel_group_eiaepm, on="energy_source_code", how="left", validate="many_to_one")
.assign(
report_year=lambda x: x.report_date.dt.year,
fuel_mmbtu_total=lambda x: x.fuel_received_units * x.fuel_mmbtu_per_unit,
)
.pipe(apply_pudl_dtypes, group="eia")
)
logger.info("Calculate weighted median fuel price")
weighted_median_fuel_price = (
frc_db
.groupby(GB_COLS)
.apply(weighted_median, data="fuel_cost_per_mmbtu", weights="fuel_mmbtu_total")
)
weighted_median_fuel_price.name = "weighted_median_fuel_price"
weighted_median_fuel_price = weighted_median_fuel_price.to_frame().reset_index()
frc_db = frc_db.merge(weighted_median_fuel_price, how="left", on=GB_COLS, validate="many_to_one")
logger.info("Calculate weighted fuel price MAD")
frc_db["delta"] = abs(frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price)
frc_db["fuel_price_mad"] = frc_db.groupby(GB_COLS)["delta"].transform("median")
logger.info("Calculate modified z-score")
frc_db["fuel_price_mod_zscore"] = (0.6745 * (frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price) / frc_db.fuel_price_mad).abs()
fraction_outliers = sum(frc_db.fuel_price_mod_zscore > MAX_MOD_ZSCORE) / len(frc_db)
logger.info(f"Modified z-score {MAX_MOD_ZSCORE} labels {fraction_outliers:0.2%} of all prices as outliers")
plt.hist(frc_db.fuel_price_mod_zscore, bins=80, range=(0,8), density=True)
plt.title("Modified z-score when grouped by fuel and year");
sns.displot(
data=frc_db[frc_db.fuel_group_eiaepm.isin(["coal", "natural_gas", "petroleum"])],
kind="hist",
x="fuel_price_mod_zscore",
row="fuel_group_eiaepm",
bins=80,
binrange=(0,8),
height=2.5,
aspect=3,
stat="density",
linewidth=0,
);
I think I have weighted, modified z-score outlier detection working, but am wondering how to apply it appropriately. Two options:

- Calculate the `wmod_zscore` for high-level aggregations (like year, fuel group) at the beginning and then let each of the subsequent aggregations fill in the NAs.
- Calculate the `wmod_zscore` for each individual aggregation as we go, so that "outliers" are being identified relative to the distribution whose median value would be getting used to fill in missing / outlying values.

The latter sounds more correct, but some of the groupings won't have a large number of records in them.
Also: what's the right `wmod_zscore` value to use as the cutoff for identifying outliers? At `wmod_zscore == 5.0` it identifies about 1.5% of all records as outliers, mostly natural gas. Those outlier records account for only 0.2% of all reported MMBTU.

Outliers identified using modified z-score of aggregations, with `max_mod_zscore == 5.0`:
- `agg_mod_zscore=["report_year", "fuel_group_eiaepm"]`: 1.5% of records (0.2% of mmbtu). No empty groups.
- `agg_mod_zscore=["report_year", "fuel_group_eiaepm", "census_region"]`: 1.73% of records (0.37% of mmbtu). A few empty groups.
- `agg_mod_zscore=["report_year", "fuel_group_eiaepm", "state"]`: 1.89% of records (0.74% of mmbtu). Lots of empty groups.
- `agg_mod_zscore=["report_date", "fuel_group_eiaepm", "census_region"]`: 2.18% of records (0.89% of mmbtu). Lots of empty groups.

"Robust" outlier detection metrics are great! Their downsides are that 1) they can be too aggressive, especially on skewed data, and 2) as you've noticed, they are slow because of all the O(n log n) sorting.
I'm of two minds here. On one hand, it'd be fun and beneficial to tune this model for both accuracy and performance. On the other, we're considering replacing all this with a better model anyway, so maybe we should wait on any tuning until we make that determination. This outlier detection problem probably fits in better the with "improve the imputation model" tasks because both imputation and outlier detection might use the same model. Also, outlier detection is a scope increase from the original "remove the EIA API" goals. The very linear plots of state-month averages vs EIA API imply that EIA isn't doing much outlier removal, which means our users have been using data with outliers averaged in the whole time. I think it makes sense to change things once and well rather than possibly twice.
I'd start by seeing if we can un-skew the price distributions before (and maybe instead of) applying the robust metrics. A variance-stabilizing transform like log or Box-Cox might make the data "normal" enough to use regular z-scores.
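As a minimal sketch of that idea, using the `frc_db` dataframe from the z-score exploration above (the log transform and the 4.0 cutoff are placeholders to illustrate, not tuned values):

```python
import numpy as np

prices = frc_db["fuel_cost_per_mmbtu"].where(frc_db["fuel_cost_per_mmbtu"] > 0)
log_price = np.log(prices)
grouped = log_price.groupby([frc_db["report_year"], frc_db["fuel_group_eiaepm"]])
# Plain z-score on the log-transformed prices within each year/fuel group
zscore = (log_price - grouped.transform("mean")) / grouped.transform("std")
outliers = zscore.abs() > 4.0
print(f"{outliers.mean():0.2%} of records flagged as outliers")
```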
Second, have we checked what the minimum sample sizes are for some of these bins? As far as I can tell, the existing method calculates a bunch of types of aggregates and iteratively fills in NaN values in order of model precision. Sometimes a key combination like `["Rhode Island", "2008-01-01", "Coal"]` might be empty and produce `np.nan`, which is later filled by a coarser aggregate, but other times it might only have a handful of plants and produce weird metrics as a result. Now that we're using this model for both prediction and outlier detection, it could cause problems in both.
Finally, the model we've built here is basically a decision tree, just manually constructed instead of automated. Problems like choosing the right grain of aggregation or enforcing a minimum sample size per bin are taken care of automatically by those algorithms!
Medians are always going to be slower, but the line

`s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))`

jumped out at me because of the use of python iterators on pandas/numpy objects. I think we could probably get the same result faster with something involving `df.loc[:, [data, weights]].sort_values(data)`, `df.loc[:, weights].cumsum()`, and `np.searchsorted`.
I will admit that

`s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))`

was just something I adapted from an example which was operating on lists or numpy arrays, and I didn't completely parse what it is doing to try and re-write it intelligently... like what is going on with the zip inside a zip and unpacking the sorted list. Is it obvious to you what it's doing? I'm not even sure what it means to sort a list of tuples necessarily.
The minimum sample size in some of these bins is definitely small, but it'll use the small sample if it can (even if that's a bad idea). Requiring a bigger minimum sample size would result in more frequently resorting to the less specific estimations, but with many more samples.
Should we have just gone straight to using one of the Swiss-army-knife regression models instead, and just started with a modest set of input columns as features, to be expanded / engineered more later? I haven't really worked with them so I don't know how they're finicky or what the pitfalls are.
> I will admit that
> was just something I adapted from an example
Let he who has not copied from StackOverflow cast the first stone 😂
I had no clue what sorting tuples did, but I just tried it and it sorts them by their first value. I think that pattern is designed for python objects: when given iterables representing columns, the inner `zip` links the values by row so that `sorted` doesn't break the connection between them, then the outer `zip` puts them back into columnar layout. Mapping `np.array` then turns each column into a separate array.

But pandas can accomplish the same thing with `df.sort_values(<my_column>)` and some indexing.
Also I just noticed the python `any()` function, which I think will also resort to python iteration. Using `df.any()` or `np.any()` will be much faster (relatively, probably not even 100ms difference at n=500k).
> Should we have just gone straight to using one of the Swiss-army-knife regression models instead
No, starting with groupby is much simpler, plus it matched the algorithm we were trying to replace. But once we start adding complexity like conditionally switching between groupbys and checking sample sizes, I think that effort is better spent on a model with a higher ceiling.
Okay here's a pandas-ified version
def weighted_median(df: pd.DataFrame, data: str, weights: str, dropna=True) -> float:
    if dropna:
        df = df.dropna(subset=[data, weights])
    if df.empty | df[[data, weights]].isna().any(axis=None):
        return np.nan
    df = df.loc[:, [data, weights]].sort_values(data)
    midpoint = 0.5 * df[weights].sum()
    if (df[weights] > midpoint).any():
        w_median = df.loc[df[weights].idxmax(), data]
    else:
        cs_weights = df[weights].cumsum()
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights.iloc[idx] == midpoint:
            w_median = df[data].iloc[idx : idx + 2].mean()
        else:
            w_median = df[data].iloc[idx + 1]
    return w_median
Old version: 4min 26s. New version: 5min 33s. 🤣
But they do produce exactly the same output!
Moving some bits from Slack to here for posterity.
I tried (notebook) a quick implementation of an XGBoost equivalent (lightGBM). Using the same state, month, fuel features as the groupby model, it gave very similar accuracy. After adding a few new features, it cut relative error roughly in half. Training/inference time was around 10 seconds total.
Cross validation shows that the groupby model overfits and performs a bit worse on unseen data. Also, note the difference in counts below. I checked whether the GBDT accuracy was different on the records that the groupby model could predict, but the only significant difference was that `other_gas` results (now only 9 of them) were much better.
Test set relative error for `groupby(['state', 'report_date', 'fuel_group_code'])` medians:
Test set results for GBDT. Features: ['state', 'report_date', 'fuel_group_code', 'contract_type_code', 'primary_transportation_mode_code', 'energy_source_code', 'fuel_received_units', 'latitude', 'longitude',]
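For the record, here's a rough sketch of that kind of GBDT setup (the actual notebook isn't reproduced here; the date encoding, hyperparameters, and the assumption that latitude/longitude have already been merged onto the FRC dataframe are all illustrative):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

features = [
    "state", "fuel_group_code", "contract_type_code",
    "primary_transportation_mode_code", "energy_source_code",
    "fuel_received_units", "latitude", "longitude",
]
known = frc[frc.fuel_cost_per_mmbtu.notna()].copy()
# Encode report_date as a monotonically increasing month count
known["months_elapsed"] = known.report_date.dt.year * 12 + known.report_date.dt.month
for col in features:
    if not pd.api.types.is_numeric_dtype(known[col]):
        known[col] = known[col].astype("category")

X = known[features + ["months_elapsed"]]
y = known["fuel_cost_per_mmbtu"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
rel_err = ((model.predict(X_test) - y_test) / y_test).abs()
print(f"Median relative error on the test set: {rel_err.median():0.2%}")
```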
I'm closing this because it's effectively metastasized into #1708
Instead of using the EIA API to pull monthly average fuel costs by state and fuel when individual fuel deliveries have their costs redacted in the `fuel_receipts_costs_eia923` table, calculate it for ourselves.

Motivation

This change will address several issues:

- We have a lot of information available in the `fuel_receipts_costs_eia923` table, and related to the plants and mines and suppliers involved. It should be possible to do a fairly good estimation of the fuel prices from scratch given all that context.

Approach

Choosing Aggregations

Intuitive Aggregation Priorities

1. `["state", "energy_source_code", "report_date"]`
2. `["state", "energy_source_code", "report_year"]`
3. `["census_region", "energy_source_code", "report_date"]`
4. `["state", "fuel_group_code", "report_date"]`
5. `["census_region", "fuel_group_code", "report_date"]`
6. `["census_region", "fuel_group_code", "report_year"]`

Questions:

Other Potential Refinements

Remaining tasks:

- Merge `plant_state` into the `fuel_receipts_costs_eia923` output table all the time.
- Swap the new aggregation based filling into the `fuel_receipts_costs_eia923` output routine.
- Remove the EIA API calls from `frc_eia923` and the `API_KEY_EIA` infrastructure from everywhere in the code, so we aren't unknowingly relying on it.
- Fix the `filled_by` labeling, which is now showing all filled values having `national_fgc_year`, which is the last aggregation.
- Remove `fuel_group_code` from the `fuel_receipts_costs_eia923` table, add it to the `energy_sources_eia` coding table, and add it back into the output function.
- Figure out why `merge_date()` is removing ~10k `frc_eia923` records.
- Once merged into `main`, remove `API_KEY_EIA` from the GitHub secrets.