Quantify and categorize bad data in FERC 714 demand time series

zaneselvans commented 4 years ago

Using the outlier / bad data identification tools from Tyler Ruggles, as improved by Greg Schivley, categorize and quantify any outlier / zero values that are showing up in the FERC 714 time series, so we can understand how important they are, and devise a strategy for imputing reasonable values.

zaneselvans commented 3 years ago

@ezwelty I cut-and-pasted the time series anomaly categorization routines that @truggles and @gschivley had pulled together into the pudl.analysis.demand_anomalies module which currently only exists on the demand-anomalies branch. You can find more details about what all they've been used to do in the EIA-930 use case at:

ezwelty commented 3 years ago

I have restructured Tyler Ruggles' anomalies detection code (here: 5d41da5d5a8b179ca1e97f6062f9f7d4f5d259a2) and checked it against their published EIA results (https://zenodo.org/record/4116342).

For imputation, after some poking around, I decided to try Xinyu Chen's LATC-TNN method (Python code: https://github.com/xinychen/tensor-learning/tree/master/mats, description: https://arxiv.org/abs/2006.10436). The imputations compare very well to the published results, with percent deviations (abs(ruggles_imputed - imputed) / ruggles_imputed) of 3.6% +- 5.3%. Here are some screenshots (light grey is cleaned Ruggles timeseries, dark grey is mine, orange dots mark the imputed values):

example-1 example-2 example-3

Shall we run with this? It would take me a lot longer to figure out how to replicate their MICE method in Python (using say, https://github.com/AnotherSamWilson/miceforest). If there is time, however, I could follow the same validation procedure (https://www.nature.com/articles/s41597-020-0483-x#Sec10) by nulling good values and comparing them to the imputed values.

zaneselvans commented 3 years ago

Wow, that's awesome! Do you really understand the autoregressive tensor regression method? Or was it just clearly solving the same general type of problem? I looked at the preprint and feel like I only have a very vague feeling of what it's doing... Like it's folding together the different periodicities (days, weeks, years -- do you have to specify these, or does it identify them based on autocorrelations?) on top of each other, within a given time series, which in this dataset would correspond to a particular balancing authority, and estimating what the values should be based on where within each of those periodicities the missing values fall, and what the observed values in each of those periodicities is? But there's no information being communicated between the different time series based on their adjacency in the overall electricity grid / geographic proximity (which I believe is part of the MICE method).

Does it take long to run? How much work do you think it would be to validate the method on some artificially nulled values? Is this only implemented in the notebooks from their repo? It seems like it should get integrated into a Python package. I suspect this is plenty good for our purposes at least initially, and much better than the simplistic stuff I was thinking of doing.

Also nice job organizing the anomaly detection and flagging code! It's so much more uniform now. Not sure I understand all the median-of-medians-of-diffs... but I hadn't dug into the codes before.

truggles commented 3 years ago

This is really exciting to see! Nice work @ezwelty

A few thoughts immediately coming to mind:

@ezwelty, we found that looking at the biases by time of day, day of week, and month of year was very helpful. Making some tables or figures like these would help you spot if the method is having issues capturing any specific time of X features: https://www.nature.com/articles/s41597-020-0483-x/figures/5 (if running on artificially null values takes longer, can just do abs(ruggles_imputed - imputed) / ruggles_imputed) for a quick comparison)
@ezwelty, can this method be run on a single BA in isolation? (sorry didn't read your reference yet)
@zaneselvans, the MICE code does not include any knowledge of geographic proximity. A version of this is essentially calculated as regression coefficients of BA_1 with all other BAs.

ezwelty commented 3 years ago

@zaneselvans I haven't studied the method deeply, but it is designed for problems like ours: imputing multiple strongly-cyclical, correlated timeseries simultaneously. The implementation (if not also the theory) is limited to 3-dimensions, so in our case balancing authority (BA) x day of year x hour of day. I believe placing the principal, diurnal, cycle on its own dimension offers computational advantage, but the method also captures autoregressive (seasonal) processes along the day axis. Just like with MICE, other balancing authorities are used for imputation by leveraging their correlation (which are likely a function of spatial proximity, but need not be). Yes, the method only exists in Jupyter Notebooks, so I would paste it into PUDL.

The full four years of EIA data (54 BA x 1461 days x 24 hours) takes 472 seconds to run single-threaded and closely matches the MICE results (percent error: 3.4% +- 5.3%). Running each year separately takes only 35 seconds each and gives very similar results.

@truggles Thanks for jumping in! How do those timings compare to your MICE routine? Like MICE, the method does not work on a single BA in isolation, since it relies on the (very valuable) correlation between them. For a single BA, perhaps a Kalman Filter could be used?

As for validation, here is what I've now done:

Extract the length and number of null runs from each cleaned (but not imputed) series.
Insert the same number and length of null runs randomly into the timeseries such that they do not touch existing null values or each other. (The final timeseries has the same run-length distribution of null values, except twice as many). This worked for all but HST, NSB, and WAUW, which have runs that are too long to fit elsewhere, so I did not include them. How does this differ from you insertion logic?
Impute!
Compare imputed values to actual values.

Results are for a single null insertion run. mean percent error: -1.04% mean absolute percent error: 4.80%

Figure_1 Figure_2 Figure_3

BA	MPE	MAPE
AEC	-0.05%	5.41%
AECI	0.14%	1.06%
AVA	0.54%	2.42%
AZPS	-0.1%	1.74%
BANC	-1.95%	3.68%
BPAT	-0.53%	1.28%
CHPD	1.13%	4.52%
CISO	-0.51%	2.05%
CPLE	-1.24%	2.47%
CPLW	-0.78%	3.48%
DOPD	-0.35%	3.36%
DUK	-0.34%	2.12%
EPE	-0.78%	3.13%
ERCO	0.52%	2.57%
FMPP	0.25%	1.61%
FPC	0.79%	1.66%
FPL	-0.92%	1.87%
GCPD	-0.17%	2.25%
GVL	0.4%	3.89%
IID	-2.59%	6.99%
IPCO	-1.09%	3.41%
ISNE	-0.29%	0.86%
JEA	-0.26%	2.76%
LDWP	-0.47%	2.44%
LGEE	2.3%	3.48%
MISO	-0.01%	0.45%
NEVP	-0.46%	2.33%
NWMT	0.43%	1.94%
NYIS	-0.74%	0.83%
PACE	-0.42%	3.11%
PACW	-0.29%	3.38%
PGE	-1.33%	2.8%
PJM	-0.82%	1.63%
PNM	-0.8%	1.53%
PSCO	-0.55%	2.64%
PSEI	0.51%	2.28%
SC	-1.94%	4.08%
SCEG	1.74%	2.21%
SCL	-2.35%	3.02%
SOCO	0.27%	3.19%
SPA	0.24%	8.09%
SRP	-0.17%	2.26%
SWPP	-0.93%	3.73%
TAL	-0.65%	3.43%
TEC	0.26%	0.85%
TEPC	-6.24%	19.0%
TIDC	-0.12%	2.03%
TPWR	-0.18%	2.72%
TVA	0.72%	2.87%
WACM	-0.49%	2.34%
WALC	-0.97%	6.17%

truggles commented 3 years ago

I was usually running 2hr+ on two cores while my colleagues with 4 cores was closer to 1hr for the full 54 BA x 1461 days x 24 hours. Is your 472s time for a single run?

Your imputation comparison method sounds exactly like ours.

For the performance of MICE: avg of all MAPE values = 3.5%, with min 1.3% max 11% avg of all MPE values = 0.33%, with min -0.38% max 4.0% The results from updating to 5 years of data have a nice csv format if you want to compare: https://github.com/truggles/EIA_Cleaned_Hourly_Electricity_Demand_Data/blob/master/data/release_2020_Oct/imputation_results_and_biases.csv For the 4 year results you will have to look at table 3 in the paper.

We had trouble imputing TEPC as well, but did not see quite as large of deviations as in your numbers. MAPE 19% (you) 11.6% (us) MPE -6.24% (your) 0.76% (us) Maybe worth a visual inspection, we did not spot any clear "problems" with the imputations in ours.

These results are for a single run? We used an average of 16 "runs" as our final result (each imputed value was averaged across 16 chains (runs) to calculate final imputed result). In this case, the time I gave above is for all 16 chains. You might want to see if results are improving with averaging over a few runs.

The figures you are sharing are for all BAs combined, correct? Fig 5 is for CISO here: https://www.nature.com/articles/s41597-020-0483-x/figures/5

ezwelty commented 3 years ago

@truggles

Is your 472s time for a single run? These results are for a single run?

The method is iterative, but it is deterministic as it solves all BA timeseries simultaneously, so it is only run once. It took 472 s for one 4-year block, and 128 s if run as four 1-year blocks. The validation results I am reporting are for the latter, since they are both faster and just as good as the 4-year block results.

My previous report was for one randomly-nulled and imputed dataset. To better match your paper (I think), here are results averaged over 16 randomly-nulled datasets.

BA mean percent error (MPE), averaged across 16 runs, averaged across all BA: -0.32%
BA mean absolute percent error (MAPE), averaged across 16 runs, averaged across all BA: 3.17%

BA	MPE (%)	MAPE (%)
AEC	-0.27	4.16
AECI	0.16	1.87
AVA	-0.04	2.7
AZPS	-0.05	2.88
BANC	-0.26	4.11
BPAT	0.14	1.19
CHPD	-0.95	5.17
CISO	-0.24	1.99
CPLE	-0.39	2.92
CPLW	-0.53	3.46
DOPD	-0.48	3.57
DUK	0.18	2.36
EPE	-0.15	2.88
ERCO	0.08	3.28
FMPP	0.16	1.52
FPC	-0.04	1.94
FPL	0.03	2.18
GCPD	0.01	2.46
GVL	0.11	3.8
IID	-1.77	7.35
IPCO	-0.97	3.5
ISNE	0.03	0.9
JEA	0.24	3.24
LDWP	-0.54	2.58
LGEE	-0.5	2.62
MISO	-0.05	0.46
NEVP	-0.26	2.17
NWMT	0.09	2.31
NYIS	-0.52	1.25
PACE	-0.15	2.88
PACW	-0.1	3.29
PGE	-0.44	2.82
PJM	0.08	2.01
PNM	0.13	1.71
PSCO	0.11	3.34
PSEI	0.18	3.31
SC	-1.82	4.81
SCEG	0.67	2.35
SCL	-0.44	2.38
SOCO	-0.02	2.8
SPA	-1.16	8.49
SRP	-0.44	2.73
SWPP	0.36	2.48
TAL	-0.59	3.48
TEC	-0.79	2.24
TEPC	-2.73	11.8
TIDC	-0.43	2.62
TPWR	-0.23	2.61
TVA	0.12	2.91
WACM	-0.25	2.96
WALC	-1.54	6.59

Boxplots of PME (across 16 runs) for CISO only:

Figure_1 Figure_2 Figure_3b

The TEPC results are so bad because the 4-month true gap is often (but not always) followed only a few days later by an additional synthetic 4-month gap. The placement of that second 4-month gap has a huge influence on the validation result.

Your imputation comparison method sounds exactly like ours.

Except how did you squeeze in runs of null values that were too long to fit elsewhere in the series? For example, HST has 648 null values, but no run of good values of that length. In that case, did you split it into two shorter runs?

zaneselvans commented 3 years ago

Oh @ezwelty as you move into applying these methods to the FERC 714, @truggles also wrote up a comparison of our FERC 714 and his EIA 930 hourly demand time series that might be useful: http://thruggles.com/archives/2020/06/2162/comparing-ferc-and-eia-electricity-demand-data/

truggles commented 3 years ago

@ezwelty

Your imputation comparison method sounds exactly like ours.
Except how did you squeeze in runs of null values that were too long to fit elsewhere in the series? For example, HST has 648 null values, but no run of good values of that length. In that case, did you split it into two shorter runs?

Good question. So, not exactly the same. The nulled values were randomly positioned. This could lead to an attempted insertion of null values in a region that is already nulled (therefore not quite doubling the quantity of imputed values), and could also lead to extending the length of already nulled or flagged segments (longer nulled segments that are harder to impute).

The method is iterative, but it is deterministic

Thanks, I did not catch that earlier. The MICE method is non-deterministic. Thus, we run it multiple times with different random seed values and take the average of the imputed value for each hour as the final imputed value.

For the nulled value comparison, we used a single set of nulled values and ran the MICE code 16 times and took the averages. This is just like we did for the main results.

zaneselvans commented 3 years ago

@ezwelty Maybe you already thought of this, and maybe it doesn't matter, but I was wondering how having the times stored in UTC in the FERC 714 demand data will impact the imputation process. I would expect that the the load curves for different balancing authorities would have a higher correlation with each other if they were being aligned based on local rather than universal time.

truggles commented 3 years ago

@zaneselvans and @ezwelty, we thought about this when we were setting up the MICE process. Our thinking was that the geographically similar entities are either in the same timezone or +/-1 hour. And, these geographically similar entities are the ones which you want to have correlated to rely on their local info and when imputing. You do not want entities in Florida correlated with those in California just because they have a similar load shape in general, which could possibly happen if imputing in local time.

I wish we could say we had a conclusive test to point to, but we decided to not go down this route.

Something that would be similar to imputing in local time would be allowing the imputation to use the lag 1, 2, ... and leading 1, 2, ... values from all BAs to determine the best correlation between each BA and all others. Again, we did not test this.

ezwelty commented 3 years ago

@truggles In the latest versions of the code, the Series class is a multivariate timeseries (rather than a single timeseries) and the anomaly detection is much faster. In particular, the anomalous region flag was taking 3m 44s for your 4-year EIA data. I vectorized it: 115d22d5d9915aabc4aa387e9ea4b671095bc600. It now takes 225 ms. In the process, I believe I fixed a bug in the original code (or, perhaps, in my initial version of the original code), which spuriously flagged values at the very start or end of the series that did not meet the criteria. For example:

WACM WACM PSCO PSCO SPA SPA

For the nulled value comparison, we used a single set of nulled values and ran the MICE code 16 times and took the averages. This is just like we did for the main results.

In the future, given how sensitive the quality of the imputation is to which values are randomly nulled when long null runs are present (and thus additional ones are being inserted), the imputation validation process should probably be repeated for multiple sets of nulled values.

@zaneselvans The imputer I am using (https://github.com/xinychen/tensor-learning/blob/master/mats/LATC-TNN-imputer.ipynb) provides support for this in the time_lags argument. I did not see any real improvement by extending time_lags=[1] (used for the results I shared above) to something like time_lags=[1, 2, 3]. So I expect the improvement from using local time would be negligible.

ezwelty commented 3 years ago

@truggles

The nulled values were randomly positioned. This could lead to an attempted insertion of null values in a region that is already nulled (therefore not quite doubling the quantity of imputed values), and could also lead to extending the length of already nulled or flagged segments (longer nulled segments that are harder to impute).

Your null-insertion method results in easier validation tests, as it allows inserted null runs to overlap each other and existing null runs. Using your null-insertion method, and all 51 balancing authorities, I believe we now have an apple-to-apple comparison. Here are the overall results – mean of the mean for each balancing authority – for 8 runs (each with different simulated null values):

MPE (%)	MAPE (%)
-0.07	2.72
-0.08	2.75
-0.25	3.02
-0.39	2.9
-0.03	2.74
-0.15	3.11
0.10	2.97
-0.19	3.00

The mean of these 8 runs is -0.13%, 2.90%. Which actually outperforms the published MICE results!

zaneselvans commented 3 years ago

@ezwelty Not that we should tackle this now since the current setup is already great and we really need to get results back to LBL, but I wonder if there's not a more general way to identify the bad values which should be nulled and imputed, based on their distances from their predicted values using this or another autoregression? Like which kinds of anomalies are we flagging that are statistically identifiable as bad in a generic way (e.g. zeroes and deltas), and which if any are domain specific and need to be handled by us specially (e.g. maybe identical runs)? It seemed like there were a lot of parameters involved in the bad data identification, and I wonder if they're really optimal. Especially with the 1000x speedup, it seems like many more combinations could be tried to optimize their selection now.

Also, I wonder how much work it would be to get the imputation method to conform to the scikit-learn API so it can be integrated there and used by others more generally. Would the method benefit from the more generalized parameter grid search that's available in scikit-learn?

catalyst-cooperative / pudl

Quantify and categorize bad data in FERC 714 demand time series #598