arundo / adtk

A Python toolkit for rule-based/unsupervised anomaly detection in time series
https://adtk.readthedocs.io
Mozilla Public License 2.0

VolatilityShiftAD cannot detect negative anomaly. #132

Closed · arthemis911222 closed this issue 2 years ago

arthemis911222 commented 2 years ago

I used VolatilityShiftAD and set 'side=both/positive/negative' to see how the results differ, but they are exactly the same: VolatilityShiftAD cannot detect the negative anomaly.

('seismic-new.csv': the anomalous data from 'seismic.csv' copied to the end of the file)

import pandas as pd
import matplotlib.pyplot as plt
from adtk.data import validate_series
from adtk.detector import VolatilityShiftAD
from adtk.visualization import plot

pipnet = VolatilityShiftAD(window=20)
s = pd.read_csv('~/data/seismic-new.csv', index_col="Time", parse_dates=True, squeeze=True)
s = validate_series(s)

# run the internal Pipenet so the intermediate results are returned as well
anomalies = pipnet.pipe_.fit_detect(s, return_intermediate=True)

plot(anomalies["diff_abs"])
# plot(s, anomaly=anomalies, anomaly_color='red')
plt.savefig("VSAD-std.png")

anomalies: [screenshot: VolatilityShiftAD-3]

diff_abs (std): [screenshot: VSAD-std]

I used Excel to calculate the std at two times, '08-05 15:05:00' (left window < 15:05:00, right window > 15:05:00) and '08-05 16:56:00', and the results are different from what VolatilityShiftAD calculates.

[screenshots: Excel std calculations, 2021-08-05 5:59 PM and 6:00 PM]
earthgecko commented 2 years ago

Hi @arthemis911222

I do not understand how you created seismic-new.csv, so I am just using the normal seismic.csv to explain why side="negative" does not detect anomalies.

It does not detect any anomalies when set to side="negative" because, given the nature of the data, the algorithms used internally do not resolve any anomalies.

In this specific case the final result of VolatilityShiftAD is determined by combining the iqr_ad and sign_check steps with a logical AND.

                "and": {
                    "model": AndAggregator(),
                    "input": ["iqr_ad", "sign_check"],
                },

The model AndAggregator()

identifies a time point as anomalous only if it is included in all the input anomaly lists.

The important part is only if it is included in all the input anomaly lists, meaning a time point must equal 1 in both iqr_ad and sign_check.
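To make the AND behaviour concrete, here is a minimal sketch with made-up boolean series (not data from this issue), assuming AndAggregator's aggregate() accepts a DataFrame of binary series as described in the adtk docs:

import pandas as pd
from adtk.aggregator import AndAggregator

# Two hand-made boolean anomaly series over the same index.
idx = pd.date_range("2021-08-05", periods=4, freq="min")
iqr_ad = pd.Series([True, True, False, False], index=idx)
sign_check = pd.Series([True, False, True, False], index=idx)

# AndAggregator flags a time point only if every input series flags it.
combined = AndAggregator().aggregate(
    pd.concat({"iqr_ad": iqr_ad, "sign_check": sign_check}, axis=1)
)
print(combined)  # only the first timestamp is flagged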

We can run through each of these steps and see why nothing is detected with side="negative".

So from the beginning.

import pandas as pd
from adtk.data import validate_series
from adtk.visualization import plot
from adtk.detector import VolatilityShiftAD

csv_file = '/tmp/adtk/docs/notebooks/data/seismic.csv'
s = pd.read_csv(csv_file, index_col="Time", parse_dates=True, squeeze=True)
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(c=6.0, side='both', window=20)
anomalies = volatility_shift_ad.fit_detect(s)
plot(s, anomaly=anomalies, anomaly_color='red')

[plot: seismic series with the detected volatility shift highlighted in red]

As expected, a volatility shift was detected.

Let us now demonstrate the issue you are describing.

s = pd.read_csv(csv_file, index_col="Time", parse_dates=True, squeeze=True)
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(c=6.0, side='negative', window=20)
anomalies = volatility_shift_ad.fit_detect(s)
plot(s, anomaly=anomalies, anomaly_color='red')

[plot: seismic series with side="negative", no anomalies highlighted]

As you said, no volatility shift is detected. You would naturally expect it to detect one where the troughs reach the -50 and -100 range, but that is not the case.

If we look at the methods and data, it is possible to see why there is no anomaly detected.

Let us start with side="positive"

pipnet = VolatilityShiftAD(window=20, side="positive")
s = pd.read_csv(csv_file, index_col="Time", parse_dates=True, squeeze=True)
s = validate_series(s)
anomalies = pipnet.pipe_.fit_detect(s, return_intermediate=True)

First let us look at the iqr_ad values, which are calculated by InterQuartileRangeAD using diff_abs as the input data (diff_abs being the result of running the series through DoubleRollingAggregate).

plot(anomalies["diff_abs"])

[plot: diff_abs]

And diff_abs is used to calculate iqr_ad values

plot(anomalies["iqr_ad"])

[plot: iqr_ad]

As we can see, around the anomaly region the value is 1.
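For intuition, here is a rough hand-rolled sketch of the interquartile-range rule applied to the diff_abs intermediate series; the multiplier c=6.0 is only for illustration, and the pipeline's exact InterQuartileRangeAD parameterization is not reproduced here:

# Flag points of diff_abs that lie far above the third quartile.
diff_abs_series = anomalies["diff_abs"]
q1, q3 = diff_abs_series.quantile(0.25), diff_abs_series.quantile(0.75)
c = 6.0  # illustration only
manual_iqr_flags = diff_abs_series > (q3 + c * (q3 - q1))
print(manual_iqr_flags.sum())  # number of points with an unusually large volatility change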

Now let us look at sign_check, which is calculated by ThresholdAD using diff as the input data (diff being the result of running the series through DoubleRollingAggregate).

plot(anomalies["diff"])

[plot: diff]

And now the sign_check result

plot(anomalies["sign_check"])

[plot: sign_check]

All 1s.
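As a hedged illustration (the exact thresholds VolatilityShiftAD configures internally are an assumption here), a sign check of this kind can be expressed with ThresholdAD: flagging points where diff > 0 corresponds to the right-window std exceeding the left-window std, which is the side="positive" case.

from adtk.detector import ThresholdAD

# Flag time points where the rolling-std difference is positive (right window more volatile).
sign_detector = ThresholdAD(high=0.0)
manual_sign_check = sign_detector.detect(anomalies["diff"])
print(manual_sign_check.sum())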

Now let's plot iqr_ad and sign_check together.

iqr_ad = anomalies['iqr_ad']
sign_check = anomalies['sign_check']
data = {'iqr_ad': iqr_ad, 'sign_check': sign_check}
df_to_plot = pd.DataFrame(data)
df_to_plot.plot(figsize=(18, 6))

[plot: iqr_ad and sign_check plotted together]

You can see that they are both equal to 1 where the anomaly was detected.
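A quick cross-check in plain pandas (not using the aggregator itself) confirms the overlap:

both_flagged = (anomalies["iqr_ad"] == 1) & (anomalies["sign_check"] == 1)
print(both_flagged.sum())  # non-zero: both steps agree around the detected shift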

Now let us do the same with side="negative".

pipnet = VolatilityShiftAD(window=20, side="negative")
s = pd.read_csv(csv_file, index_col="Time", parse_dates=True, squeeze=True)
s = validate_series(s)
anomalies = pipnet.pipe_.fit_detect(s, return_intermediate=True)

Here diff_abs, iqr_ad, and diff are the same as with side="positive", but sign_check is different.

plot(anomalies["sign_check"])

[plot: sign_check with side="negative"]

If we plot iqr_ad and sign_check together, we can see that there is no point where they both equal 1.

iqr_ad = anomalies['iqr_ad']
sign_check = anomalies['sign_check']
data = {'iqr_ad': iqr_ad, 'sign_check': sign_check}
df_to_plot = pd.DataFrame(data)
df_to_plot.plot(figsize=(18, 6))

[plot: iqr_ad and sign_check plotted together, side="negative"]
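The same plain-pandas cross-check for the negative case shows there is no overlap at all:

both_flagged = (anomalies["iqr_ad"] == 1) & (anomalies["sign_check"] == 1)
print(both_flagged.sum())  # 0: no time point is flagged by both steps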

Conclusion

The algorithm(s) are working as expected; it just so happens that the ensemble of algorithms does not agree that there are anomalous negative changes. The historical interquartile range places the volatility shift before the point where you might expect anomalous negative changes to be triggered: by the time the troughs drop down to the -50 and -100 range, the historical interquartile range has dropped back down to 0, so no matter how volatile the diff/sign_check may be, it is not anomalous in terms of the definitions used by VolatilityShiftAD.

I hope this explains the issue for you. With regards to your Excel question, I have no comment.

arthemis911222 commented 2 years ago

Thanks for your answer, but I still have some questions. @earthgecko

VolatilityShiftAD should detect an anomaly when the 'std' values of the left sliding window and the right sliding window are very different, no matter whether the time series goes from 'smooth' (small std) to 'rough' (large std) or from 'rough' to 'smooth', is that right? Maybe like LevelShiftAD.

VolatilityShiftAD detects shift of volatility level by tracking the difference between standard deviations at two sliding time windows next to each other.

In my understanding, smooth-to-rough is like the anomaly in 'seismic.csv'. I created a rough-to-smooth anomaly by copying that data to the end of 'seismic.csv': the std of the copied data is small, while the std of the data just to the left of the copied data is large. (Maybe it is not a good way.) Such as:

[screenshot: anomalies plot (VolatilityShiftAD-3)]

At the created anomaly, the std values of the left window and the right window are also very different, but VolatilityShiftAD cannot detect it. Why?

I wanted to find the answer, so I read the VolatilityShiftAD code and plotted the 'diff_abs' figure:

agg="std"
self.pipe_ = Pipenet(
            {
                "diff_abs": {
                    "model": DoubleRollingAggregate(
                        agg=agg,
                        window=window,
                        center=True,
                        min_periods=min_periods,
                        diff="abs_rel_diff",
                    ),
             ...

In my understanding, the std values of the two windows around the created anomaly are very different, so this should show up clearly in "diff_abs". But it does not. Why? Looking forward to your reply, thanks!
[screenshot: diff_abs plot (VSAD-std)]

earthgecko commented 2 years ago

@arthemis911222 thanks for the description of how you created your data set; I have reproduced it and it is used in the following explanation.

The issue you are experiencing with your method is probably that it does not calculate the diff in the same way DoubleRollingAggregate does. I suspect that your method is calculating a plain diff rather than abs_rel_diff.
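A tiny numeric illustration with made-up window stds (not values taken from the data) shows why the two options behave so differently at a rough-to-smooth transition:

# Made-up standard deviations for the left and right windows at the copied anomaly.
left_std, right_std = 40.0, 2.0
plain_diff = right_std - left_std                    # -38.0: large in absolute terms
abs_rel_diff = abs(right_std - left_std) / left_std  # 0.95: small relative to the noisy left window
print(plain_diff, abs_rel_diff)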

Step by step, here is what the different diff methods produce on your created anomaly time series.

from adtk.transformer import DoubleRollingAggregate, RollingAggregate

window = 20  # same window length used throughout this thread
s_copy = s.copy()
s_transformed = DoubleRollingAggregate(
            agg='std',
            window=window,
            center=True,
            min_periods=None,
            diff="abs_rel_diff").transform(s_copy).rename("Diff double rolling std (mm)")
plot(pd.concat([s_copy, s_transformed], axis=1))

[plot: series and its double rolling std (abs_rel_diff)]

This is what VolatilityShiftAD is calculating by default ^^

The DoubleRollingAggregate steps are

s_copy = s.copy()
s_rolling_left = RollingAggregate(
            agg='std',
            window=window,
            center=False,
            min_periods=None).transform(s_copy.shift(1)).rename("rolling left - std (mm)")
plot(s_rolling_left)

[plot: rolling left window std]

rolling left window ^^

s_rolling_right = pd.Series(
    RollingAggregate(
        agg='std',
        window=20,
        center=False,
    )
    .transform(s_copy.iloc[::-1])
    .iloc[::-1]
)
s_rolling_right.name = "rolling right - std (mm)"
plot(s_rolling_right)

[plot: rolling right window std]

rolling right window ^^

Now, the part which I suspect you may not be replicating in your method is the diff_abs calculation. We can now look at the results of the different diff methods.

diff_abs = abs(s_rolling_right - s_rolling_left) / s_rolling_left
plot(diff_abs, title='diff_abs')

[plot: diff_abs]

diff = s_rolling_right - s_rolling_left
plot(diff, title='diff')

[plot: diff]

If you want to reproduce the results, ensure that you use abs(s_rolling_right - s_rolling_left) / s_rolling_left to calculate your diff.
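As a sanity check (assuming the series computed in the blocks above are still in scope), the hand-rolled diff_abs should agree with the DoubleRollingAggregate output, NaN edges aside:

# Maximum absolute difference between the manual calculation and the transformer output;
# it should be (close to) zero if the manual steps mirror the transformer exactly.
print((diff_abs - s_transformed).abs().max())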

I hope this explains why, good luck.

And even if the VolatilityShiftAD class were modified to use diff instead of diff_abs, the outcome would probably not be what you assume it would be.

                "diff_abs": {
                    "model": DoubleRollingAggregate(
                        agg=agg,
                        window=window,
                        center=True,
                        min_periods=min_periods,
                        diff="diff",
                    ),
                    "input": "original",
                },

[plot: diff_abs step computed with diff="diff"]

arthemis911222 commented 2 years ago

I get it! The std value of the left window at the first anomaly is small, while at the second anomaly it is large, and that affects the (relative) result. Thanks a lot! @earthgecko