arundo / adtk

A Python toolkit for rule-based/unsupervised anomaly detection in time series
https://adtk.readthedocs.io
Mozilla Public License 2.0
1.06k stars 143 forks

Question on LevelShiftAD #131

Closed: phaabe closed this issue 3 years ago

phaabe commented 3 years ago

I have created a simple example.

from adtk.detector import LevelShiftAD
from adtk.visualization import plot
import pandas

d1 = [1, 10000, 1,     10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 1]
d2 = [1, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 1]
s = pandas.Series(d1, index=pandas.date_range("2021-01-01", periods=len(d1)))  # swap d1 for d2 to compare

level_shift_ad = LevelShiftAD(c=6.0, side='both', window=2)
anomalies = level_shift_ad.fit_detect(s)

plot(s, anomaly=anomalies, anomaly_color='red');

With d2 two anomalies are detected. With d1 no anomalies are detected.

Why? Or maybe the question must be: What should I look at to understand? :)

Thanks

earthgecko commented 3 years ago

Hi @phaabe

This is because the data set you are using is very small.

However, if you use data sets that are a bit larger, you should see the results you were expecting. Transformations and computations of this nature on very small samples may not always work as you expect them to.

You are using 14 values; try with 16 and you will see the result you were expecting.

d1 = [1, 10000, 1,     10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 1]
d1Longer = [1, 10000, 1, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 1]
s = pandas.Series(d1Longer, index=pandas.date_range("2021-01-01", periods=len(d1Longer)))
level_shift_ad = LevelShiftAD(c=6.0, side='both', window=2)
anomalies = level_shift_ad.fit_detect(s)
plot(s, anomaly=anomalies, anomaly_color='red')

Internally the LevelShiftAD algorithm runs through a number of transforms on the data. The DoubleRollingAggregate is the first and it will calculate very different values for d1 and d2 even in the first RollingAggregate, never mind the second.

# First rolling aggregate (window of 2, median) for d1
s = pandas.Series(d1, index=pandas.date_range("2021-01-01", periods=len(d1)))
print(s.rolling(2).median())
# Same aggregate for d2
s = pandas.Series(d2, index=pandas.date_range("2021-01-01", periods=len(d2)))
print(s.rolling(2).median())
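
For readers without pandas at hand, here is a minimal standard-library sketch of what the two `rolling(2).median()` calls compute. The helper `rolling_median` is a hypothetical stand-in, not part of pandas or adtk, and it uses `None` where pandas would show `NaN`.

```python
from statistics import median


def rolling_median(values, window):
    """Trailing rolling median; positions without a full window are None."""
    return [
        median(values[i - window + 1:i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]


# The two 14-value series from the question.
d1 = [1, 10000, 1] + [10000] * 10 + [1]
d2 = [1] + [10000] * 12 + [1]

r1 = rolling_median(d1, 2)
r2 = rolling_median(d2, 2)

# Positions where the first rolling aggregate differs between d1 and d2.
differing = [i for i, (a, b) in enumerate(zip(r1, r2)) if a != b]

print(r1[:5])     # [None, 5000.5, 5000.5, 5000.5, 10000.0]
print(r2[:5])     # [None, 5000.5, 10000.0, 10000.0, 10000.0]
print(differing)  # [2, 3]
```
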

As you can see after running those ^^, d1 and d2 are markedly different after the first rolling aggregate. I will not go so far as to break down each computation and its output, as internally the Pipenet is quite complex.
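
To see why the series length matters, the following plain-Python sketch reproduces the idea. It assumes, per the adtk documentation, that LevelShiftAD compares the medians of two adjacent windows and flags differences above a bound derived from the interquartile range, taken here as Q3 + c * IQR. The helper names, the exact window alignment, and the linear-interpolation quantile are assumptions for illustration, not adtk's actual code.

```python
from statistics import median


def double_rolling_median_diff(values, window):
    """|median(right window) - median(left window)| wherever both windows are full."""
    diffs = {}
    for t in range(window, len(values) - window + 1):
        left = median(values[t - window:t])    # the `window` points before t
        right = median(values[t:t + window])   # the `window` points from t on
        diffs[t] = abs(right - left)
    return diffs


def quantile(sorted_vals, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def level_shift_positions(values, window=2, c=6.0):
    """Return (upper bound, positions whose window difference exceeds it)."""
    diffs = double_rolling_median_diff(values, window)
    ordered = sorted(diffs.values())
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    bound = q3 + c * (q3 - q1)  # assumed form of the IQR-based bound
    return bound, [t for t, d in diffs.items() if d > bound]


d1 = [1, 10000, 1] + [10000] * 10 + [1]         # 14 values
d2 = [1] + [10000] * 12 + [1]                   # 14 values
d1_longer = [1, 10000, 1] + [10000] * 12 + [1]  # 16 values

print(level_shift_positions(d1))         # (17498.25, [])
print(level_shift_positions(d2))         # (0.0, [2, 12])
print(level_shift_positions(d1_longer))  # (0.0, [3, 4, 14])
```

Under these assumptions, three of the eleven window differences in the 14-value d1 are nonzero, which pulls the 75th percentile above zero and puts the bound (17498.25) out of reach of the 4999.5 jumps. With 16 values the same three jumps are a small enough fraction that Q3 falls back to 0, so they are flagged.
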

phaabe commented 3 years ago

Great, thanks