arundo / adtk

A Python toolkit for rule-based/unsupervised anomaly detection in time series
https://adtk.readthedocs.io
Mozilla Public License 2.0

RuntimeError: the model must be trained first #65

Closed FGG100y closed 4 years ago

FGG100y commented 4 years ago

First of all, thanks for the great tool ^^

Here's the code that produces this RuntimeError:

tmdl.py

from adtk.detector import ThresholdAD
from adtk.detector import QuantileAD
from adtk.detector import InterQuartileRangeAD
from adtk.detector import PersistAD
from adtk.detector import LevelShiftAD
from adtk.detector import VolatilityShiftAD
from adtk.detector import SeasonalAD
from adtk.detector import AutoregressionAD

def get_detector(adname="ThresholdAD"):
    detectors = {"ThresholdAD": ThresholdAD,
                 "QuantileAD": QuantileAD,
                 "InterQuartileRangeAD": InterQuartileRangeAD,
                 "PersistAD": PersistAD,
                 "LevelShiftAD": LevelShiftAD,
                 "VolatilityShiftAD": VolatilityShiftAD,
                 "SeasonalAD": SeasonalAD,
                 "AutoregressionAD": AutoregressionAD,
                 }
    return detectors.get(adname)

# using adtk anomaly detectors
def ad_detector(dname, train_data=None, test_data=None, **kwargs):
    Ad = get_detector(dname)
    ad = Ad(**kwargs)
    train_anoms = ad.fit_detect(train_data)
    test_anoms = ad.detect(test_data)
    return train_anoms, test_anoms

I wrote these functions so I could quickly experiment with different detectors by changing the detector's name in the main() function. At least, that was the idea. --!

main.py

...(functions that read the data)

s_train, s_test = split_train_test(data, mode=split_mode, n_splits=n_splits)
train_anoms, test_anoms = [], []
for train, test in zip(s_train, s_test):  # the error shows up in this for loop
    train_anom, test_anom = tmdl.ad_detector(dname='SeasonalAD',
                                             train_data=train,
                                             test_data=test.squeeze(),
                                             c=1, side='both')
    # collect the results
    train_anoms.append(train_anom)
    test_anoms.append(test_anom)

When I ran this piece of code, it raised RuntimeError: the model must be trained first.

Last but not least, when I followed the Quick Start, I got no errors at all.

Any help would be appreciated.

tailaiw commented 4 years ago

@FGG100y That's strange. I ran the following code which I believe is equivalent to what you described above, and it didn't give me any error.

Which version of ADTK are you using?

import numpy as np
import pandas as pd

from adtk.detector import ThresholdAD
from adtk.detector import QuantileAD
from adtk.detector import InterQuartileRangeAD
from adtk.detector import PersistAD
from adtk.detector import LevelShiftAD
from adtk.detector import VolatilityShiftAD
from adtk.detector import SeasonalAD
from adtk.detector import AutoregressionAD
from adtk.data import split_train_test

def get_detector(adname="ThresholdAD"):
    detectors = {"ThresholdAD": ThresholdAD,
                 "QuantileAD": QuantileAD,
                 "InterQuartileRangeAD": InterQuartileRangeAD,
                 "PersistAD": PersistAD,
                 "LevelShiftAD": LevelShiftAD,
                 "VolatilityShiftAD": VolatilityShiftAD,
                 "SeasonalAD": SeasonalAD,
                 "AutoregressionAD": AutoregressionAD,
                 }
    return detectors.get(adname)

def ad_detector(dname, train_data=None, test_data=None, **kwargs):
    Ad = get_detector(dname)
    ad = Ad(**kwargs)
    train_anoms = ad.fit_detect(train_data)
    test_anoms = ad.detect(test_data)
    return train_anoms, test_anoms

data = pd.Series(np.sin(np.arange(100)), index=pd.date_range(start="2020-02-02", periods=100, freq="D"))

s_train, s_test = split_train_test(data, mode=3, n_splits=2)
train_anoms, test_anoms = [], []
for train, test in zip(s_train, s_test):
    train_anom, test_anom = ad_detector(dname='SeasonalAD',
                                        train_data=data,
                                        test_data=data.squeeze(),
                                        c=1, side='both')
    # collect the results
    train_anoms.append(train_anom)
    test_anoms.append(test_anom)

FGG100y commented 4 years ago

@tailaiw Thank you for your reply. I'm using adtk version 0.5.2. I used the same synthetic data as yours, and it reported no error, so I believe something is wrong with my data. This is how I preprocess the data:

import numpy as np

# Mask extreme values (outside the 1%-99% quantile range) as NaN, then replace the NaNs with the median.
# If the NaNs are not replaced, adtk reports "NaNs between valid values were not allowed".
quantiles = data.quantile([0.01, 0.99]).values.flatten()
q_low, q_high = quantiles[0], quantiles[1]
col = fname.split('_')[-1]  # column name taken from the file name
data[data[col] < q_low] = np.nan
data[data[col] > q_high] = np.nan
data = data.fillna(data.median())

The split-train-test timeseries: ts_data_split_mode1

Am I missing something in the adtk docs, or is there something wrong with the data?

FGG100y commented 4 years ago

@tailaiw And this was the data that I used in this case: ts_debug.txt

tailaiw commented 4 years ago

@FGG100y It looks like the problem is related to the fact that your input is a DataFrame instead of a Series object. I will look into this. It is probably a bug. Thanks for catching this!

I noticed your data is univariate. So until we fix the problem, you can put your data in a Series instead of a single-column DataFrame. I replaced the synthetic data with your data (i.e. replaced the data-generation line with data = pd.read_csv("./ts_debug.txt", parse_dates=True, squeeze=True, index_col=0)), and it returns no error. If I load the data with squeeze=False, i.e. into a DataFrame, it hits the RuntimeError you mentioned.
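For readers on newer pandas, where the squeeze argument of read_csv is no longer available (it was removed in pandas 2.0), a minimal sketch of the same workaround is to squeeze after loading. The inline CSV below is made-up stand-in data, not the contents of ts_debug.txt, and the column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up stand-in for a univariate CSV file (not the real ts_debug.txt).
csv_text = """time,value
2020-02-02,1.0
2020-02-03,2.0
2020-02-04,3.0
"""

# Loading without squeezing yields a single-column DataFrame.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

# squeeze("columns") turns a single-column DataFrame into a Series.
series = df.squeeze("columns")

print(type(df).__name__, type(series).__name__)
```

Passing the resulting Series for both training and detection keeps the two inputs the same type, which is what avoids the error here.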

FGG100y commented 4 years ago

@tailaiw I followed your suggestion^^ and it solved my problem. Thanks a lot.

tailaiw commented 4 years ago

@FGG100y We dug into the problem you mentioned, and here is what we found:

ADTK contains univariate models and multivariate models. The models you were using are all univariate models. By design, if a univariate model is applied to a DataFrame, it treats each column of the DataFrame as an independent time series.

If a model is trained with a Series and is applied to a DataFrame, ADTK will apply the model to each column independently and return a concatenated DataFrame as output.

If a model is trained with a DataFrame (say with columns "A", "B", and "C"), what happens on the backend is that ADTK trains three separate models, one per column. If the model object is then applied to a DataFrame with the same column names, ADTK will match the trained models with columns automatically. We found this design convenient for the case where a certain type of model is applied to a large number of time series.

If a model trained on a DataFrame is applied to a Series or to a DataFrame with different column names, ADTK will throw an error because it cannot find a matching trained model. This is what caused the error you encountered (note that your training data is a DataFrame while your testing data is a Series because you used the squeeze method).
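The matching rules described above can be sketched roughly as follows. This is a toy illustration using only pandas, not ADTK's actual implementation: the class ToyPerColumnDetector, its 90%-quantile threshold rule, and its error messages are all hypothetical.

```python
import pandas as pd


class ToyPerColumnDetector:
    """Toy stand-in (not ADTK code): flags values above a learned threshold."""

    def __init__(self):
        # None before fitting; a float after a Series fit;
        # a dict {column name: float} after a DataFrame fit.
        self._models = None

    def fit(self, data):
        if isinstance(data, pd.Series):
            self._models = float(data.quantile(0.9))
        else:
            # One independent "model" per column.
            self._models = {c: float(data[c].quantile(0.9)) for c in data.columns}
        return self

    def detect(self, data):
        if self._models is None:
            raise RuntimeError("the model must be trained first")
        trained_on_frame = isinstance(self._models, dict)
        if isinstance(data, pd.Series):
            if trained_on_frame:
                raise ValueError("trained on a DataFrame, cannot apply to a Series")
            return data > self._models
        if not trained_on_frame:
            # A single model is applied to every column independently.
            return data.apply(lambda col: col > self._models)
        missing = set(data.columns) - set(self._models)
        if missing:
            raise ValueError(f"no trained model for columns: {sorted(missing)}")
        # Trained models are matched to columns by name.
        return pd.DataFrame({c: data[c] > self._models[c] for c in data.columns})


frame = pd.DataFrame({"A": [1.0, 2.0, 10.0], "B": [5.0, 5.0, 50.0]})
detector = ToyPerColumnDetector().fit(frame)
print(detector.detect(frame))  # columns matched by name

try:
    detector.detect(frame["A"])  # Series after a DataFrame fit
except ValueError as err:
    print("error:", err)
```

In this sketch the mismatch raises a ValueError that names the cause, rather than a "must be trained first" message, which is the kind of clearer reporting the thread is asking for.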

This logic was not well tested or documented, and the error message was misleading. Thanks to your issue, we noticed this problem and fixed it. We just released a patch v0.5.4 to address this issue. If you see anything missed, please feel free to reopen this issue.