caisr-hh / group-anomaly-detection

GRAND: Group-based Anomaly Detection for Large-Scale Monitoring of Complex Systems
MIT License

What is the expected input of IndividualAnomalyTransductive()? #4

Closed filipwastberg closed 4 years ago

filipwastberg commented 4 years ago

I successfully installed your package and managed to run through the example. However, when I try it with simulated data I get an error message.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

from grand import IndividualAnomalyInductive, IndividualAnomalyTransductive, GroupAnomaly

df = pd.read_csv("https://raw.githubusercontent.com/Ferrologic/simulated-data/master/simulated_data.csv", parse_dates=True, header = 0)

df["timestamp"] = pd.to_datetime(df['timestamp'])

df.columns = ["timestamp", "value"]

df.plot(x = "timestamp", y = "value")

model = IndividualAnomalyTransductive(ref_group = ["day-of-week"], w_martingale = 100)

for t, x in zip(df.index, df.values):
    info = model.predict(t, x)
    print("Time: {} ==> strangeness: {}, deviation: {}".format(t, info.strangeness, info.deviation), end="\r")

And the error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in 
      2 
      3 for t, x in zip(df.index, df.values):
----> 4     info = model.predict(t, x)
      5     print("Time: {} ==> strangeness: {}, deviation: {}".format(t, info.strangeness, info.deviation), end="\r")

/anaconda3/envs/myenv/lib/python3.7/site-packages/grand/individual_anomaly/individual_anomaly_transductive.py in predict(self, dtime, x, external)
     87 
     88         self.T.append(dtime)
---> 89         self._fit(dtime, x, external)
     90 
     91         strangeness, diff, representative = self.strg.predict(x)

/anaconda3/envs/myenv/lib/python3.7/site-packages/grand/individual_anomaly/individual_anomaly_transductive.py in _fit(self, dtime, x, external)
    146             df_sub = self.df.append(self.df_init)
    147             for criterion in self.ref_group:
--> 148                 current = dt2num(dtime, criterion)
    149                 historical = np.array([dt2num(dt, criterion) for dt in df_sub.index])
    150                 df_sub = df_sub.loc[(current == historical)]

/anaconda3/envs/myenv/lib/python3.7/site-packages/grand/utils.py in dt2num(dt, criterion)
     53         elif criterion == "season-of-year":
     54             season = {12: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4}
---> 55             return season[dt.month]
     56         else:
     57             raise InputValidationError("Unknown criterion {} in ref_group.".format(criterion))

AttributeError: 'int' object has no attribute 'month'

What is the expected input in IndividualAnomalyTransductive() and is there any specific documentation besides the example Notebook?

Mohamed-Rafik-Bouguelia commented 4 years ago

The function model.predict(t, x) expects t to be a datetime and x to be a numpy array representing a feature-vector (i.e. data-point).

From what I can see, the index of your dataframe is an integer index. Therefore, at each iteration of the loop for t, x in zip(df.index, df.values): ... the value of t is an integer (not a datetime as expected), and your data-point x includes the timestamp (while it is expected to be a feature-vector, without time).
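To make the two problems visible before calling model.predict(t, x), a small pre-check along the following lines could help. This is only a sketch: check_predict_input is a hypothetical helper and not part of grand, and the rows passed to it are made-up values that mimic the loop before and after the fix.

```python
from datetime import datetime


def check_predict_input(t, x):
    """Hypothetical pre-check (not part of grand) for the (t, x) pair given to model.predict."""
    # t must be a datetime; pandas.Timestamp passes too, since it subclasses datetime.
    if not isinstance(t, datetime):
        raise TypeError("t must be a datetime, got {}".format(type(t).__name__))
    # x must hold feature values only, so no timestamps or date strings inside it.
    if any(isinstance(v, (datetime, str)) for v in x):
        raise TypeError("x must contain only feature values, not timestamps")
    return t, x


# Mimics a row from the original loop: integer index, timestamp inside x.
try:
    check_predict_input(0, [datetime(2019, 1, 1), 1.5])
except TypeError as err:
    first_error = str(err)
    print(first_error)

# Mimics a row after the fix: datetime index, numeric feature vector.
ok = check_predict_input(datetime(2019, 1, 1), [1.5])
```

Calling such a check on the first row of the loop would surface the problem as a readable message instead of the AttributeError raised deep inside dt2num.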

A simple way to fix your code is to use the timestamp column as the index. Here is a working version (with comments added where I changed something):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

from grand import IndividualAnomalyInductive, IndividualAnomalyTransductive, GroupAnomaly

df = pd.read_csv("simulated_data.csv", parse_dates=True, header = 0, index_col=0) # Added index_col=0
df.index = pd.to_datetime(df.index) # The timestamps are now our index column
df.columns = ["value"] # the columns (features) excluding the index

df.plot() # there is no column "timestamp" now, it's the index
plt.show()

model = IndividualAnomalyTransductive(ref_group = ["day-of-week"], w_martingale = 100)

# You can also try with "season-of-year" as the periodicity in your data seems seasonal
# model = IndividualAnomalyTransductive(ref_group = ["season-of-year"], w_martingale = 100)

for t, x in zip(df.index, df.values):
    info = model.predict(t, x)
    print("Time: {} ==> strangeness: {}, deviation: {}".format(t, info.strangeness, info.deviation), end="\r")

# Just added this line to see the results
model.plot_deviations(figsize=(12, 8), plots=["data", "strangeness", "deviation", "pvalue", "threshold"])
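For intuition about what ref_group does, the filtering visible in the traceback (lines 146 to 150 of _fit) can be sketched in plain Python. This is not the library's code: to_group_key and reference_times are hypothetical names, the season table is copied from the traceback, and the use of weekday() for "day-of-week" is an assumption about how dt2num handles that criterion.

```python
from datetime import datetime, timedelta

# Season table copied from the traceback above (grand/utils.py, dt2num).
SEASON = {12: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4}


def to_group_key(dt, criterion):
    """Hypothetical stand-in for dt2num; only two criteria are sketched."""
    if criterion == "day-of-week":
        return dt.weekday()  # assumption: Monday=0 ... Sunday=6
    if criterion == "season-of-year":
        return SEASON[dt.month]
    raise ValueError("Unknown criterion {} in ref_group.".format(criterion))


def reference_times(current, history, ref_group):
    """Keep only historical timestamps that match `current` on every criterion."""
    selected = list(history)
    for criterion in ref_group:
        key = to_group_key(current, criterion)
        selected = [dt for dt in selected if to_group_key(dt, criterion) == key]
    return selected


# Two weeks of made-up daily timestamps starting on a Tuesday.
history = [datetime(2019, 1, 1) + timedelta(days=i) for i in range(14)]
current = datetime(2019, 1, 15)  # also a Tuesday

same_weekday = reference_times(current, history, ["day-of-week"])
same_season = reference_times(current, history, ["season-of-year"])
print(len(same_weekday), len(same_season))
```

Under these assumptions, "day-of-week" compares the current sample only against earlier samples from the same weekday, while "season-of-year" uses a much larger reference set. Both rely on t having datetime attributes such as month, which is why an integer t fails.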

Regarding your second question, the expected input of IndividualAnomalyTransductive() is as described in the example Notebook:

model = IndividualAnomalyTransductive(
            ref_group = ["day-of-week"], # Criteria to use to construct reference data (check the notebook examples to see other possible criteria to use).
            external_percentage = 0.3,   # Percentage of samples to pick from historical data in the case where ref_group is set to "external".

            # The following parameters are the same as in IndividualAnomalyInductive
            non_conformity = "knn",     # Strangeness measure, e.g. "knn" or "median"
            k = 20,                     # Used if non_conformity is "knn"
            w_martingale = 15,          # Window size used for computing the deviation level
            dev_threshold = 0.6,        # Threshold on the deviation level (in [0, 1])
            columns=None                # Optional feature names (for interpreting the results)
)

For the moment, there is no other documentation besides the explanations given in the example Notebook. More documentation will follow in the near future.

filipwastberg commented 4 years ago

That's great. Thanks. I really think that some documentation of the functions would be a great feature.

Furthermore, I think it would be great if we could install the package with pip install git+https://github.com/caisr-hh/group-anomaly-detection, instead of having to clone the whole project and then install it. Is that something you are considering?