facebookresearch / Kats

Kats, a kit to analyze time series data, is a lightweight, easy-to-use, generalizable, and extendable framework for time series analysis: from understanding key statistics and characteristics and detecting change points and anomalies, to forecasting future trends.
MIT License

Multivariate Anomaly Detector (Error when running tutorial) #89

Open loicduffar opened 3 years ago

loicduffar commented 3 years ago

I get the error below when I run the tutorial kats_202_detection.ipynb (https://github.com/facebookresearch/Kats/blob/master/tutorials/kats_202_detection.ipynb). Any clue?

KeyError: Timestamp('2019-12-23 23:59:58.142906')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-265-b73347c84449> in <module>
      3 d = MultivariateAnomalyDetector(multi_anomaly_ts, params, training_days=60)
      4 display(params)
----> 5 anomaly_score_df = d.detector()
      6 
      7 d.plot()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\kats\detectors\outlier.py in detector(self)
    300         while fcstTime < self.df.index.max():
    301             # forecast for fcstTime+ 1
--> 302             pred_df = self._generate_forecast(fcstTime)
    303             # calculate anomaly scores
    304             anomaly_scores_t = self._calc_anomaly_scores(pred_df)

Chima-21 commented 3 years ago

Excellent package. I have the same issue when running the example and other multivariate anomaly detections. It seems to stem from pandas' handling of the 'time' column.

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1575417598142906000

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

KeyError: Timestamp('2019-12-03 23:59:58.142906')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
    701         try:
--> 702             return Index.get_loc(self, key, method, tolerance)
    703         except KeyError as err:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 

KeyError: Timestamp('2019-12-03 23:59:58.142906')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-51-5b7def4515a6> in <module>
      1 params = VARParams(maxlags=2)
      2 d = MultivariateAnomalyDetector(multi_anomaly_ts, params, training_days=40)
----> 3 anomaly_score_df = d.detector()
      4 d.plot()

~\anaconda3\lib\site-packages\kats\detectors\outlier.py in detector(self)
    300         while fcstTime < self.df.index.max():
    301             # forecast for fcstTime+ 1
--> 302             pred_df = self._generate_forecast(fcstTime)
    303             # calculate anomaly scores
    304             anomaly_scores_t = self._calc_anomaly_scores(pred_df)

~\anaconda3\lib\site-packages\kats\detectors\outlier.py in _generate_forecast(self, t)
    244             "index"
    245         )
--> 246         test = self.df.loc[t + dt.timedelta(days=self.granularity_days), :]
    247         pred_df["actual"] = test
    248 

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1098     def _getitem_tuple(self, tup: tuple):
   1099         with suppress(IndexingError):
-> 1100             return self._getitem_lowerdim(tup)
   1101 
   1102         # no multi-index, so validate all of the indexers

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
    836                 # We don't need to check for tuples here because those are
    837                 #  caught by the _is_nested_tuple_indexer check above.
--> 838                 section = self._getitem_axis(key, axis=i)
    839 
    840                 # We should never have a scalar section here, because

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1162         # fall thru to straight lookup
   1163         self._validate_key(key, axis)
-> 1164         return self._get_label(key, axis=axis)
   1165 
   1166     def _get_slice_axis(self, slice_obj: slice, axis: int):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
   1111     def _get_label(self, label, axis: int):
   1112         # GH#5667 this will fail if the label is not present in the axis.
-> 1113         return self.obj.xs(label, axis=axis)
   1114 
   1115     def _handle_lowerdim_multi_index_axis0(self, tup: tuple):

~\anaconda3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
   3771                 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3772         else:
-> 3773             loc = index.get_loc(key)
   3774 
   3775             if isinstance(loc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
    702             return Index.get_loc(self, key, method, tolerance)
    703         except KeyError as err:
--> 704             raise KeyError(orig_key) from err
    705 
    706     def _maybe_cast_for_get_loc(self, key) -> Timestamp:

KeyError: Timestamp('2019-12-03 23:59:58.142906')


pbosch commented 3 years ago

@loicduffar @Chima-21 Do you use Windows or Linux? I encountered the issue myself on Windows using Miniconda (4.10.3); using WSL or native Ubuntu does not produce the issue. In WSL/Ubuntu it works with both native Python and Miniconda.

The culprit is in the granularity calculation:

        if len(time_diff.unique()) == 1:  # check constant frequency
            freq = time_diff.unique()[0].astype("int")
            self.granularity_days = freq / (24 * 3600 * (10 ** 9))
        else:
            raise RuntimeError(
                "Frequency of metrics is not constant."
                "Please check for missing or duplicate values"
            )

In WSL/Ubuntu I get a straight 1.0 for the example data; on Windows I get -2.149413925925926e-05. More precisely, the issue is in the astype call: it defaults to int32 on Windows, which causes an overflow. Using "int64" instead of "int" solves the problem.

I would recommend, in this instance and in general, using explicit types instead of assuming the default type is correct. int on Windows usually defaults to 32-bit, while float usually defaults to 64-bit. By marking them explicitly as int64 and float64, whether with numpy or pandas, you would avoid this issue entirely and make the code a bit more robust. A quick search for astype in the repository shows that it is mostly implicit, so this kind of problem may occur in other places as well.
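The overflow can be reproduced on any platform, without Kats, by forcing the cast that `astype("int")` resolves to on Windows (a minimal sketch of the failure mode, not Kats' actual code path):

```python
import numpy as np

# Kats derives granularity_days from the nanosecond gap between timestamps.
# One day in nanoseconds does not fit in 32 bits, so on a platform whose
# default integer is int32 (Windows), astype("int") silently wraps around.
freq_ns = 24 * 3600 * 10**9                          # 86_400_000_000_000 ns per day

wrapped = np.array([freq_ns]).astype(np.int32)[0]    # C-style overflow to int32
granularity_bad = int(wrapped) / (24 * 3600 * 10**9)
granularity_ok = np.array([freq_ns]).astype(np.int64)[0] / (24 * 3600 * 10**9)

print(granularity_bad)   # -2.149413925925926e-05 -- the Windows value reported above
print(granularity_ok)    # 1.0
```

The negative, near-zero granularity is then used as a `timedelta` offset, which is why the lookup in `_generate_forecast` misses the index and surfaces as the `KeyError` above.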

Chima-21 commented 3 years ago

> @loicduffar @Chima-21 Do you use Windows or Linux? [...] Using int64 instead of int solves the problem.

Many thanks. Issue resolved!!

krishpn commented 2 years ago

@pbosch Maybe related, but when I run the code below I get the same error. Is there a specific data type required for the dates column?

I have a pandas dataframe multi_ts with a 'time' column converted to a time series object using pandas.to_datetime(multi_ts['time']).

```python
from kats.models.var import VARModel, VARParams
from kats.detectors.outlier import MultivariateAnomalyDetector, MultivariateAnomalyDetectorType

params = VARParams(maxlags=2)
m = VARModel(multi_ts, params)
m.fit()
steps = 100

params = VARParams(maxlags=2)
d = MultivariateAnomalyDetector(multi_ts, params, training_days=60)
anomaly_score_df = d.detector()
d.plot()
```

```
/Documents/personal/resume/test.ipynb Cell 80' in <cell line: 10>()

File ~/miniconda3/lib/python3.8/site-packages/kats/detectors/outlier.py:191, in MultivariateAnomalyDetector.__init__(self, data, params, training_days, model_type)
    189     self.granularity_days: float = freq / (24 * 3600 * (10 ** 9))
    190 else:
--> 191     raise RuntimeError(
    192         "Frequency of metrics is not constant."
    193         "Please check for missing or duplicate values"
    194     )
    196 self.training_days = training_days
    197 self.detector_model = model_type

RuntimeError: Frequency of metrics is not constant.Please check for missing or duplicate values
```
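This RuntimeError fires whenever the gaps between consecutive timestamps are not all identical. A quick sanity check on the 'time' column before constructing the detector (a sketch with hypothetical data standing in for multi_ts):

```python
import pandas as pd

# Hypothetical stand-in for multi_ts: note the missing 2020-01-03 row.
multi_ts = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
    "v0": [1.0, 2.0, 3.0],
})

# Kats requires every gap between consecutive timestamps to be identical.
diffs = multi_ts["time"].diff().dropna().unique()
is_constant = len(diffs) == 1

print(is_constant)  # False here -> Kats would raise the RuntimeError above
```

Printing `diffs` shows exactly which gap sizes occur, which makes the missing or duplicated rows easy to locate.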

michaelbrundage commented 2 years ago

Renaming the issue to the root cause.

Most likely, we should mark Kats as requiring 64-bit. I don't think we intend to support legacy 32-bit Python installations.
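For context, the overflow does not require a 32-bit interpreter: on Windows, NumPy's default integer historically followed the C long, which is 32-bit even on 64-bit Windows (NumPy 2.0 changed the default to int64). A quick way to inspect both widths (a sketch):

```python
import struct

import numpy as np

pointer_bits = struct.calcsize("P") * 8        # 64 on a 64-bit interpreter
default_int_bits = np.dtype(int).itemsize * 8  # 32 on Windows with NumPy < 2.0

print(pointer_bits, default_int_bits)
```

So a 64-bit requirement alone would not have prevented this particular bug; the explicit int64 cast still matters on NumPy 1.x.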

waqarahmed6095 commented 1 week ago

I had this issue on Windows, but I do not have it on Linux (WSL).