Open loicduffar opened 3 years ago
Excellent package. Have the same issue when running the example and other multivariate anomaly detections. Seems to stem from pandas handling of the 'time' column.
```
~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 1575417598142906000
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
KeyError: Timestamp('2019-12-03 23:59:58.142906')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
701 try:
--> 702 return Index.get_loc(self, key, method, tolerance)
703 except KeyError as err:
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
KeyError: Timestamp('2019-12-03 23:59:58.142906')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-51-5b7def4515a6> in <module>
1 params = VARParams(maxlags=2)
2 d = MultivariateAnomalyDetector(multi_anomaly_ts, params, training_days=40)
----> 3 anomaly_score_df = d.detector()
4 d.plot()
~\anaconda3\lib\site-packages\kats\detectors\outlier.py in detector(self)
300 while fcstTime < self.df.index.max():
301 # forecast for fcstTime+ 1
--> 302 pred_df = self._generate_forecast(fcstTime)
303 # calculate anomaly scores
304 anomaly_scores_t = self._calc_anomaly_scores(pred_df)
~\anaconda3\lib\site-packages\kats\detectors\outlier.py in _generate_forecast(self, t)
244 "index"
245 )
--> 246 test = self.df.loc[t + dt.timedelta(days=self.granularity_days), :]
247 pred_df["actual"] = test
248
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
923 with suppress(KeyError, IndexError):
924 return self.obj._get_value(*key, takeable=self._takeable)
--> 925 return self._getitem_tuple(key)
926 else:
927 # we by definition only have the 0th axis
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
1098 def _getitem_tuple(self, tup: tuple):
1099 with suppress(IndexingError):
-> 1100 return self._getitem_lowerdim(tup)
1101
1102 # no multi-index, so validate all of the indexers
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
836 # We don't need to check for tuples here because those are
837 # caught by the _is_nested_tuple_indexer check above.
--> 838 section = self._getitem_axis(key, axis=i)
839
840 # We should never have a scalar section here, because
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1162 # fall thru to straight lookup
1163 self._validate_key(key, axis)
-> 1164 return self._get_label(key, axis=axis)
1165
1166 def _get_slice_axis(self, slice_obj: slice, axis: int):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
1111 def _get_label(self, label, axis: int):
1112 # GH#5667 this will fail if the label is not present in the axis.
-> 1113 return self.obj.xs(label, axis=axis)
1114
1115 def _handle_lowerdim_multi_index_axis0(self, tup: tuple):
~\anaconda3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
3771 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
3772 else:
-> 3773 loc = index.get_loc(key)
3774
3775 if isinstance(loc, np.ndarray):
~\anaconda3\lib\site-packages\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
702 return Index.get_loc(self, key, method, tolerance)
703 except KeyError as err:
--> 704 raise KeyError(orig_key) from err
705
706 def _maybe_cast_for_get_loc(self, key) -> Timestamp:
KeyError: Timestamp('2019-12-03 23:59:58.142906')
```
@loicduffar @Chima-21 Do you use Windows or Linux? I encountered the issue myself in Windows using Miniconda (4.10.3) and using WSL or a native Ubuntu doesn't produce the issue. In WSL/Ubuntu it works with native Python and Miniconda.
The culprit is in the granularity calculation:

```python
if len(time_diff.unique()) == 1:  # check constant frequency
    freq = time_diff.unique()[0].astype("int")
    self.granularity_days = freq / (24 * 3600 * (10 ** 9))
else:
    raise RuntimeError(
        "Frequency of metrics is not constant."
        "Please check for missing or duplicate values"
    )
```
In WSL/Ubuntu I get a straight `1.0` for the example data. In Windows I get `-2.149413925925926e-05`. More precisely, the issue is in the `astype` call: `"int"` resolves to int32 on Windows, which overflows once the frequency is expressed in nanoseconds. Using `"int64"` instead of `"int"` solves the problem.
I would recommend, in this instance and in general, using explicit types instead of assuming that the default type is correct. `int` on Windows usually defaults to 32-bit while `float` usually defaults to 64-bit. By marking them explicitly as `int64` and `float64`, whether with numpy or pandas, you would avoid this issue completely and make the code a bit more robust. A quick search for `astype` in the repository shows that it's mostly implicit, so this kind of problem may occur in other places as well.
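A minimal sketch of the overflow described above. It simulates the Windows behavior (where `"int"` maps to 32-bit) with an explicit `np.int32` cast, so it reproduces the same wrong value on any platform:

```python
import numpy as np

# One day expressed in nanoseconds, as Kats computes the frequency.
# 86_400_000_000_000 is far above the int32 maximum (~2.1e9).
ns_per_day = 24 * 3600 * 10**9
freq = np.array([ns_per_day], dtype="int64")

# Explicit 64-bit cast: correct granularity of one day
print(freq.astype("int64")[0] / ns_per_day)   # 1.0

# 32-bit cast (what "int" resolves to on Windows): silent wraparound
print(freq.astype(np.int32)[0] / ns_per_day)  # -2.149413925925926e-05
```

The second print reproduces exactly the negative granularity reported in this thread, which is why the forecast timestamp lookup later fails with a `KeyError`.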
Many thanks. Issue resolved!!
@pbosch This may be related: when I run the command I get the same error. Is there a requirement that the dates column be a specific data type?
I have a pandas dataframe `multi_ts` with a 'time' column as a time series data object, created using `pandas.to_datetime(multi_ts['time'])`.
```python
from kats.models.var import VARModel, VARParams
from kats.detectors.outlier import MultivariateAnomalyDetector, MultivariateAnomalyDetectorType

params = VARParams(maxlags=2)
m = VARModel(multi_ts, params)
m.fit()
steps = 100

params = VARParams(maxlags=2)
d = MultivariateAnomalyDetector(multi_ts, params, training_days=60)
anomaly_score_df = d.detector()
d.plot()
```
```
/Documents/personal/resume/test.ipynb Cell 80' in <cell line: 10>()

File ~/miniconda3/lib/python3.8/site-packages/kats/detectors/outlier.py:191, in MultivariateAnomalyDetector.__init__(self, data, params, training_days, model_type)
    189     self.granularity_days: float = freq / (24 * 3600 * (10 ** 9))
    190 else:
--> 191     raise RuntimeError(
    192         "Frequency of metrics is not constant."
    193         "Please check for missing or duplicate values"
    194     )
    196 self.training_days = training_days
    197 self.detector_model = model_type

RuntimeError: Frequency of metrics is not constant. Please check for missing or duplicate values
```
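That `RuntimeError` fires whenever the time column has more than one distinct gap between consecutive rows. A quick diagnostic sketch (hypothetical data and column names, assuming a pandas DataFrame with a 'time' column) to find irregular gaps before handing the data to the detector:

```python
import pandas as pd

# Hypothetical example with one missing day (2019-01-03), to show what the check catches
times = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04", "2019-01-05"])
df = pd.DataFrame({"time": times, "value": [1.0, 2.0, 3.0, 4.0]})

# More than one distinct gap -> Kats will raise the RuntimeError
print(df["time"].diff().dropna().value_counts())

# After reindexing to a regular daily grid and interpolating, the check passes
df2 = df.set_index("time").asfreq("D").interpolate().reset_index()
print(df2["time"].diff().dropna().nunique())  # 1 -> constant frequency
```

Whether interpolation is appropriate depends on your data; the point is only that the detector requires a single constant frequency.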
Renaming the issue to the root cause.
Most likely, we should mark Kats as requiring 64-bit. I think we don't intend to support legacy 32-bit Python installations.
I had this issue on Windows but I do not have it on Linux (WSL).
I get the error below when I run the tutorial kats_202_detection.ipynb (https://github.com/facebookresearch/Kats/blob/master/tutorials/kats_202_detection.ipynb). Any clue?