arundo / adtk

A Python toolkit for rule-based/unsupervised anomaly detection in time series
https://adtk.readthedocs.io
Mozilla Public License 2.0

Is fit_detect thread-safe? Seeing problems with returned anomalies #127

Open joriws opened 3 years ago

joriws commented 3 years ago

I fetch multiple time series into a pandas DataFrame, run validate_data on it, and feed it to PcaAD. Serial, single-threaded execution worked fine, but after converting to parallel execution on 3 threads, the returned anomalies vary randomly, and passing them to to_events leads to a TypeError. The plotted data shows a normal pattern and anomaly=anomalies plots normally, but the to_events loop does not complete. Which to_events call fails differs between runs: in the first run dataset 3 fails, in the next run perhaps dataset 2 fails and 3 is fine, and in a third run dataset 2 may work while 1 and 3 do not.

I also tried threading.local(), but it changes nothing. Without threading I did not observe this behaviour.

The type of the returned anomalies is the same in all cases: <class 'pandas.core.series.Series'>

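import logging

from adtk.data import to_events
from adtk.detector import PcaAD
from adtk.visualization import plot

# k, c, pdata and axis are defined elsewhere in the script
pca_ad = PcaAD(k=k, c=c)
anomalies = pca_ad.fit_detect(pdata)
plot(pdata, anomaly=anomalies, ts_linewidth=1, ts_markersize=2, anomaly_color='red', anomaly_alpha=0.3, curve_group='all', axes=axis)
try:
    for startano, endano in to_events(anomalies):
        ...
except TypeError:
    logging.error("cannot expand to_events\n{}".format(to_events(anomalies)))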

In the logging.error output, the failing results for some reason have no "freq" parameter on their timestamps, whereas anomaly data that has freq works well. The failing results also contain single timestamps rather than time ranges.

Non-working

[Timestamp('2021-04-18 08:35:00+0000', tz='UTC'), Timestamp('2021-04-18 10:30:00+0000', tz='UTC'), Timestamp('2021-04-18 10:35:00+0000', tz='UTC'), Timestamp('2021-04-18 13:25:00+0000', tz='UTC'), 

Working structure.

[(Timestamp('2021-04-18 22:00:00+0000', tz='UTC', freq='5T'), Timestamp('2021-04-18 22:04:59.999999999+0000', tz='UTC', freq='5T')), (Timestamp('2021-04-18 22:25:00+0000', tz='UTC', freq='5T'), Timestamp('2021-04-18 22:34:59.999999999+0000', tz='UTC', freq='5T')), (Timestamp('2021-04-18 22:40:00+0000', tz='UTC', freq='5T'), Timestamp('2021-04-18 22:49:59.999999999+0000', tz='UTC', freq='5T')
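
The two outputs above differ in shape: the failing list holds bare timestamps, while the working one holds (start, end) tuples. Unpacking a bare timestamp into startano, endano is presumably what raises the TypeError. Below is a minimal sketch of a loop that tolerates both shapes. normalize_events is a hypothetical helper, not part of adtk, and it does not explain why freq is missing in the first place.

```python
def normalize_events(events):
    """Turn a mixed list of points and (start, end) tuples into tuples only.

    Hypothetical helper: a bare timestamp is treated as a zero-length
    event whose start and end are the same instant.
    """
    normalized = []
    for event in events:
        if isinstance(event, tuple):
            start, end = event
        else:
            start = end = event
        normalized.append((start, end))
    return normalized
```

With this in place, the loop from the report could be written as `for startano, endano in normalize_events(to_events(anomalies)):` and would no longer depend on which shape comes back.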
joriws commented 3 years ago

pip show adtk
Name: adtk
Version: 0.6.2
Summary: A package for unsupervised time series anomaly detection
Home-page: https://github.com/arundo/adtk
Author: Arundo Analytics, Inc.
Author-email: None
License: Mozilla Public License 2.0 (MPL 2.0)
Location: c:\users\guest\appdata\local\packages\pythonsoftwarefoundation.python.3.9_qbz5n2kfra8p0\localcache\local-packages\python39\site-packages
Requires: numpy, scikit-learn, packaging, pandas, tabulate, matplotlib, statsmodels
Required-by:

joriws commented 3 years ago

I also tested with the following call; the outcome did not change:

for startano,endano in to_events(anomalies, freq_as_period=True, merge_consecutive=True):
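
to_events is documented to treat each point as a time span only when the time index has a regular frequency. One thing worth checking is therefore whether the index of a failing series has lost its frequency, for example because a fetched dataset has a missing or duplicated row. This is an assumption, not something confirmed in this thread. Below is a small diagnostic sketch using pandas only, with synthetic data and a hypothetical helper name.

```python
import pandas as pd


def describe_index_frequency(series):
    """Return (declared, inferred) frequency of a series' DatetimeIndex."""
    index = series.index
    declared = index.freq  # None when no frequency is attached to the index
    # infer_freq needs at least 3 timestamps; it returns None for irregular spacing
    inferred = pd.infer_freq(index) if len(index) >= 3 else None
    return declared, inferred


# Synthetic data, not taken from the report: a regular 5-minute index ...
regular = pd.Series(
    0.0, index=pd.date_range("2021-04-18", periods=12, freq="5min", tz="UTC")
)
# ... and the same series with one row missing, which breaks the regularity.
gappy = regular.drop(regular.index[4])
```

Calling describe_index_frequency(anomalies) in each worker before to_events would show whether the failing datasets are the ones without a frequency.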
earthgecko commented 3 years ago

Hi @joriws

From one user to another: this is probably not an adtk issue. The underlying pandas library is itself not thread-safe. https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#thread-safety
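
The pandas page linked above names copy() on objects shared between threads as the known problem and recommends holding a lock while copying. Below is a minimal sketch of that pattern, assuming each task can work on its own private copy of its input. run_isolated and detect are hypothetical names, and copy.deepcopy stands in for DataFrame.copy() so that the sketch runs without pandas.

```python
import copy
import threading
from concurrent.futures import ThreadPoolExecutor

# Guards the copying of shared inputs, as the pandas docs recommend.
_copy_lock = threading.Lock()


def run_isolated(shared_inputs, detect, max_workers=3):
    """Run detect() on a private copy of each input, one task per input.

    `detect` is a hypothetical callable; in the scenario of this thread it
    would create a new PcaAD and call fit_detect on the copy it receives.
    """

    def task(shared):
        with _copy_lock:
            private = copy.deepcopy(shared)  # for a DataFrame: shared.copy()
        return detect(private)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as shared_inputs
        return list(pool.map(task, shared_inputs))
```

Creating the detector inside detect, rather than sharing one instance across threads, keeps any fitted state local to the task. Whether this removes the behaviour reported here is untested.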

I hope this helps in the future.