TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.63k stars 318 forks source link

Add `aampi` #221

Closed seanlaw closed 4 years ago

seanlaw commented 4 years ago

This should be fairly straightforward in theory using convolution.

We also need to add this to the API documentation

hmcoservit commented 4 years ago

@seanlaw Very interested in trying out aampi, and comparing it to STUMPI.

seanlaw commented 4 years ago

@hmcoservit Frankly, I am not so sure about aampi because I don't know of an equivalent mass function for unnormalized Euclidean distance. Without an unnormalized mass, the computation might be quite slow by doing it the classic sliding window way.

seanlaw commented 4 years ago

Okay, there is a way to do this using the same convolution trick as found in mass and the original author(s), Mueen, refers to this as mass_absolute:

m = 5
distance_profile = np.sqrt(np.sum(core.rolling_window(T*T, m), axis=1) - 2*core.sliding_dot_product(Q, T) + np.sum(Q*Q))

So, I think this should be relatively straightforward to implement by basing aampi off of stumpi and then adding a corresponding core.mass_absolute function

seanlaw commented 4 years ago

@hmcoservit aampi has been implemented and tests seem to pass and matches standard aamp output.

You can basically replace stumpi with aampi in whatever code you were testing out. Please clone the latest version from Github and let me know what you find and then I can push a new version to PyPI/Conda as soon as possible.

hmcoservit commented 4 years ago

@seanlaw Great thanks, Sure.

hmcoservit commented 4 years ago

@seanlaw Tested AAMPI vs STUMPI the matrix profile of AAMPI seems more interesting, here is the plots:

newplot - 2020-08-20T154719 431 newplot - 2020-08-20T155026 694

I will post more comprehensive results here soon

seanlaw commented 4 years ago

@hmcoservit Thanks for circling back. This certainly looks promising from my perspective and aampi seems to be "better" than stumpi for these kinds of anomalies. Is that what your interpretation is?

hmcoservit commented 4 years ago

@seanlaw certainly seems promising for our use case ;)

Perhaps one can also adapt and extend this by using an ensemble (mix) of AAMPI and STUMPI with different m parameters for different anomaly/outlier detection use cases. AAMPI can be more sensitive towards point anomalies whereas STUMPI can be more sensitive towards collective (pattern) anomalies.

seanlaw commented 4 years ago

I agree! Very cool

seanlaw commented 3 years ago

@hmcoservit I was wondering how things were going? I am curious whether aampi or stumpi has been helpful or useful? Have you found instances where they'd perform poorly?

hmcoservit commented 3 years ago

Hi @seanlaw, sorry I was going to update you sooner on this.

AAMPI definitely helped, with a small subsequence size it was the best performing method on our benchmark dataset compared to other methods that we selected (and of course much faster too, it is passing our scaling tests) --> with anomalies identified using the percentile approach.

Regarding example of instances when it can perform poorly using a small subsequence is for example an absence of a seasonal cycle like this time series or some cases of a changepoint where the variance of the data decreases. Of Couse there exists no perfect method and we don't expect a method to solve every problem.

STUMPI with a bigger subsequence does well in some cases but poorly on other cases, and it is difficult to tune the subsequence size. The problem was that Z-normalization masks the point anomalies (which is one of the most common type).

Currently I am thinking of an ensemble (voting system) combining STUMPI and AAMPI using different subsequence sizes. Let me know what you think ;)

seanlaw commented 3 years ago

AAMPI definitely helped, with a small subsequence size it was the best performing method on our benchmark dataset compared to other methods that we selected (and of course much faster too, it is passing our scaling tests) --> with anomalies identified using the percentile approach.

What percentile have you found to be most useful? 90th percentile?

Also, do you have a good way to account for periodicity? For example, let's say that you have a server that is collecting data every minute for 5 years and you know that there is always a spike at midnight on the first Monday of every month (i.e., there is a monthly backup that occurs at midnight on the first Monday of every month). Since this is a "known" event and is therefore not anomalous, do you have a good way to ignore these spikes in aampi without hardcoding the rule in? I was thinking that one might need to perform some data pre-processing first before using aampi but I would be curious to hear your thoughts as I have not surveyed the literature.

Currently I am thinking of an ensemble (voting system) combining STUMPI and AAMPI using different subsequence sizes. Let me know what you think ;)

I think this is likely the best approach

hmcoservit commented 3 years ago

@seanlaw The percentile can be adjusted based on the sampling of the data and the historical data available. For example, having 99% percentile for data sampled every minute and data sampled every 10 minutes would give different number of anomalies (alerts). --> therefore, might be a good idea to have different percentiles for different sampled time series.

To speed up the process, we re-sample the data to a minimum of 10 minutes. And adjusted the percentile starting from 99.7 to get the best precision and recall across the benchmark. The higher the percentile the higher the precision but lower recall (observation in practice).

Regarding handling periodicity it is a bit more challenging, we do not have a solution for that yet. In an example scenario that you have for example 2 weeks of history with a spike each midnight, (after the second day) AAMPI would have a nearest neighbor (of the spike subsequence) to the left, so it will probably not raise an alarm. But this is kind of naive, because if the position of the spike changes to for example 1AM or these is no spike at all at midnight, AAMPI would probably not raise an alarm. Hope I explained well, I am also interested in efficient ways of detecting seasonality automatically, but I had no luck finding a good approach.

hmcoservit commented 3 years ago

@seanlaw Makes me wonder if it is possible to detect some form of seasonality looking at the positions of the nearest neighbor indexes using matrix profiles.

To add to the seasonality part for detecting seasonality these are the options I am aware of

seanlaw commented 3 years ago

Perhaps, that may be somewhat related to time series chains?

hmcoservit commented 3 years ago

Hi @seanlaw , I hope you are well.

If you are still interested in periodicity (length) detection, I can recommend two papers:

  1. Puech, T., Boussard, M., D’Amato, A., & Millerand, G. (2019, September). A fully automated periodicity detection in time series.
  2. Wen, Q., He, K., Sun, L., Zhang, Y., Ke, M., & Xu, H. (2020). RobustPeriod: Time-frequency mining for robust multiple periodicities detection.

I have also had a go at time series chains to get an idea of periodicity. But it seems to me when subsequences are very similar to each other they do not form a chain (which makes sense according to the definition).

I will see if I can find another way to look at the positions of the nearest neighbor (right and left) using matrix profiles to get an idea of the period.

seanlaw commented 3 years ago

@hmcoservit I was wondering if you explicitly set the excl_zone? In the 1.9.0 release, we will be removing this parameter from the aampi/stumpi functions and, instead, it will move to a global variable in config.py. If you only use the default m/4 then you should see no difference

hmcoservit commented 3 years ago

@seanlaw Hi Sean, thanks for informing me. Until now we were just using the default value for the anomaly detection use case. However, I was planning to do a POC soon to explore positions of the nearest neighbor (right and left) and also analyze the MP to get an idea of the season/period. I think I would need to play with the excl_zone parameter, so thanks for letting me know.

One thing, I have a repo with version stumpy==1.5.0 , and when I set the excl_zone in stumpy.aampi(T_full, m, egress=True, excl_zone = zone_to_exclude) it seems that it is ignored. Do you have an idea why this can happen?

seanlaw commented 3 years ago

One thing, I have a repo with version stumpy==1.5.0 , and when I set the excl_zone in stumpy.aampi(T_full, m, egress=True, excl_zone = zone_to_exclude) it seems that it is ignored. Do you have an idea why this can happen?

What is the value of zone_to_exclude? How are you confirming that "it seems that it is ignored"? From what I can see in the code, it should not be ignored. Are you able to provide a reproducible example?

hmcoservit commented 3 years ago

@seanlaw

`
stream1 = stumpy.aampi(T_full, m, egress=True, excl_zone=1) stream2 = stumpy.aampi(T_full, m, egress=True) print(np.array_equal(stream1.leftP,stream2.leftP))

`

Seems like the first m/4 values are always inf (even in version 1.8.0 here). Maybe I am missing something or misunderstood the definition of excl_zone and trivial matches?

newplot - 2021-06-03T145304 887 newplot - 2021-06-03T145254 852

seanlaw commented 3 years ago

Okay, I think I understand the problem now (thank you!). So, before egress happens, internally, aampi calls aamp in order to generate the first matrix profile, which will be updated as you stream in additional data points. Unfortunately, this call to aamp does NOT recognize/honor the excl_zone parameter passed to aampi and, instead, is defaulting to m/4. The excl_zone is only being applied to the new subsequences being streamed in. In other words, this is a bug!

This is actually a great example of why it makes more sense to make the setting of the exclusion zone a global variable. This way, it is set once and it should propagate everywhere. So, I definitely recommend waiting for v1.9.0 where we would do something like:

import stumpy
from stumpy.config import STUMPY_EXCL_ZONE_DENOM 

m = 1008
STUMPY_EXCL_ZONE_DENOM = m  # This will be converted to `excl_zone = int(np.ceil(m / STUMPY_EXCL_ZONE_DENOM))`
stream1 = stumpy.aampi(T_full, m, egress=True)  # The exclusion zone = 1

And when you want to define no exclusion zone then you'd set STUMPY_EXCL_ZONE_DEOM = np.inf. Does that make sense? Your feedback would be super helpful here!

hmcoservit commented 3 years ago

@seanlaw Thanks, seems good to me. Any visibility on the date of the release?

seanlaw commented 3 years ago

We are still waiting for a few PRs to be completed but really hoping for mid to late June (how is it June already?!). Fingers crossed

hmcoservit commented 3 years ago

Hi @seanlaw , any updates on this :) ?

seanlaw commented 3 years ago

@hmcoservit So this functionality has already been implemented and you can test this in the latest dev version if you clone/install from source. We are currently waiting for this PR to be merged and we should be good to go! I recommend that you test the exclusion zone functionality first to make sure it accomplishes what you need before it goes live and provide any feedback.

hmcoservit commented 3 years ago

Thanks @seanlaw , I will check it out asap.

seanlaw commented 3 years ago

@hmcoservit Just an FYI that the new version was released today!

hmcoservit commented 3 years ago

Hi @seanlaw thanks for informing me, sorry that I was too occupied to check it out before the release (holidays here). I will share my feedback as soon as I get the chance ;)

hmcoservit commented 3 years ago

@seanlaw Hi Sean, I got the new update and thanks for that.

I now have the functionality that I need with:

from stumpy import core, config
config.STUMPY_EXCL_ZONE_DENOM = np.inf  
stream = stumpy.stumpi(T_full, m, egress=True, normalize=False)
seanlaw commented 3 years ago

Hopefully, that feels pretty natural? Since it is exceptionally rare for users to have to change the exclusion zone, we thought that it is "better" to set the exclusion zone globally rather than than at the function level.

hmcoservit commented 3 years ago

@seanlaw Good decision, I think its okay. I will continue to run tests around my use case, I will keep you posted of any interesting findings.