TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.17k stars 292 forks source link

Improving MERLIN: ideas for future #584

Open NimaSarajpoor opened 2 years ago

NimaSarajpoor commented 2 years ago

MERLIN algorithm is for anomaly detection in long time series data T. It is initially proposed in paper MERLIN.

In this issue, we would like to provide some ideas on how to improve MERLIN algorithm. If you are new to this topic, you may want to review issue #417 and the PR #505 first before reading the rest of this issue.

NimaSarajpoor commented 2 years ago

Idea 4: As discussed in VALMOD paper (see #585), it is possible to find lower bound (LB) of distance between two subsequences of length m+1 according to the distance of same subsequences (i.e. with same start index) but with length m. In MERLIN, the algorithm calculates some of pairwise distances throughout the process. It might be useful to find the LB of the distances for length m+1 for those subsequences, and it might help to speed up calculation. Note that VALMOD algorithm increases the computation time and the memory usage. Therefore, for long time series, using VALMOD might be a challenge; however, we can take advantage of its concept in MERLIN to find top-k anomalies for a range of subsequence length [min_m, max_m]


NOTE: A recently-proposed algorithm DAMP (see #606) claims to find anomalies in time series with trillions of data point! This algorithm might be able to find top-k anomalies in a range of subsequence length as well. So, it would be great to study DAMP , and see if there is a need to dedicate one module of library just for anomaly detection.

Just for the record, I think a very important difference between finding motif vs finding anomaly is that to find motif, one needs to find the exact distance between subsequence and each of its neighbors. In anomaly detection, we may disregard a subsequence AS SOON AS we find out that one of its neighbors has a distance that is below a (data-driven) threshold.

seanlaw commented 1 year ago

In anomaly detection, we may disregard a subsequence AS SOON AS we find out that one of its neighbors has a distance that is below a (data-driven) threshold.

Do you have an example? What do you mean by "one of its neighbors"? Are you referring to when one element within the subsequence is beyond some threshold?

NimaSarajpoor commented 1 year ago

Are you referring to when one element within the subsequence is beyond some threshold?

No. I was referring to the distance between two subsequence. More explanation is provided below.

What do you mean by "one of its neighbors"?

In the time series T (with length n), we have n - m + 1 subsequences of length m. Each subsequence has n - m neighbours. We define discord as a subsequence that has the largest distance to its FIRST nearest neighbour.

In MERLIN, we define a threshold, which basically says THAT "largest distance" MUST BE greater than the threshold. (And, if we set threshold==np.inf, we find no discord). So, let's say my threshold is 100, and I would like to make a decision about subsequence S_i to see whether that subsequence can/might be a discord or NOT. As I explore its neighbours, I find one that has a distance of, say 80, to S_i. At this moment, when the distance between S_i and its neighbour is found to be less than the threshold 100, I know for sure that S_i cannot be a discord. Because I know that the distance between S_i and its nearEST neighbour of S_i is definitely <= 80 which is definitely less than threshold. So, I do not need to explore any further to make a decision about S_i. I know it cannot be discord (according to the user-defined threshold).

@seanlaw Please let me know if I need to clarify something.

seanlaw commented 1 year ago

Thanks @NimaSarajpoor! I get your point now. It's SO subtle!