Improving MERLIN: ideas for future

NimaSarajpoor commented 2 years ago

MERLIN algorithm is for anomaly detection in long time series data T. It is initially proposed in paper MERLIN.

In this issue, we would like to provide some ideas on how to improve MERLIN algorithm. If you are new to this topic, you may want to review issue #417 and the PR #505 first before reading the rest of this issue.

Idea 1 (already implemented): using approx. matrix profile (obtained by stumpy.scrump._prescrump) to fast reject subsequences
Idea 2: updating approx. matrix profile (approx_P) : the idea is to simply update approx_P as we go through the selection process. Then, we can see that the values of approx_P is exactly the same as precise P for the finalized candidates. So, we can search among them to find discords. If nearest neighbor index is required by the user, then, we need to find the nearest neighbor of selected discords. It may also help us to better update r with help of updated approx_P.
Idea 3 (!!) reject_checkpoint_right/left concept: Let us assume a candidate at index i (i.e. T[i:i+m]) is being compared to all of its right neighbors. And, let us assume this candidate is rejected when its distance to its neighbor at index j becomes lower than a threshold r. (r is the minimum distance a discord should have with all of its neighbors) This means the subsequence T[i:i+m] has distance higher than r to all of its right neighbors before index j. However, we may fail in finding discords. So, we need to re-do the whole process by using a lower value for r. Now, here is the idea: We do not need to re-compare T[i:i+m] with its right neighbors located before j, because their distance will definitely be still higher than the reduced r and thus their comparison is redundant. In other words, comparing T[i:i+m] with subsequences located at indices before j do not result in any rejection. So, they cannot help us with rejecting T[i:i+m] (and narrowing down our search space). So, we just need to compare T[i:i+m] with subsequences located at index >=j. We can keep updating these checkpoints as we lower our r (either because we failed in finding candidates or we want to discover more discords) So, we are using some information from our previous iteration to speed up our current iteration with updated r.

NimaSarajpoor commented 2 years ago

Idea 4: As discussed in VALMOD paper (see #585), it is possible to find lower bound (LB) of distance between two subsequences of length m+1 according to the distance of same subsequences (i.e. with same start index) but with length m. In MERLIN, the algorithm calculates some of pairwise distances throughout the process. It might be useful to find the LB of the distances for length m+1 for those subsequences, and it might help to speed up calculation. Note that VALMOD algorithm increases the computation time and the memory usage. Therefore, for long time series, using VALMOD might be a challenge; however, we can take advantage of its concept in MERLIN to find top-k anomalies for a range of subsequence length [min_m, max_m]

NOTE: A recently-proposed algorithm DAMP (see #606) claims to find anomalies in time series with trillions of data point! This algorithm might be able to find top-k anomalies in a range of subsequence length as well. So, it would be great to study DAMP , and see if there is a need to dedicate one module of library just for anomaly detection.

Just for the record, I think a very important difference between finding motif vs finding anomaly is that to find motif, one needs to find the exact distance between subsequence and each of its neighbors. In anomaly detection, we may disregard a subsequence AS SOON AS we find out that one of its neighbors has a distance that is below a (data-driven) threshold.

seanlaw commented 1 year ago

In anomaly detection, we may disregard a subsequence AS SOON AS we find out that one of its neighbors has a distance that is below a (data-driven) threshold.

Do you have an example? What do you mean by "one of its neighbors"? Are you referring to when one element within the subsequence is beyond some threshold?

NimaSarajpoor commented 1 year ago

Are you referring to when one element within the subsequence is beyond some threshold?

No. I was referring to the distance between two subsequence. More explanation is provided below.

What do you mean by "one of its neighbors"?

In the time series T (with length n), we have n - m + 1 subsequences of length m. Each subsequence has n - m neighbours. We define discord as a subsequence that has the largest distance to its FIRST nearest neighbour.

In MERLIN, we define a threshold, which basically says THAT "largest distance" MUST BE greater than the threshold. (And, if we set threshold==np.inf, we find no discord). So, let's say my threshold is 100, and I would like to make a decision about subsequence S_i to see whether that subsequence can/might be a discord or NOT. As I explore its neighbours, I find one that has a distance of, say 80, to S_i. At this moment, when the distance between S_i and its neighbour is found to be less than the threshold 100, I know for sure that S_i cannot be a discord. Because I know that the distance between S_i and its nearEST neighbour of S_i is definitely <= 80 which is definitely less than threshold. So, I do not need to explore any further to make a decision about S_i. I know it cannot be discord (according to the user-defined threshold).

@seanlaw Please let me know if I need to clarify something.

seanlaw commented 1 year ago

Thanks @NimaSarajpoor! I get your point now. It's SO subtle!

TDAmeritrade / stumpy

Improving MERLIN: ideas for future #584