Open NimaSarajpoor opened 2 years ago
Idea 4: As discussed in VALMOD paper (see #585), it is possible to find lower bound (LB) of distance between two subsequences of length m+1
according to the distance of same subsequences (i.e. with same start index) but with length m
. In MERLIN, the algorithm calculates some of pairwise distances throughout the process. It might be useful to find the LB of the distances for length m+1
for those subsequences, and it might help to speed up calculation. Note that VALMOD algorithm increases the computation time and the memory usage. Therefore, for long time series, using VALMOD might be a challenge; however, we can take advantage of its concept in MERLIN to find top-k
anomalies for a range of subsequence length [min_m, max_m]
NOTE:
A recently-proposed algorithm DAMP
(see #606) claims to find anomalies in time series with trillions of data point! This algorithm might be able to find top-k anomalies in a range of subsequence length
as well. So, it would be great to study DAMP
, and see if there is a need to dedicate one module of library just for anomaly detection.
Just for the record, I think a very important difference between finding motif vs finding anomaly is that to find motif, one needs to find the exact distance between subsequence and each of its neighbors. In anomaly detection, we may disregard a subsequence AS SOON AS we find out that one of its neighbors has a distance that is below a (data-driven) threshold.
In anomaly detection, we may disregard a subsequence AS SOON AS we find out that one of its neighbors has a distance that is below a (data-driven) threshold.
Do you have an example? What do you mean by "one of its neighbors"? Are you referring to when one element within the subsequence is beyond some threshold?
Are you referring to when one element within the subsequence is beyond some threshold?
No. I was referring to the distance between two subsequence. More explanation is provided below.
What do you mean by "one of its neighbors"?
In the time series T
(with length n
), we have n - m + 1
subsequences of length m
. Each subsequence has n - m
neighbours. We define discord as a subsequence that has the largest distance to its FIRST nearest neighbour.
In MERLIN, we define a threshold, which basically says THAT "largest distance" MUST BE greater than the threshold. (And, if we set threshold==np.inf
, we find no discord). So, let's say my threshold is 100, and I would like to make a decision about subsequence S_i
to see whether that subsequence can/might be a discord or NOT. As I explore its neighbours, I find one that has a distance of, say 80, to S_i
. At this moment, when the distance between S_i
and its neighbour is found to be less than the threshold 100, I know for sure that S_i
cannot be a discord. Because I know that the distance between S_i
and its nearEST neighbour of S_i
is definitely <= 80
which is definitely less than threshold. So, I do not need to explore any further to make a decision about S_i
. I know it cannot be discord (according to the user-defined threshold).
@seanlaw Please let me know if I need to clarify something.
Thanks @NimaSarajpoor! I get your point now. It's SO subtle!
MERLIN algorithm is for anomaly detection in long time series data
T
. It is initially proposed in paper MERLIN.In this issue, we would like to provide some ideas on how to improve MERLIN algorithm. If you are new to this topic, you may want to review issue #417 and the PR #505 first before reading the rest of this issue.
Idea 1 (already implemented): using approx. matrix profile (obtained by
stumpy.scrump._prescrump
) to fast reject subsequencesIdea 2: updating approx. matrix profile (
approx_P
) : the idea is to simply updateapprox_P
as we go through the selection process. Then, we can see that the values ofapprox_P
is exactly the same asprecise P
for the finalized candidates. So, we can search among them to find discords. If nearest neighbor index is required by the user, then, we need to find the nearest neighbor of selected discords. It may also help us to better updater
with help of updatedapprox_P
.Idea 3 (!!)
reject_checkpoint_right/left
concept: Let us assume a candidate at indexi
(i.e.T[i:i+m]
) is being compared to all of its right neighbors. And, let us assume this candidate is rejected when its distance to its neighbor at indexj
becomes lower than a thresholdr
. (r
is the minimum distance a discord should have with all of its neighbors) This means the subsequenceT[i:i+m]
has distance higher thanr
to all of its right neighbors before indexj
. However, we may fail in finding discords. So, we need to re-do the whole process by using a lower value forr
. Now, here is the idea: We do not need to re-compareT[i:i+m]
with its right neighbors located beforej
, because their distance will definitely be still higher than the reducedr
and thus their comparison is redundant. In other words, comparingT[i:i+m]
with subsequences located at indices beforej
do not result in any rejection. So, they cannot help us with rejectingT[i:i+m]
(and narrowing down our search space). So, we just need to compareT[i:i+m]
with subsequences located at index>=j
. We can keep updating these checkpoints as we lower ourr
(either because we failed in finding candidates or we want to discover more discords) So, we are using some information from our previous iteration to speed up our current iteration with updatedr
.