TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.63k stars 317 forks source link

Add Matrix Profile XXI MERLIN Algorithm for Discord Detection #417

Open seanlaw opened 3 years ago

seanlaw commented 3 years ago

Discussed in https://github.com/TDAmeritrade/stumpy/discussions/415

Originally posted by **udf2457** June 19, 2021 I was reading about MERLIN which was presented at ICDM 2020: - Matrix Profile XXI: MERLIN: Parameter-Free Discovery of Arbitrary Length Anomalies in Massive Time Series Archives. Takaaki Nakamura, Makoto Imamura, Ryan Mercer and Eamonn Keogh. ICDM 2020 [[pdf](http://www.cs.ucr.edu/~eamonn/MERLIN_Long_version_for_website.pdf),[author's Matlab reference implementation]( https://sites.google.com/view/merlin-find-anomalies)] It would be really neat to see MERLIN implemented in `stumpy`
NimaSarajpoor commented 3 years ago

@seanlaw @udf2457

I tried to implement the Table I and Table II of the paper, referred to as DRAG algorithm by the authors. The DRAG algorithm will be used later in the MERLIN algorithm presented in Table III.

There are some confusing parts about the algorithm. I tried to check out both the paper and the MATLAB code. I am going to share my understanding with you:

In contrast to what presented in the Table I of the paper, the MATLAB code has a break in the for loop (please see below):

image

It changes the output of the Table I algorithm, known as the first phase of DRAG. However, the output of the Table II algorithm, the second phase of DRAG, will remain unchanged (at least for the top-10 discords )

============================================================

In Table II, there is dj variable, but no definition is provided for this variable. So, I took a look at the MATLAB code and I realized that dj is the distance of a discord to its Nearest Neighbor. And, it should be stored and updated as we move from one subsequence to another through the outer loop.

To keep the same structure for the loops as presented in Table II, I did a similar thing. The code is not efficient as I tried to get the result first and then refine the code. I return both the ndarray that contains discord info and also the indices of the discords that should be removed.

===========================================================

I submitted PR #418.

@udf2457: If you are planning to work on this issue, I can stop and continue on my other projects.

udf2457 commented 3 years ago

@Ninimama thank you for your efforts so far, very much appreciated. I will take a look through your post and comment if I think I have any insight to offer.

In the mean time, in relation to your final comment, I am merely here as an observer/commenter. As per my comment to @seanlaw in the discussions section:

Have a fair bit on at $work at the moment, so can't make any promises, but if I get the opportunity I'll certainly try to drop by now and then.

Apologies I can't be more hands-on.

udf2457 commented 3 years ago

@Ninimama At a glance I'm uncertain as to why that break needs to be there. From the verbose description in the paper:

However, if any items in the set C are less than r from the subsequence under consideration, we know that they could not be discords, thus they are admissibly pruned from the set.

So the test should affect the candidate not the loop in general.

It also looks a little strange to break out of a loop when the loop iteration ends on the next line(s) anyway ? (Unless I'm mis-interpreting the matlab code)

seanlaw commented 3 years ago

@Ninimama Considering that it is the same primary authors as the geometric chains paper, were you able to reproduce any of the figures from the paper?

Is the break correct/necessary and it was simply missed in the paper?

udf2457 commented 3 years ago

Doing a bit of digging around it looks like there may be two versions of the code on the author's website.

On the "Documentation" page under "Code Link" is the version that I referred to originally and that @Ninimama is referencing in this discussion.

However if you download the zip under "Introduction Link" , there are Matlab files in there with a timestamp of 6 August 2020.

Unfortunately Google Docs don't seem to provide a way to determine the timestamp of the standalone code file.

I'm thinking perhaps the standalone code was proof-of-concept and the code in the zip is a later version ?

(N.B. I'm just "thinking aloud here", I'm still trying to figure what the difference is as the code looks like its substantially different).

NimaSarajpoor commented 3 years ago

@seanlaw @udf2457 I tried the code with and without break.

If we don't use break, the number of discords (before being pruned in the second phase of the algorithm) is higher(!) compared to the number of discords discovered when we use break in the loop.

However, both cases lead to the same number of discords after being refined via the second phase of the algorithm presented in Table II. Also, I checked out the top-10 discords, and they are the same.

Since there is no break in the loops of the second phase of the algorithm, any discord that was chosen by mistake will be removed. So, it seems there should be no problem.

=================================================

were you able to reproduce any of the figures from the paper?

Not yet. I will try and will push a commit if I can reproduce at least one of the figures (for now).

=================================================

However if you download the zip under "Introduction Link" , there are Matlab files in there with a timestamp of 6 August 2020.

Didn't notice that. We can check both files if we get into trouble later during the implementation.

=================================================

To sum up, I think things should be fine. I will go and see if I can implement Table III (MERLIN algorithm) and reproduce one of the figures.

NimaSarajpoor commented 3 years ago

@seanlaw:

Apparently, the authors didn't use Matrix Profile in their algorithms. In fact, if you read the first paragraph of the section II-C of the paper, it says they are not going to use Matrix Profile.

However, on the Eamon's webpage, it shows Matrix Profile XXI-MERLIN (which might mean the authors used matrix profile. Right?). Am I missing something here?

If not, are we planning to (somehow?) use Matrix Profile?

UPDATE:

I read the discussion #415. So, are we going to implement the MERLIN without using MP directly?

seanlaw commented 3 years ago

@seanlaw:

Apparently, the authors didn't use Matrix Profile in their algorithms. In fact, if you read the first paragraph of the section II-C of the paper, it says they are not going to use Matrix Profile.

However, on the Eamon's webpage, it shows Matrix Profile XXI-MERLIN (which might mean the authors used matrix profile. Right?). Am I missing something here?

If not, are we planning to (somehow?) use Matrix Profile?

Right, it does not use matrix profiles at all. It is somewhat related to some of the goals of STUMPY but, coincidentally, MERLIN would be an anomaly if it were to be added to STUMPY

udf2457 commented 3 years ago

@seanlaw: Right, it does not use matrix profiles at all. It is somewhat related to some of the goals of STUMPY but, coincidentally, MERLIN would be an anomaly if it were to be added to STUMPY

I would reason it as being a useful addition to the STUMPY toolbox because if all someone is using STUMPY for is looking for discords, then it saves them having to go elsewhere or add yet-another-library to their stack.

Then, as/when/if "true" matrix profile based algo comes out that is as efficient for discords as MERLIN then you can at that future point in time of course consider dropping MERLIN in favour of keeping to the matrix profile system overall.

udf2457 commented 3 years ago

@Ninimama

FYI My mind kept occasionally wondering back to the break and so in a moment of madness I decided to try to reach out to the paper's authors.

I received a surprisingly quick reply. Unfortunately the person who wrote the reference Matlab code is long gone (graduated and might not have a timely means of contact) however the summary of the break logic provided by the co-author was as follows:

Here is my understanding: I believe line 6 of Phase II in the paper corresponds to lines 258-263 in the matlab code. LIne 6 in the pseudo-code summarized a step by assuming the reader can provide a distance function for the two subsequences.

Lines 258-263 in the matlab code implement the distance measure. The distance dist2 for the current candidate subsequence to another candidate is calculated element by element (for L elements). A discord is defined as the subsequence with the largest nearest neighbor distance. If we have a best-so-far nearest neighbor distance nndist2, then we are not interested in continuing to measure dist2 if it is greater than nndist2 since it will not end up being nearer than nndist2. This is why the calculation is terminated early with the break.

I see that there is also a break on line 202 within Phase I. I looks like the break was left out in order to simplify the pseudo code. The break will not affect the result, but it terminates the loop once the candidate is ruled out.

Hat-tip to the team at Department of Computer Science and Engineering, University of California Riverside

NimaSarajpoor commented 3 years ago

Here is my understanding: I believe line 6 of Phase II in the paper corresponds to lines 258-263 in the matlab code. LIne 6 in the pseudo-code summarized a step by assuming the reader can provide a distance function for the two subsequences.

Lines 258-263 in the matlab code implement the distance measure. The distance dist2 for the current candidate subsequence to another candidate is calculated element by element (for L elements). A discord is defined as the subsequence with the largest nearest neighbor distance. If we have a best-so-far nearest neighbor distance nndist2, then we are not interested in continuing to measure dist2 if it is greater than nndist2 since it will not end up being nearer than nndist2. This is why the calculation is terminated early with the brea

Yes, that break in the MATLAB code of the phase 2 is okay as it tries to stop calculating the distance between two subsequences when the distance becomes bigger than the NN distance.

I looks like the break was left out in order to simplify the pseudo code. The break will not affect the result, but it terminates the loop once the candidate is ruled out.

I don't know what the "result" is. If they mean the outcomes of the phase-II (i.e. the final outcome of the DRAG algorithm), then they are correct. However, it does make a difference in the outcomes of the phase -I (and I think that doesn't matter). If the author meant there is no difference in the outcome of the phase-I, I have to go and checkout my code again to see if I miss anything or not. I will try to reproduce a figure and if the result was not good, I will go and check out my code again.

Thanks for your effort and follow up. Appreciate it.

NimaSarajpoor commented 2 years ago

@seanlaw,

I am about to resume my work on MERLIN. If I remember correctly, we got the result but it was too slow as we used the paper's implementation rather than the updated version of the algorithm available on the supporting page.

If you have any update or suggestion regarding this, please let me know.

seanlaw commented 2 years ago

@seanlaw,

I am about to resume my work on MERLIN. If I remember correctly, we got the result but it was too slow as we used the paper's implementation rather than the updated version of the algorithm available on the supporting page.

If you have any update or suggestion regarding this, please let me know.

Yes, after reviewing your original Merlin PR, I think we concluded with:

If MERLIN3 is essentially producing the same results but it is drastically faster than the pseudocode in the paper then we should try to implement MERLIN3

To reiterate, it would be best to reproduce the results from the paper with MERLIN3 and be able to compare it with their implementation. Am I missing anything? @udf2457 Do you have anything to add?

udf2457 commented 2 years ago

@seanlaw Sorry for the delay.

I don't think you've missed anything and I have not really got anything useful to add to your comment. I would agree with @Ninimama that "same results drastically faster" would be the preferable route to go, but mostly goes without saying.