TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.16k stars 290 forks source link

Incorrect Matrix Profiles for Constant Subsequences with Repeating Values #110

Closed jules-samaran closed 4 years ago

jules-samaran commented 4 years ago

Dear authors, thank you for your work on this project and for making it available.

I recently started to try using stumpy for a project that involves time series data where I'd like to compute matrix profiles. Inside my time series, sub-sequences with constant value of length as large as the window size are very frequent. I understand that you use z-normalized euclidean distance between sub-sequences hence having constant sub-sequences could be a problem and that's what I checked when I ran the following code:

m = 5
ano = np.array([0., 0., 0., 0., 0., 1.])
compare = np.concatenate((np.zeros(4*m), np.ones(m)))
stumpy.stump(T_A=compare, m=m, T_B=ano, ignore_trivial=False)
Out[37]: array([[3.1622776601683795, 0, -1, -1], [0.0, 16, -1, -1]], dtype=object)

As you can see the first entry of the output doesn't make much sense, there exists a sub-sequence in T_A rigorously identical to the first query sub-sequence in T_B but the output is not 0. No warning or error message is printed to warn the user of this behavior. For the second sub-sequence of T_B though the result is correct since it is not a constant sub-sequence. Do you have any recommendations in order to circumvent this problem since as I mentioned earlier my data is full of constant sub_sequences?

seanlaw commented 4 years ago

@jules-samaran Thank you for your question and for the nice reproducible test case. I've taken a quick look and something certainly seems strange. Give me a little time to investigate. It may be a bug.

seanlaw commented 4 years ago

Ahh, okay. I see what you mean by "constant subsequence" (i.e., all values within the subsequence are identical) and, indeed, it looks like the problem is (as you already pointed out) related to z-normalized Euclidean distance. For others who may read this (and a future note to myself), it's not possible to compute the z-norm for a constant sequence since the standard deviation would be zero and so you'd end up with something like a divide by zero situation but, in this case, no error is thrown due to the way that the z-norm is computed in STUMPY. I'm gonna guess that in those situations where there's divide by zero, STUMPY sets the denominator to 10^-10 in order to avoid a divide by zero and so the value of the z-norm Euclidean distance (between any two pairs of subsequences being compared) is at or near inf. Off the top of my head, I'm struggling with whether or not we can even catch this scenario at runtime and to warn the user. Perhaps, knowing the window size, one could do a quick scan across each time series to look for constant subsequences but I'll have to think about this more. This is tricky!

In your case, are you actually interested in being warned and then having STUMPY fail to return a matrix profile? Or, are you hoping that STUMPY (or some post processing) could give you the matrix profile but with constant subsequences corrected in the output?

Hopefully, I'm understanding your question correct but feel free to correct my understanding.

mihailescum commented 4 years ago

I think a better approach would be setting the denominator standard deviation to 1 when the standard deviation is (almost) zero. This way, constant subsequences only get shifted, but not rescaled.

Edit: Changing the denominator would not be the correct way to prevent rescaling, but resetting the standard deviations that are close to 0 should, see my answer below

seanlaw commented 4 years ago

@mexxexx Thanks for chiming in. I'll have to take a closer look but please note that, for reasons of computational efficiency, the z-normalized Euclidean distance is computed according to Equation 1 in this paper. I don't know how your suggestion might alter the answer and it may be outside the scope of STUMPY to handle.

seanlaw commented 4 years ago

@jules-samaran Do you know, apriori, if the constant subsequence always has the same value repeating throughout the time series?

mihailescum commented 4 years ago

@seanlaw Hi, yes you are right, just altering the denominator as I suggested will not work. But I think there still is an easy fix:

The equation you mentioned can be derived like here: https://stats.stackexchange.com/a/285125/262705 Suppose that one of the sequences, let's say u, (I'll stick to the notation used in that answer) is constant. Then I would suggest not dividing by the standard deviation but only subtracting the mean. But as the sequence is constant, we have u-1mu = 0. Therefore the distance squared between the z-normalized v and shifted u is equal to m. If both sub sequences were constant, their distance would be 0. If one of the subsequences is almost constant, I think the same reasoning makes sense. Implementing this should be pretty efficient using something along the lines of

D_squared[np.isclose(σ_Q, 0)] = m
D_squared[np.isclose(Σ_T, 0)] = m 
D_squared[np.isclose(Σ_T, 0) & np.isclose(σ_Q, 0)] = 0

as a post-processing step (probably this can be done better).

I can't answer the question you asked @jules-samaran, but I am also dealing with the problem of constant subsequences, and they can have different values repeating throughout the series.

seanlaw commented 4 years ago

@mexxexx Talking through a post-processing step, if we knew where all of the constant subsequences were (even if they had different constant values) along with whether or not a repeat of that sequence existed, then would it be sufficient to go back to the matrix profile and zero out those values and update the matrix profile index?

In the case where a constant subsequence is repeated multiple times within the time series, then the matrix profile index would point to the first possible instance of the repeat (save for the first time where the matrix profile index would point to the second instance of a repeat).

Note to self: We should also handle self-joins (i.e., exclusion zones) in addition to AB-joins.

mihailescum commented 4 years ago

I don't think that is enough, because there still might be the case that a non constant subsequence should have its closest match with a constant subsequence (e.g. some small noise). Right now, this case is not captured, as by dividing by zero all comparisons to constant subsequences are invalidated.

seanlaw commented 4 years ago

@mexxexx Ahhh, gotcha. That makes sense.

seanlaw commented 4 years ago

@mexxexx Would it be possible to find the constant sequences and add a very, very small amount of random noise before computing the matrix profile?

seanlaw commented 4 years ago

Would it be possible to find the constant sequences and add a very, very small amount of random noise before computing the matrix profile?

It seems like this doesn't really helps since, after you add the noise, each subsequence is still z-normalized and so this could still result in distances (comparing different subsequences) that are not so close to zero

jules-samaran commented 4 years ago

Thank for your inputs @seanlaw, your guess was correct: when the estimated standard deviation of the sub-sequence is 0 stumpy replaces it with 1e-10 to avoid dividing by zero. Yes in my situation my time series can have different constant sub-sequences whose unique values are not the same. The problem I have with centering the constant sub-sequences is that it gives a distance of zero for two constant sub-sequences with different values so I was wondering about this z normalization: is it just necessary to make the computations faster or is the motivation behind it to get rid of the scaling and shifting factor and just to compare sub-sequences based on their patterns/shapes?

seanlaw commented 4 years ago

@jules-samaran That is correct. The application of the z-normalization is in place because it simplifies the math and significantly improves the computational efficiency. This is why other types of distance measures would not perform as well. I'm not a time series expert but, usually, with other time series analysis methods, the data is required to be stationary. Matrix Profiles overcome this constraint by performing a local normalization rather than a global normalization. At the end of the day, matrix profile is trying to compare the shape of subsequences. As you've identified, two constant subsequences with different constant values should be considered "the same" in the spirit of a matrix profile. However, this is not entirely true since a constant subsequence would also have a zero standard deviation.

seanlaw commented 4 years ago

@jules-samaran May I ask what exactly it is that you are trying to accomplish in terms of analyzing your time series? I wonder if matrix profile might not be the best approach but that there may be a different/simpler solution that could address your problem

seanlaw commented 4 years ago

@mexxexx I have been reading more and I believe that your approach is probably correct (my apologies for misunderstanding your point initially). I will need to work through the math again, incorporate the change across various functions, and write some unit tests to make sure that we cover as many cases as possible (at least three).

@jules-samaran Do you think @mexxexx solution solves your problem?

seanlaw commented 4 years ago

@mexxexx and @jules-samaran The latest version of the source code on Github has the fix implemented as described by @mexxexx above. If you are able to, would you mind cloning the code and installing it to see if it solves your use cases? If so, then I can push out a new release to PyPI and conda-forge thereafter. If you are not comfortable with doing this then let me know and I can update the version on PyPI/conda-forge to the latest some time over the next couple of days.

Thanks in advance!

jules-samaran commented 4 years ago

@seanlaw I am looking at an anomaly detection problem therefore comparing the shape of sub-sequences is important but also the absolute value of a sub-sequence can be important and this information is lost when using matrix profile. However this is normal because if you want your comparison method to focus mainly on the shapes it makes sense to normalize and lose other information. This isn't your method's problem, it does what it's meant to do. It doesn't mean that I can't use it though and @mexxexx's idea is a good improvement. I'll clone and test in the next couple days and tell you how it's working for me. Thanks for the help !

seanlaw commented 4 years ago

@jules-samaran Awesome! I look forward to your update and hope that it serves your needs.

mihailescum commented 4 years ago

@seanlaw Thank you very much for your very quick responses! I have to admit that at the moment I'm not using stumpy, because I need the Scrimp++ algorithm described in the Matrix Profile XI paper. But I'll give it a try in the next couple of days also

jules-samaran commented 4 years ago

@seanlaw I've just cloned the source code and run it on the same examples that failed before and the results are normal now, thanks!