seanlaw opened this issue 4 years ago
How would you expose this feature in the API?

I played around with the paper and have a working version, but I'm unsure of the best way to implement it. My guess is to extend `stump` with a new parameter called `noise_stddev` that could be either a float or an array with the length of the matrix profile, indicating the noise standard deviation for every subsequence. What do you think about this?

And furthermore, how would you implement the unit tests?
@mexxexx I've only skimmed the paper so please understand that I haven't put a lot of thought into this yet and that your ideas are likely better. At the end of the day, it seems like the only thing we really need to do is add a "corrected distance" calculation (according to Algorithm 1 in the paper).
So, next to `core._calculate_squared_distance` (a new function that takes only scalar inputs), we can add a similar `core._calculate_corrected_squared_distance` function that implements the corrected distance. Then all of the high level API functions (i.e., `stump`, `stumped`, etc.) can have a `corrected=False` argument. When `corrected=True` (i.e., `stumpy.stump(T, m, corrected=True)`), this gets passed down to the sub-functions and there would be an `if/else` statement that decides whether `core._calculate_squared_distance` (when `corrected=False`) or `core._calculate_corrected_squared_distance` (when `corrected=True`) gets called.
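To make that shape concrete, here is a rough sketch. The function names, signatures, and the exact correction term are illustrative only (they do not mirror STUMPY's actual internals), and the correction uses the `max(sigma_i, sigma_j)` form that comes up later in this thread:

```python
def _calculate_squared_distance(m, rho):
    # Standard z-normalized squared distance from the Pearson correlation
    return 2.0 * m * (1.0 - rho)

def _calculate_corrected_squared_distance(m, rho, sigma_i, sigma_j, noise_stddev):
    # "Corrected distance" in the spirit of Algorithm 1: subtract the expected
    # noise contribution and clip at zero to avoid negative squared distances
    d2 = 2.0 * m * (1.0 - rho)
    d2 -= (2.0 + 2.0 * m) * noise_stddev**2 / max(sigma_i, sigma_j) ** 2
    return max(d2, 0.0)

def squared_distance(m, rho, sigma_i, sigma_j, noise_stddev=0.0, corrected=False):
    # The kind of if/else dispatch described above
    if corrected:
        return _calculate_corrected_squared_distance(m, rho, sigma_i, sigma_j, noise_stddev)
    return _calculate_squared_distance(m, rho)
```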
This way, we wouldn't really need to unit test the concept of "noisy flat data" (I mean, we would still add a couple of simple tests for sanity). All we really need is to confirm that `core._calculate_corrected_squared_distance` follows the paper precisely and I think that would be it. 🤷‍♂️
I'm sure that I'm missing something due to not having read the paper fully. What are your thoughts?
But correction is not a boolean decision in the paper. It depends on the standard deviation of the noise (it is assumed that it is normal), which is why I suggested passing a float/array element to the function instead.
I see! Thank you for the added information. So, `noise_stddev` would be a threshold value? What would be the default (off) value?
How would this affect unit testing? It sounded like you had some concerns.
Btw, this issue does not necessarily imply that this feature is good to add. If your exploration shows that there is no real benefit then I'd rather focus our energy elsewhere. Your feedback is welcomed.
> I see! Thank you for the added information. So, `noise_stddev` would be a threshold value? What would be the default (off) value?
No, it isn't really a threshold. If I remember the paper correctly, you look at two subsequences that can be written as `X = S + N` and `Y = S + N'`, where `S` is a base signal and `N`, `N'` represent the noise (normally distributed around zero with standard deviation `noise_stddev`), and you look at their z-normalized distance. In this paper they want to remove the effect of the noise term, because the logic is that `X` and `Y` have the same base signal `S`, so their distance should be zero, as would be the case without noise.

They do this correction by computing the expected distance between `X` and `Y`, which depends on the standard deviation of the noise, `noise_stddev`, and subtracting this value from the distance to get the "corrected distance". The reasoning is that, in expectation, this procedure makes the corrected distance between `X` and `Y` zero if we know the right standard deviation of the noise.
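As a self-contained illustration of that reasoning (plain NumPy, not STUMPY code; the exact form of the expected-noise term follows the `(2 + 2m)` expression discussed further down in this thread):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
noise_stddev = 0.1
S = np.zeros(m)                          # a flat base signal
X = S + rng.normal(0, noise_stddev, m)   # X = S + N
Y = S + rng.normal(0, noise_stddev, m)   # Y = S + N'

# z-normalized squared Euclidean distance: large (~2*m) despite the identical base signal
X_z = (X - X.mean()) / X.std()
Y_z = (Y - Y.mean()) / Y.std()
d2 = np.sum((X_z - Y_z) ** 2)

# Subtract the expected noise contribution and clip at zero -> approximately 0
expected_noise = (2 + 2 * m) * noise_stddev**2 / max(X.std(), Y.std()) ** 2
d2_corrected = max(d2 - expected_noise, 0.0)
print(d2, d2_corrected)
```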
I think it is an interesting idea, and it seemed useful. I applied it to data which was mostly flat with noise, so as they point out in the paper the matrix profile was completely useless. I think that if somebody is actually expecting patterns or anomalies in such data it might be helpful.
> How would this affect unit testing? It sounded like you had some concerns.
Probably it shouldn't be that special if you just implement it in the naive MASS algorithm as well, since it is a deterministic algorithm. My only concern is that we really have to pay attention so as not to introduce mistakes into the testing framework.
@mexxexx If this interests you then I say "go for it!"
However, for myself, I think that this one is a "nice to have" feature and, until I've completed more core tasks, I'd rather spend my time focused on other priorities. In other words, there is no rush but you have my full support if you'd like to pursue it
@mexxexx It looks like the original link no longer works and I've updated the link, but I'm unsure if it is pointing to the right paper. Do you happen to remember what the original paper is and whether the updated link looks correct?
I think this is beyond the scope of STUMPY so closing this for now.
Just for completeness the link to the original paper: https://biblio.ugent.be/publication/8605188/file/8605190.pdf (or uploaded here)
@NimaSarajpoor Played with this a bit more over here so I'm reopening in case there is interest.
@seanlaw @mihailescum
A recent paper dealing with the matrix profile has identified that

> when comparing subsequences that are relatively flat and contain noise, the resulting distance is high despite the visual similarity of these subsequences
An extended version of this paper is: Implications of Z-Normalization in the Matrix Profile. The formula provided in Algorithm 1 of the extended version has a typo: the term `(2 + m)` should be replaced with `(2 + 2 * m)`, as shown in eq. (7) of that paper and in section 3.2 of their initial work.
> How would you expose this feature in the API? I played around with the paper and have a working version, but I'm unsure of the best way to implement it. My guess is to extend `stump` with a new parameter called `noise_stddev` that could be either a float or an array with the length of the matrix profile, indicating the noise standard deviation for every subsequence. What do you think about this?
Note that the formula provided in Algorithm 1 requires `sigma_noise` as one of its inputs. I THINK `sigma_noise` is for the whole time series. We can take two approaches for adding this new feature to the library:

Approach (I): Add a parameter `handle_noise` (with type bool, default `False`) to `stumpy.stump`. If the parameter is set to `False`, then `sigma_noise` will be set to 0, and if the parameter is set to `True`, `sigma_noise` can be set to `config.SIGMA_NOISE`, which, by default, is set to the 5th percentile of the standard deviation values of all subsequences (see section 4.2 in the paper).

Approach (II): Add a parameter `sigma_noise` (with type float, default 0). Then the user can change it. If they do not know how to estimate it, one idea is to point them to the 5th percentile method in the docstring so they can go and calculate it themselves.
I think the second approach is better as it somehow tells the users that they should know what they are doing. The parameter with type bool might give the users a false hope that the noise can be magically handled! And, I think that can be dangerous.
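For reference, a minimal sketch of the 5th-percentile heuristic described in Approach (I) (plain NumPy; `estimate_noise_stddev` is a hypothetical helper, not an existing STUMPY or `config` function):

```python
import numpy as np

def estimate_noise_stddev(T, m, q=5):
    # Standard deviation of every length-m subsequence of T
    subseq_stddevs = np.array([np.std(T[i : i + m]) for i in range(len(T) - m + 1)])
    # q-th percentile of those values (q=5 per section 4.2 of the paper)
    return np.percentile(subseq_stddevs, q)
```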
> I think it is an interesting idea, and it seemed useful. I applied it to data which was mostly flat with noise, so as they point out in the paper the matrix profile was completely useless. I think that if somebody is actually expecting patterns or anomalies in such data it might be helpful.
Correct. As shown here, eliminating the noise can be useful. More investigation in this regard would be helpful.
@NimaSarajpoor Would a default value of 0.0 give us the original z-norm distance? I prefer short parameter names and would prefer `noise` over `sigma_noise`.
Frankly, nobody has really asked about this. It is likely an extreme edge case for users
> Would a default value of 0.0 give us the original z-norm distance?
That is correct. With a default value of 0.0, we will get the original z-norm distance.
> I prefer short parameter names and would prefer `noise` over `sigma_noise`.
Sure. That makes sense!
> Frankly, nobody has really asked about this. It is likely an extreme edge case for users
You are right! And, I think the alternative approach (i.e., increasing the window size `m`) is not bad. You have more insight into users' experience with STUMPY, so please feel free to close this issue again if you think it unnecessarily increases the computation time.
Also, if you want to know what it would look like in the implementation, it would probably be something like:

```python
pearson[i, j] = pearson[i, j] + noise * constant_value * A
# where A = (1 / max(sigma_i, sigma_j)) ** 2
```
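Just to connect this with the squared-distance form of the correction (my own algebra; treat it as a sanity check rather than the paper's exact expression): since the z-normalized squared distance can be written as `d^2 = 2 * m * (1 - pearson)`, subtracting `(2 + 2 * m) * sigma_noise**2 / max(sigma_i, sigma_j)**2` from `d^2` is the same as adding `((m + 1) / m) * sigma_noise**2 / max(sigma_i, sigma_j)**2` to the Pearson correlation, which matches the shape of the pseudocode above.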
Hello and thank you all for your work on this project!
I am interested in the papers discussed in this issue because I think they may help me when analyzing my data. Also I want to share my insights on the "implications of z-normalization" as it could shed some light on another recurrent issue, the handling of constant subsequences.
The explanation turned out to be quite verbose; I hope that I chose the right level of detail and complexity and that it can be of help. Once again, thank you for your great work with matrix profiles.
To understand why constant and almost-constant subsequences are difficult to handle, I want to address the incorrect visualization that the space of z-normalized subsequences is R^m \ {0}. Actually it is more like a circle or a sphere.
Let's start with subsequences of length m=3. The space we start with is R^3, and let's name the 3 points/coordinates in the sequence x, y and z.
The first step is to subtract the mean of the subsequence. We subtract the same amount from x, y and z until x+y+z=0. It is a projection onto the plane x+y+z=0, parallel to the axis x=y=z. *(figure: example of projection)* We can note that during this projection, many sequences will land on the same point: the sequences that are just shifted versions of one another are now considered equal. We now have a subspace akin to R^{m-1} and all of our subsequences now have a mean of 0. This first step is very natural; the effect is that sequences that look the same but are shifted by some amount will be considered close.
The second step (and the crucial step for this issue) is to divide by the standard deviation. Because the mean is 0, the standard deviation is (up to a constant factor) sqrt(x^2+y^2+z^2), i.e., the Euclidean distance between our subsequence and the origin, which is the norm of our subsequence. This operation is a "normalization": by dividing by the norm we don't change the direction of our vector (or the shape of our subsequence), but we change the length of the vector (or amplitude of the subsequence) so that the norm is one. All sequences land on x^2+y^2+z^2 = 1, which is a circle (remember we are on the plane x+y+z=0). *(figure: example of normalization)* Once again, we go down a dimension, and many subsequences will land on the same point: subsequences that are scaled versions of one another are now considered equal. All of our subsequences now have a mean of 0 and a norm/standard deviation of 1. This second step is a good idea in many cases; the effect is that sequences that look the same but are scaled up or down will be considered close.
To generalize to higher dimensions (m > 3) we start with the space R^m, then we have a projection on a subspace akin to R^{m-1}, and then a projection on the unit sphere of that subspace.
I believe there is a third step in your implementation with a multiplication by sqrt(m), so we land on a sphere of radius sqrt(m) instead of radius 1. It should not have any effect when comparing subsequences of fixed length, because all the distances are multiplied by a fixed amount.
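A quick numerical check of this geometric picture (illustrative NumPy only, not STUMPY code):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
x = rng.normal(size=m)

z = (x - x.mean()) / x.std()                       # z-normalization
print(np.isclose(z.mean(), 0.0))                   # True: mean removed (projection onto the hyperplane)
print(np.isclose(np.linalg.norm(z), np.sqrt(m)))   # True: lands on the sphere of radius sqrt(m)

# A shifted and scaled copy lands on exactly the same point of that sphere
y = 3.0 * x + 10.0
z_y = (y - y.mean()) / y.std()
print(np.allclose(z, z_y))                         # True
```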
The downside of the second step is that subsequences close to the origin (= with very little standard deviation, almost constant) can be close together before the normalization and very far apart after the normalization. Moreover, the constant subsequences can't be normalized (we can't divide by a standard deviation of 0), and we could say that we don't know in which direction of the sphere we should put them.
A solution is to not normalize constant subsequences and leave them at the origin. In this case, they will be all alone at the center of the sphere, at a constant distance from every non-constant subsequence (the distance is sqrt(m) in your case, I believe). But now the almost constant subsequences are quite far from the constant subsequences (after normalization), and that could lead to some problems, for example when there are rounding errors(?). So we can set a threshold for "almost constant subsequences" that will be set to the origin, but now the subsequences just under the threshold are far apart from the subsequences just above the threshold.
As you can see, it is an annoying problem with no easy fix, and there will always be some subsequences extremely close before the normalization and quite far apart after the normalization. The threshold should maybe be chosen to mitigate rounding-error problems, but we have to know that there will likely be some annoying edge cases. The discontinuity in the space of z-normalized subsequences makes it difficult to have something that behaves nicely.
@Vincent-SV I really enjoyed your explanation! I think the "projection" process helped a lot!
You are right regarding the edge cases. Even if we try to use a threshold, we will have edge cases. One simple way to produce such an edge case is to scale down some subsequences by a large amount, say 1,000,000x, so that they become almost flat.
Comparing the alternatives for distances:
The z-normalized distance does very well with (non-flat) patterns even if they are shifted and/or scaled up or down. The constant sections are not well handled, and there will always be some annoying edge cases. The almost flat noisy sections will effectively be projected onto a random point of the sphere/space of z-normalized subsequences. If m is quite high (>= 50), the (m-2)-dimensional sphere is a surprisingly large space, so a random point on this sphere will land at a medium distance from everything else. This gives us the medium portions of the matrix profile. If m is quite low (<= 10), the sphere will be quite crowded and we will see some good matches between random subsequences, and that will mess up our analysis. When handling flat noisy subsequences, this difference between m being small or big can be quite surprising. I'm not sure that the current mode of operation (flat -> random and random -> medium distance) is really satisfying, because if we are unlucky or using a small m we can find some random motifs in the noise. Using a bigger m is not always an option if we have low temporal resolution and we don't want motifs to overlap. What could be interesting is if all flat subsequences were close to each other; they would constitute one big motif that is relatively easy to remove, and then we could study the other motifs. (The flat motif could even be a default motif.)
The non-normalized distance puts all almost-constant subsequences pretty close together (including constant subsequences), does not suffer from discontinuities and handles these issues pretty well. On the other hand, if a pattern repeats but is scaled up a bit (even 1.1x) and has a large amplitude, the match will be very bad, worse than between relatively flat subsequences of totally different shapes. This approach does not work well with patterns that are repeating but scaled a bit up or down.
About parameter p: when we put p close to 1, we put more emphasis on the small inaccuracies of the match, in the sense that many small differences between 2 subsequences can make a big difference in the end. When we push p towards infinity, we dismiss the small differences to the point that only THE biggest difference between the 2 subsequences is relevant. p=2 is a middle ground, where small differences count for very little but many medium differences can outweigh one single big difference. The points above about normalization and non-normalization still stand.
The complexity invariant distance (CID) (from what I understand about it) still has some normalization problems with constant subsequences and division by 0. The normalization step is the same projection onto the same sphere as described above, with an extra step to accentuate the distance if the complexity of the 2 subsequences is different. That means that 2 similar sequences will be further apart if one is noisy and the other is not. It will most likely not solve the issues with flat noisy data.
The noise correction proposed in the articles above looked interesting to me. It decreases the distance between all subsequences, but more so between flat subsequences. It can lead to negative distances, but the intended way of handling this case is to set all negative distances to 0. This means that there will be a lot more perfect matches: matches within noise distance of one another will be considered perfect matches. The triangle inequality does not apply at all, and we can have d(a, b) = 0, d(b, c) = 0 but d(a, c) != 0, so watch out for that. On the bright side, the flat not-too-noisy sequences will perfectly match the constant sequence, and there is no added discontinuity in this technique, so I believe it could help mitigate the discontinuity issues of the z-normalized space. If there is no discord in the data, the matrix profile could turn out to be a flat 0, but analyzing the motifs (via the distance profile?) would still show the sequences that are close together (= interesting motifs).
Also, to add a bit about "noise": I believe that if we are using real-world data there will always be some noise, even if it is very small. Even the mathematical cos(n/100) over multiple periods has no perfect matches because of some "noise" coming from non-perfect sampling. I would argue that the problem is not really the noise, it is the flat subsequences, because they will amplify the differences. In that sense, the "noise" handling is very closely related to the constant subsequence handling.
Some data I work with is indoor temperatures or even fridge temperatures, and it often has flat subsequences (with a bit of noise). Maybe choosing a bigger m could solve some of my issues, but then the actual time window would be too long for what I want to do. Maybe a higher sampling frequency could also solve some of my issues, but not doing so is preferable.
I guess I am doing a verbose +1 for this technique, but I'm not even sure I will use it if you implement it, so don't take it too seriously just yet. Still, I think it is a promising technique to give better mathematical properties and better flat sequence handling to the whole matrix profile approach, and anyway I hope you found the explanations interesting and helpful!
> I would argue that the problem is not really the noise, it is the flat subsequences, because they will amplify the differences. In that sense, the "noise" handling is very closely related to the constant subsequence handling.
I think you are correct here. @seanlaw is trying to handle this issue by adding the parameter `T_subseq_isconstant`, where the user can pass a Boolean array or a custom function (e.g., std less than some value). I think it actually addresses what you are saying and handles (almost-)constant subsequences.
> I think you are correct here. @seanlaw is trying to handle this issue by adding the parameter `T_subseq_isconstant`, where the user can pass a Boolean array or a custom function (e.g., std less than some value). I think it actually addresses what you are saying and handles (almost-)constant subsequences.
Indeed. Given that figuring out whether a subsequence should be "constant" or not can be nuanced, we've elected to give the user full control/responsibility over making that decision. The parameter `T_subseq_isconstant` resolves all of this and, once added, STUMPY makes no assumptions on what is constant or not except that a subsequence with a zero stddev is constant. If the user disagrees, they can try to override this basic behavior. It completely removes the guess work for us.
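Purely as an illustration of the kind of user-side mask being described here (hypothetical helper; the actual STUMPY parameter name and signature may differ from the `T_subseq_isconstant` discussed in this thread):

```python
import numpy as np

def flat_subseq_mask(T, m, threshold=1e-3):
    # One Boolean per subsequence: True if its standard deviation is "too small"
    stddevs = np.array([np.std(T[i : i + m]) for i in range(len(T) - m + 1)])
    return stddevs < threshold
```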
I reread issues #591 and #695.

IMHO, giving the user the responsibility to specify the constant subsequences is a nice way to make the code flexible, but it pushes a weakness/difficult point of the matrix profile back onto the user. The user will (in most cases) have to find a way to do this automatically. Moreover, the user will (maybe) have to deal with edge cases, and the boolean nature of the choice (mark a sequence as constant, or amplify the noise by a lot) can be annoying.
The fact that the noise cancellation technique replaces this boolean choice with a continuous distance correction makes it very interesting to me. The noise_stddev choice is a bit like choosing a threshold for constant subsequences, but more nuanced (because it will not be a boolean effect) and it can have a sensible default (5th percentile of all standard deviation values of all subsequences) so that the problem is entirely handled for the user in most cases.
(I see the noise_stddev choice as a way to add a bit of blur/leeway when comparing two subsequences, with a cool effect of removing most problems with flat subsequences)
Edit: On the other hand, I have not experimented with it. Once again, I see it as a really promising technique, but I can't be sure it will totally solve this annoying discontinuity; I would wait for more feedback to (maybe) back me up and push for (or against) this technique.
@Vincent-SV You bring up valid points and this is exactly the kind of discussion/deeper line of thinking that we need before we consider how we might support this feature in STUMPY (or, perhaps, not support it if, say, it only works for toy examples). To be clear, the point of the `T_subseq_isconstant` work isn't to address this (noisy data) issue but, instead, to help disambiguate/separate/distinguish the basic case where there are subsequences that are perfectly constant (which many users ask about) from the harder/ambiguous/nuanced edge cases where subsequences are relatively flat (but not perfectly constant). As you can see, you may be only the 4th person that we know of to have commented on this issue, but this is very useful feedback! Please let us know what you find (especially if you think it is NOT a good idea).
Hi everyone, I'm an engineer at Sentry.io (also open source!) and our team is using Stumpy for anomaly detection for alerting. Please feel free to check it out here: https://github.com/getsentry/seer.
We were also running into this issue of noisy data impacting the results of our algorithm and found this paper too. I think what @seanlaw suggested with the `noise_stddev` option as a float is the way to go, specifically offering one of the two methodologies that @NimaSarajpoor suggested:
> Approach (I): Add a parameter `handle_noise` (with type bool, default `False`) to `stumpy.stump`. If the parameter is set to `False`, then `sigma_noise` will be set to 0, and if the parameter is set to `True`, `sigma_noise` can be set to `config.SIGMA_NOISE`, which, by default, is set to the 5th percentile of the standard deviation values of all subsequences (see section 4.2 in the paper).
>
> Approach (II): Add a parameter `sigma_noise` (with type float, default 0). Then the user can change it. If they do not know how to estimate it, one idea is to point them to the 5th percentile method in the docstring so they can go and calculate it themselves.
I actually think the first approach would be a better UX for most users as they can adjust it as needed, but don't have to reinvent the wheel for the 5th percentile method if not.
Would love to see this implemented if not already and can assist in this process!
@aayush-se Thank you for sharing your comments!
It's been a while since I've looked at the paper, but one other consideration is how noise affects the calculation when the distance is non-normalized (i.e., without z-normalization). How does the match change? Given that STUMPY allows for Minkowski distances, for consistency, we can only add new noise-reduction parameters that also support Minkowski distances. We must be thorough. Otherwise, it may not be worth pursuing(?).
Having said that, we welcome PRs for review (even if it is to convey/document the "how" and not meant to be merged). At this point, I think we need a notebook that reproduces the original published work.
@JaKasb had also provided some excellent points in the past on this topic:
I also experimented with the noise compensation method. Many aspects in the realm of matrix profile algorithms suffer from instability in edge cases:
How to choose sigma_n? Set sigma_n too small and the correction has no effect. Set sigma_n too large and the correction overshoots and results in negative Euclidean distances. Even with a reasonable sigma_n, the compensation needs safeguard measures to protect against negative distances. I derived a supremum for sigma_n by assuming a Pearson correlation of -1 and solving for sigma_n in E[D²] = max_possible_distance².
```python
import numpy as np

# Upper bound ("supremum") on sigma_n, obtained by assuming a Pearson correlation
# of -1 and solving E[D^2] = max_possible_distance^2 = 4*m for sigma_n
# https://www.wolframalpha.com/input?i=solve+%282*m+%2B+2%29+*%28x%5E2%29%2F%28d%29+%3D+4m+for+x+and+d%3E0+and+m%3E0
_denom = np.maximum(sigma_i, sigma_j) ** 2
supremum_sigma_n = np.sqrt(2) * np.sqrt(_denom * m / (m + 1))
sigma_n = np.clip(sigma_n, a_min=0, a_max=supremum_sigma_n)

# Sanity check: at the supremum, the expected squared distance equals 4*m
E_d_squared_max = (2 * m + 2) * (supremum_sigma_n**2) / _denom
assert np.allclose(E_d_squared_max, 4 * m)
```
However, this upper limit on sigma_n is useless in practice, because the expected value of the Pearson correlation is approximately 0, as opposed to -1.
Assumptions: The method flat-out assumes Gaussian noise; however, constant subsequences often stem from quantization errors, and the probability density function of the quantization error is uniform instead of Gaussian. https://www.allaboutcircuits.com/technical-articles/quantization-nois-amplitude-quantization-error-analog-to-digital-converters/
A recent paper dealing with the matrix profile has identified that

> when comparing subsequences that are relatively flat and contain noise, the resulting distance is high despite the visual similarity of these subsequences
This paper offers a simple way to correct for this by using the maximum stddev between the two subsequences and then squaring it instead of multiplying the stddev from each subsequence together (see Algorithm 1 in the paper above).
This may be a feature worth considering in the future