ensozos / Matrix-Profile

A Java library for Matrix Profile
https://ensozos.github.io/Matrix-Profile/
MIT License
19 stars 7 forks source link

How avoid NaN and Infinity values in the distance profile? #17

Open barrybecker4 opened 5 years ago

barrybecker4 commented 5 years ago

While adding more tests, I noticed that there are a number of cases where the distance part of the matrix profile can have Infinity and NaN values. One simple example (but not the only one) is a straight line.

If I run matrix profile on this straight line series of 10 y values

1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2

with a window of 5, then I get MP distance values of

NaN, NaN, NaN, POSITIVE_INFINITY, 3.1623, POSITIVE_INFINITY

And MP index values of

2.0000, 3.0000, 4.0000, 5.0000, 1.0000, 5.0000

There are 6 values in the profile (as expected). The length should be 10 – windowSize + 1 = 6. I don’t understand yet why there are NaN values. I was expecting all 0’s for the distances. Maybe it has to do with z-normalization.

I will look at the code more closely and try to make a proposal and perhaps PR to avoid these values.

barrybecker4 commented 5 years ago

The source of the problem seems to be movStd in Mass.java. The moving standard deviation can be 0 or even slightly less than zero in some cases (perhaps because of numerical error). Then later, at the very end of the mass method, we divide by the stdDev vector. If it has 0's (or NaN's) in it, then the result will have Infinity (or NaN's). I propose that we never let the values in the stdDev vector ever go completely to zero. Instead use some epsilon. I tried this and it seems to work for my simple straight line example. At least it seems to give a better result in my opinion.

If I make the epsilon 1e-10 or 1e-40, the computed result is now

3.1623, 3.1623, 3.1623, 3.1623, 3.1623, 3.1623

for distances, and

5.0000, 5.0000, 5.0000, 0, 1.0000, 2.0000

for indices.

barrybecker4 commented 5 years ago

Though the above suggestion is an improvement, it does not fix all the cases of NaN and Infinity. Another case, is for example, an upward sloping line like this:

1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4

For this the profile is:

NaN, NaN, 0.0035, NaN, NaN, POSITIVE_INFINITY, POSITIVE_INFINITY, POSITIVE_INFINITY 3.0000, 4.0000, 4.0000, 5.0000, 6.0000, 7.0000, 7.0000, 7.0000

The problem in this case seems to be at the end of the mass method, where we do the following

        INDArray dot = Nd4j.create(realDot, new int[]{1, realDot.length});
        INDArray res = dot.get(NDArrayIndex.interval(m - 1, dot.length())).div(stdv);
        INDArray r1 = res.neg().add(m).mul(2);
        INDArray r2 = sqrt(r1);
        return r2;

The values in res are all very close to the value of m, but not quite, so when r1 is computed the values are very near 0, and even less than 0 in some cases. The sqrt makes those values NaN.

If I change add(m) to add(m + 0.01), then the matrix profile is this

0.4472, 0.4472, 0.4472, 0.4472, 0.4472, 0.4471, 0.4471, 0.4471 5.0000, 5.0000, 5.0000, 6.0000, 1.0000, 1.0000, 1.0000, 1.0000

This seems better, but I don't know if its correct. It also effects all other test cases slightly, and there are still other cases where NaN and Infinity appear.

ensozos commented 5 years ago

I 'm not sure if this the way to fix the issue. I saw that MASS has different distance profile computation from ours. But there is also this implementation which is similar to the current version we have. We have to make sure if these versions are equivalent.

barrybecker4 commented 5 years ago

I just read the slides about MASS in https://www.cs.ucr.edu/~eamonn/Matrix_Profile_Tutorial_Part2.pdf again. There is a slide on the 3 sources of numerical error that can arise. I think the current java code does not have safeguards against all these forms of errors. I will try to add some.

barrybecker4 commented 5 years ago

I was able to address the second 2 sources of numerical instability by adding BooleanIndexing.replaceWhere(res, EPS, Conditions.lessThanOrEqual(EPS)); Right before the 2 places where sqrt is taken in the java mass method. This prevents the stdDev from becoming 0, and prevents a negative distance from becoming NaN in the final result. All results for distances and indices in my different manually added simple test cases work much better now. There are no more NaN's or Infinities in the results. The first case of instability is not something we have encountered yet in our test datasets, and am not much concerned about it for the time being. PR coming shortly.

ensozos commented 5 years ago

Fixed ,Merged.