Open barrybecker4 opened 5 years ago
The source of the problem seems to be movStd in Mass.java. The moving standard deviation can be 0 or even slightly less than zero in some cases (perhaps because of numerical error). Then later, at the very end of the mass method, we divide by the stdDev vector. If it has 0's (or NaN's) in it, then the result will have Infinity (or NaN's). I propose that we never let the values in the stdDev vector ever go completely to zero. Instead use some epsilon. I tried this and it seems to work for my simple straight line example. At least it seems to give a better result in my opinion.
If I make the epsilon 1e-10 or 1e-40, the computed result is now
3.1623, 3.1623, 3.1623, 3.1623, 3.1623, 3.1623
for distances, and
5.0000, 5.0000, 5.0000, 0, 1.0000, 2.0000
for indices.
Though the above suggestion is an improvement, it does not fix all the cases of NaN and Infinity. Another case, is for example, an upward sloping line like this:
1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4
For this the profile is:
NaN, NaN, 0.0035, NaN, NaN, POSITIVE_INFINITY, POSITIVE_INFINITY, POSITIVE_INFINITY 3.0000, 4.0000, 4.0000, 5.0000, 6.0000, 7.0000, 7.0000, 7.0000
The problem in this case seems to be at the end of the mass method, where we do the following
INDArray dot = Nd4j.create(realDot, new int[]{1, realDot.length});
INDArray res = dot.get(NDArrayIndex.interval(m - 1, dot.length())).div(stdv);
INDArray r1 = res.neg().add(m).mul(2);
INDArray r2 = sqrt(r1);
return r2;
The values in res are all very close to the value of m, but not quite, so when r1 is computed the values are very near 0, and even less than 0 in some cases. The sqrt makes those values NaN.
If I change add(m) to add(m + 0.01), then the matrix profile is this
0.4472, 0.4472, 0.4472, 0.4472, 0.4472, 0.4471, 0.4471, 0.4471 5.0000, 5.0000, 5.0000, 6.0000, 1.0000, 1.0000, 1.0000, 1.0000
This seems better, but I don't know if its correct. It also effects all other test cases slightly, and there are still other cases where NaN and Infinity appear.
I just read the slides about MASS in https://www.cs.ucr.edu/~eamonn/Matrix_Profile_Tutorial_Part2.pdf again. There is a slide on the 3 sources of numerical error that can arise. I think the current java code does not have safeguards against all these forms of errors. I will try to add some.
I was able to address the second 2 sources of numerical instability by adding
BooleanIndexing.replaceWhere(res, EPS, Conditions.lessThanOrEqual(EPS));
Right before the 2 places where sqrt is taken in the java mass method. This prevents the stdDev from becoming 0, and prevents a negative distance from becoming NaN in the final result. All results for distances and indices in my different manually added simple test cases work much better now. There are no more NaN's or Infinities in the results.
The first case of instability is not something we have encountered yet in our test datasets, and am not much concerned about it for the time being. PR coming shortly.
Fixed ,Merged.
While adding more tests, I noticed that there are a number of cases where the distance part of the matrix profile can have Infinity and NaN values. One simple example (but not the only one) is a straight line.
If I run matrix profile on this straight line series of 10 y values
with a window of 5, then I get MP distance values of
And MP index values of
There are 6 values in the profile (as expected). The length should be 10 – windowSize + 1 = 6. I don’t understand yet why there are NaN values. I was expecting all 0’s for the distances. Maybe it has to do with z-normalization.
I will look at the code more closely and try to make a proposal and perhaps PR to avoid these values.