Netflix / Surus

Apache License 2.0

Different result between R and Java version #16

Open ghost opened 8 years ago

ghost commented 8 years ago

I have a time series (56 observations) like this:

 data ={    3.197097, 3.029077, 3.005744, 2.969745, 2.988609, 2.97782, 2.933626,
               3.185347, 3.241275, 3.117891, 3.071268, 3.118897, 3.152572, 3.232348,
               3.424237, 3.323964, 3.302709, 3.341312, 3.341527, 3.375134, 3.543823,
               3.879864, 3.420371, 3.294217, 3.49587, 3.521571, 3.599039, 3.925218,
               3.99248, 3.689928, 3.749015, 3.583267, 3.704804, 3.742834, 3.599793,
               3.699821, 3.630572, 3.684399, 3.725435, 3.743818, 3.744296, 3.667758,
               3.899343, 3.724631, 3.551779, 3.557395, 3.748661, 3.569791, 3.520395,
               3.529122, 3.604996, 3.623308, 3.586358, 3.793575, 3.837355, 3.753702}

When I run it with R:

library(RAD)

data = c(3.197097, 3.029077, 3.005744, 2.969745, 2.988609, 2.97782, 2.933626, 3.185347, 3.241275, 3.117891, 3.071268, 3.118897, 3.152572, 3.232348, 3.424237, 3.323964, 3.302709, 3.341312, 3.341527, 3.375134, 3.543823, 3.879864, 3.420371, 3.294217, 3.49587, 3.521571, 3.599039, 3.925218, 3.99248, 3.689928, 3.749015, 3.583267, 3.704804, 3.742834, 3.599793, 3.699821, 3.630572, 3.684399, 3.725435, 3.743818, 3.744296, 3.667758, 3.899343, 3.724631, 3.551779, 3.557395, 3.748661, 3.569791, 3.520395, 3.529122, 3.604996, 3.623308, 3.586358, 3.793575, 3.837355, 3.753702)

a=AnomalyDetection.rpca(data, frequency = 7)

S_matrix=a$S_transform

View(data.frame(S_matrix))

It returns a vector of length 55 (one fewer than the number of observations):

(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.130887399206267, 0, 0, 0, 0, 0.00318375301259443, 0, -0.0885939624397428, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00646893256411638, 0, 0)

The 4 non-zero entries mean we got 4 anomaly points.


When it comes to Java:

I only changed the input in the RAD_test file.

double[] ts = new double[] { 3.197097, 3.029077, 3.005744, 2.969745, 2.988609, 2.97782, 2.933626, 3.185347, 3.241275, 3.117891, 3.071268, 3.118897, 3.152572, 3.232348, 3.424237, 3.323964, 3.302709, 3.341312, 3.341527, 3.375134, 3.543823, 3.879864, 3.420371, 3.294217, 3.49587, 3.521571, 3.599039, 3.925218, 3.99248, 3.689928, 3.749015, 3.583267, 3.704804, 3.742834, 3.599793, 3.699821, 3.630572, 3.684399, 3.725435, 3.743818, 3.744296, 3.667758, 3.899343, 3.724631, 3.551779, 3.557395, 3.748661, 3.569791, 3.520395, 3.529122, 3.604996, 3.623308, 3.586358, 3.793575, 3.837355, 3.753702};

I got the S_matrix below:

(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.129439154052943, 0, -0.0123981691519606, 0, 0, 0, 0.168267483707591, 0.156119414262503, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.104831613422113, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

In this case, we had 5 anomaly points.

As you can see, the Java version returns a vector very different from the R version. The difference in length (55 for R versus 56 for Java) doesn't matter much, but the difference in values is a big deal. The two versions give different results (4 versus 5 anomaly points), and that confuses me.
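To compare the two outputs position by position rather than only by count, one option is to list the indices of the non-zero entries in each S vector. The following is only a sketch: the class and method names are made up for illustration and are not part of Surus, and the input is a short toy vector, not the real output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Surus.
class SparseEntriesSketch {
    // Returns the 0-based positions of the non-zero entries of s.
    static List<Integer> nonZeroIndices(double[] s) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            if (s[i] != 0.0) {
                idx.add(i);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        // Toy vector standing in for an S vector returned by either version.
        double[] s = {0, 0, -0.13, 0, 0.003};
        System.out.println(nonZeroIndices(s)); // prints [2, 4]
    }
}
```

Running this on both the R and the Java vectors would show which positions each version flags. Keep in mind that the two vectors have different lengths, so their indices may not refer to the same observations.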

I hope you can help me out.

Thank you so much.

zhangxiangnick commented 7 years ago

The reason for 55 points instead of 56 is that the R implementation runs a Dickey-Fuller stationarity test and has an extra step that makes the time series stationary (it differences the series first and then drops one point). However, I think the R version of the Dickey-Fuller test has a bug; see my comments here: https://github.com/Netflix/Surus/issues/14. If you turn off this test for both R and Java, you should get consistent results.
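To illustrate why differencing costs one point, here is a minimal Java sketch of first-order differencing. It is not the Surus code: the class and method names are hypothetical, and it does not reproduce the Dickey-Fuller test itself. It only shows that a series of n values yields n - 1 differences, which is how 56 observations become 55.

```java
import java.util.Arrays;

// Hypothetical illustration, not taken from the Surus code base.
class DifferencingSketch {
    // First-order differencing: out[i] = x[i + 1] - x[i].
    // The result has one element fewer than the input.
    static double[] difference(double[] x) {
        if (x.length < 2) {
            return new double[0];
        }
        double[] out = new double[x.length - 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = x[i + 1] - x[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // First four observations from the series in this issue.
        double[] ts = {3.197097, 3.029077, 3.005744, 2.969745};
        double[] d = difference(ts);
        System.out.println(ts.length + " -> " + d.length); // prints "4 -> 3"
        System.out.println(Arrays.toString(d));
    }
}
```

If one version differences the input and the other does not, the two are decomposing different series, so their S vectors would not be expected to match.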