datumbox / datumbox-framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
http://www.datumbox.com/
Apache License 2.0
1.09k stars 282 forks source link

Possible Error in Shapiro-Wilk P-Value #33

Closed oliver-krauss closed 5 years ago

oliver-krauss commented 5 years ago

Hi,

I tried out your Shapiro-Wilk implementation as I needed to calculate some values for a paper-submission. I did cross reference it with the Real-Statistics Excel Plugin (http://www.real-statistics.com/) as well as several online tools that allow use of Shapiro-Wilk.

If you run it with the following values: 488.0, 486.0, 492.0, 490.0, 489.0, 491.0, 488.0, 490.0, 496.0, 487.0, 487.0, 493.0 The results should be: W -> 0.944486 P -> 0.55826969...

However the P value is 0.44173030948

The offending line is here: https://github.com/datumbox/datumbox-framework/blob/957560901f3c87d3e9f6760263644a4d70b0a3b8/datumbox-framework-core/src/main/java/com/datumbox/framework/core/statistics/nonparametrics/onesample/ShapiroWilk.java#L257

It should actually be m-y pw=ContinuousDistributions.gaussCdf((m-y)/s)

datumbox commented 5 years ago

Hi Oliver,

Thanks for reporting. You are right. It's been a while since I wrote this so I did a few things to confirm the problem.

I tried your values using this online tool and it reports a p-value of 0.557475. Note that the discrepancy on the 3rd decimal can be attributed to the use of different approximations.

I also compared my implementation directly with the one of R: shapiro.test(c(488.0, 486.0, 492.0, 490.0, 489.0, 491.0, 488.0, 490.0, 496.0, 487.0, 487.0, 493.0))

The result is p-value = 0.5583

After applying your patch, my implementation reports a p-value of 0.55826969.

I can patch it, but if you prefer send me a Pull-request and I'll merge it ASAP. I can also issue a hotfix if you want to unblock your work.

oliver-krauss commented 5 years ago

Hi Vasilis,

thanks for the quick reply and confirmation. Take your time, no hotfix necessary for me. I just wanted to notify you.

datumbox commented 5 years ago

I patched it on the develop branch and also pushed a new snapshot version on the repo (0.8.2-SNAPSHOT). I'll probably take care of a few more minor improvements and release a stable version soon.

Thanks for the help! Vasilis

oliver-krauss commented 5 years ago

Ok. I did a little deeper dive into this and I think the Bugfix I proposed just introduced another bug.

the ContinuousDistributions.gaussCdfcalculates the cumulative probability of a gaussian (normal) distribution right? In which case m would be the mean of the distribution an s would be the sigma.

In this case the original line pw=ContinuousDistributions.gaussCdf((y-m)/s); is CORRECT, but fails when the mean becomes negative (which it is for the provided example). When switching from y-m to m-y this is inadvertedly fixed as this moves the value to the "correct side" of the mean, as long as the mean is negative.

Unfortunately I haven't been able to produce any example where m becomes positive, so the fix may still apply, but the safer fix would be:

if (m < 0.0) {
  m *= -1.0;
  y *= -1.0;
}
pw=ContinuousDistributions.gaussCdf((y-m)/s);
datumbox commented 5 years ago

I think I will need to investigate this more thoroughly to align the behaviour of the class with the original implementation. I will be investigating this later this week.

Cheers

datumbox commented 5 years ago

Could you please check again the latest develop branch? I believe it's now patched.

The results agree with R for the following examples:


    Shapiro-Wilk normality test

data:  c(33.4, 33.3, 31, 31.4, 33.5, 34.4, 33.7, 36.2, 34.9, 37)
W = 0.95509, p-value = 0.7288

    Shapiro-Wilk normality test

data:  c(35, 45, 55, 58, 61, 63, 65, 68, 70, 72, 74, 86)
W = 0.97107, p-value = 0.9216

    Shapiro-Wilk normality test

data:  c(488, 486, 492, 490, 489, 491, 488, 490, 496, 487, 487, 493)
W = 0.94449, p-value = 0.5583
datumbox commented 5 years ago

@oliver-krauss Did you manage by any chance to check the fix? It should be returning the right values now.

oliver-krauss commented 5 years ago

Hi, sorry for the delay was on a trip. I can confirm that this works with my own values as well, which I did cross reference with different tools as well.

datumbox commented 5 years ago

Awesome thanks for reporting and for contributing to the fix. I will take care of a couple of remaining updates and issue a stable version soon. For now the latest 0.8.2-SNAPSHOT version contains the patch.