Gaussian code not matching scikit-learn output?

Tradeshift / blayze

A fast and flexible Naive Bayes implementation for the JVM

MIT License

Gaussian code not matching scikit-learn output? #17

Closed liufuyang closed 5 years ago

liufuyang commented 5 years ago

I seem to have found a few issues, though I don't know if they really matter. I will create two issues and keep the discussions separate on their own issue pages.

Please take a look at the first test, test_gaussian_by_comparing_scikit_learn_output, added in PR: https://github.com/Tradeshift/blayze/pull/16

Why does our Gaussian output look different from the scikit-learn output?

I also can't understand the code in the Gaussian feature that calculates its log probability. I am looking at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.7503&rep=rep1&type=pdf and also http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

but I can't see how our delta, delta2 and m2 match the equations in those papers.
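For reference, the names delta, delta2 and m2 resemble the variables of the widely used Welford-style streaming variance update. The following is a minimal Python sketch of that update, assuming this is what the names refer to; it is not copied from blayze, and the class name RunningStats and the sample temperatures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Streaming mean/variance accumulator (Welford-style update)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean    # deviation from the mean BEFORE the update
        self.mean += delta / self.n
        delta2 = x - self.mean   # deviation from the mean AFTER the update
        self.m2 += delta * delta2

    def population_variance(self) -> float:
        return self.m2 / self.n          # divisor n

    def sample_variance(self) -> float:
        return self.m2 / (self.n - 1)    # divisor n - 1


stats = RunningStats()
for x in [-102.0, -100.0, -98.0]:  # hypothetical temperatures, not the PR's test data
    stats.update(x)
```

The product delta * delta2 is what keeps m2 equal to the sum of squared deviations from the latest mean without a second pass over the data. The sketch exposes both divisors because a mismatch between n and n - 1 is one way two implementations can disagree slightly; whether that applies here is not established in this thread.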

rasmusbergpalm commented 5 years ago

I don't know why sklearn gives those probabilities, but just eyeballing them I think they look off. Why should it be 98% sure of "sweater" at 0 degrees, when there's equal evidence for sweater around -100 and for t-shirt around +100? I'd think it should be 50/50, which I'm guessing is also what Blayze says. Maybe partial_fit isn't doing what you think it is.

liufuyang commented 5 years ago

The big difference is introduced by a slight difference in variance!

liufuyang commented 5 years ago

Notice the 98.0 vs. 98.2 difference!

Our model output is {t-shirt=0.04812839716239359, sweater=0.9518716028376064}. I am wondering why there is this 0.95 vs. 0.98 difference; it seems a bit too large.

rasmusbergpalm commented 5 years ago

Ah, yes. Ok. Then the high probability makes sense.

I don't understand how sklearn computes the variance. I think they use the method suggested here: http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf We use http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.7503&rep=rep1&type=pdf
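As a rough illustration of how a batch-wise variance update can differ in form from a one-sample-at-a-time update, here is a sketch of the standard formula for merging the (count, mean, sum of squared deviations) summaries of two batches. This is an assumption about the general family of methods being discussed, not a copy of sklearn's or blayze's code; the function names and the data are hypothetical.

```python
def batch_stats(xs):
    """Two-pass summary of one batch: (count, mean, sum of squared deviations)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return n, mean, m2


def merge(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Combine the summaries of two batches without revisiting the raw data."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the gap between the two batch means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


first = [-102.0, -100.0]          # hypothetical first batch
second = [-98.0, -97.0, -103.0]   # hypothetical second batch

merged = merge(*batch_stats(first), *batch_stats(second))
direct = batch_stats(first + second)
```

In exact arithmetic merged and direct are identical. In floating point they can differ in the last few digits, which is the kind of discrepancy that matters when the query point sits far out in the tails.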

I think we're down to numerical rounding issues. Both of the Gaussian distributions are very, very close to 0 probability at 0 degrees, so tiny differences in the variance start to matter here. I encourage you to compute the exact variances and see which method is closer.
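The sensitivity can be sketched with a two-class Gaussian posterior evaluated far from both means. The means, variances and priors below are hypothetical and chosen only to show the effect; they are not the values from the PR's test.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def posterior(x, params, priors):
    """Class posteriors for a single Gaussian feature, computed in log space."""
    log_joint = {
        c: math.log(priors[c]) + gaussian_log_pdf(x, mean, var)
        for c, (mean, var) in params.items()
    }
    top = max(log_joint.values())  # subtract the max to avoid underflow in exp
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


priors = {"sweater": 0.5, "t-shirt": 0.5}

# Identical variances: the query point is equally unlikely under both classes.
equal = posterior(0.0, {"sweater": (-100.0, 100.0), "t-shirt": (100.0, 100.0)}, priors)

# Slightly larger variance for one class (hypothetical numbers).
skewed = posterior(0.0, {"sweater": (-100.0, 110.0), "t-shirt": (100.0, 100.0)}, priors)
```

With equal variances the posterior is 50/50. Once one class has a somewhat larger variance, its tail density at 0 degrees dominates and the posterior for that class rises above 0.95, even though both densities are vanishingly small in absolute terms.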

rasmusbergpalm commented 5 years ago

Since we've upgraded to Bayesian Naive Bayes and sklearn uses Maximum Likelihood Naive Bayes, they should no longer give the same results. Closing.