Closed liufuyang closed 5 years ago
I don't know why sklearn gives those probabilities, but just eyeballing them, I think they look off. Why should it be 98% sure of "sweater" for 0 degrees, when there's equal evidence for sweater around -100 and for t-shirt around +100? I'd think it should be 50/50, which I'm guessing is also what Blayze says. Maybe the partial_fit isn't doing what you think it is.
The big difference is introduced by a slight difference in variance! Note the 98.0 vs. 98.2 difference!
our model output is {t-shirt=0.04812839716239359, sweater=0.9518716028376064}
I am wondering why this 0.95 vs. 0.98 difference exists. It seems a bit too large.
Ah, yes. Ok. Then the high probability makes sense.
I don't understand how sklearn computes the variance. I think they use the method suggested here: http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf. We use http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.7503&rep=rep1&type=pdf
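For reference, here is a minimal sketch of the batch-combining update from the Stanford report (Chan, Golub and LeVeque), which merges the count, mean, and sum of squared deviations of two batches. The function names and the sample temperatures are made up for illustration, and this is not copied from sklearn's source, so treat it as an approximation of the idea rather than their actual code.

```python
import statistics


def batch_stats(xs):
    """Return (count, mean, sum of squared deviations) for one batch."""
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs)
    return len(xs), mean, m2


def merge_stats(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Combine the statistics of two batches without revisiting the data."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the gap between the two batch means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Hypothetical "sweater" temperatures arriving in two partial_fit-style batches.
first = [-101.0, -100.0, -99.0]
second = [-102.0, -98.0]

n, mean, m2 = merge_stats(*batch_stats(first), *batch_stats(second))
merged_variance = m2 / n  # population variance (divide by n - 1 for sample variance)
print(mean, merged_variance)
```

Comparing merged_variance against statistics.pvariance(first + second) is a quick way to check that the merged result agrees with a single pass over all the data.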
I think we're down to numerical rounding issues. Both of the gaussian distributions are very very close to 0 probability at 0 degrees. Tiny differences in the variance will start playing in here. I encourage you to compute the exact variances and see which method is closer.
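To illustrate how sensitive the posterior is this far out in the tails, here is a small sketch assuming equal class priors and a single Gaussian feature. The means and variances below are invented for illustration and are not the values from the test, so the printed numbers will not match the 0.95 or 0.98 above.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def posterior(x, params):
    """Class posteriors under equal priors; params maps class -> (mean, variance)."""
    log_liks = {k: gaussian_log_pdf(x, m, v) for k, (m, v) in params.items()}
    top = max(log_liks.values())  # subtract the max so exp() does not underflow
    weights = {k: math.exp(v - top) for k, v in log_liks.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical parameters: identical variances give an exact tie at 0 degrees.
equal = posterior(0.0, {"sweater": (-100.0, 10.0), "t-shirt": (100.0, 10.0)})

# A 2% larger variance for "sweater" is enough to tip the result almost entirely.
skewed = posterior(0.0, {"sweater": (-100.0, 10.2), "t-shirt": (100.0, 10.0)})

print(equal)
print(skewed)
```

Both log-densities are hugely negative at 0, so the posterior depends only on their difference, and that difference grows with the squared distance from the means divided by the variance. That is why a tiny change in variance can move the result a long way.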
Since we've upgraded to Bayesian Naive Bayes and sklearn uses Maximum Likelihood Naive Bayes, they should no longer give the same results. Closing.
I seem to have found a few issues, though I don't know whether they really matter. I will create two issues and keep the discussion separate on each issue page.
Please take a look at the first test, test_gaussian_by_comparing_scikit_learn_output, added on PR: https://github.com/Tradeshift/blayze/pull/16
Why does our Gaussian output look different from the scikit-learn output?
I also can't understand the code in the Gaussian feature for calculating its log prob. I am looking at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.7503&rep=rep1&type=pdf and also http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf but can't see how our delta, delta2, and m2 match the equations in those papers.
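For what it's worth, the names delta, delta2, and m2 match the usual textbook form of the one-sample-at-a-time (Welford-style) variance update. The sketch below shows that form. I am assuming, not confirming, that the Blayze code follows the same pattern; the function name and the sample values are made up for illustration.

```python
def welford_update(count, mean, m2, x):
    """Fold one new observation x into the running (count, mean, m2)."""
    count += 1
    delta = x - mean        # deviation from the mean before the update
    mean += delta / count
    delta2 = x - mean       # deviation from the mean after the update
    m2 += delta * delta2    # running sum of squared deviations from the mean
    return count, mean, m2


count, mean, m2 = 0, 0.0, 0.0
for x in [-101.0, -100.0, -99.0, -102.0, -98.0]:  # hypothetical temperatures
    count, mean, m2 = welford_update(count, mean, m2, x)

variance = m2 / count  # population variance; use m2 / (count - 1) for sample variance
print(mean, variance)
```

Here m2 is the running sum of squared deviations, so the variance only appears when m2 is divided by the count at the end. The product of the "before" and "after" deviations is what lets the sum be updated without storing earlier samples.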