NaturalNode / apparatus


Inconsistent Probability Results #10

Closed: mamaral closed this issue 7 years ago

mamaral commented 9 years ago

I'm working on reverse engineering the Bayes classifier algorithm to better understand how it works under the covers, and I'm seeing what appear to be inconsistencies between the results of the probabilityOfClass and classify functions. I have a hunch it may be related to https://github.com/NaturalNode/apparatus/issues/7, but I'm not sure. Here are some examples showing what I've been seeing.

var bayes = new BayesClassifier();
bayes.addExample([1,1,1,0,0,0], 'one');
bayes.addExample([1,0,1,0,0,0], 'one');
bayes.addExample([0,0,0,1,1,0], 'two');
bayes.addExample([0,0,0,1,1,0], 'two');
bayes.addExample([0,0,0,1,1,1], 'two');

bayes.train();

console.log(bayes.classify([1,1,1,0,0,0]));

The above code outputs one, as I would expect, with the following values:

 { label: 'one', value: 0.3333333333333333 },
 { label: 'two', value: 0.010416666666666671 }

var bayes = new BayesClassifier();
bayes.addExample([1,1,1,0,0,0], 'one');
bayes.addExample([1,0,1,0,0,0], 'one');
bayes.addExample([0,0,0,1,1,0], 'two');
bayes.addExample([0,0,0,1,1,0], 'two');
bayes.addExample([0,0,0,1,1,1], 'two');

bayes.train();

console.log(bayes.classify([0,0,0,1,1,1]));

The above code outputs two, as I would expect, with the following values:

 { label: 'two', value: 0.3333333333333333 },
 { label: 'one', value: 0.018518518518518517 }

var bayes = new BayesClassifier();
bayes.addExample([1,1,1,0,0,0], 'one');
bayes.addExample([1,0,1,0,0,0], 'one');
bayes.addExample([0,0,0,1,1,0], 'two');
bayes.addExample([0,0,0,1,1,0], 'two');
bayes.addExample([0,0,0,1,1,1], 'two');

bayes.train();

console.log(bayes.classify([1,1,1,1,1,1]));

The above code outputs one, which is _not_ what I would expect. The probabilityOfClass function assigns the following values for each class:

 { label: 'one', value: 0.012345679012345675 },
 { label: 'two', value: 0.005208333333333334 }

My expectation was that, given an array of "observations" in which both classes are represented equally, those observations would be a better "match" for the class with more closely related examples. In other words, [1,1,1,1,1,1] has the same number of "perfect" matches in class one as in class two, but it has more "partial" matches in class two, so why would it be a better "fit" for class one? Perhaps what we need is some sort of prior probability here?
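A rough sketch of what I mean by counting matches (a hypothetical helper written against the training rows above, not anything in the library):

var trainingExamples = {
  one: [[1,1,1,0,0,0], [1,0,1,0,0,0]],
  two: [[0,0,0,1,1,0], [0,0,0,1,1,0], [0,0,0,1,1,1]]
};

// Count how many 1s in the observation line up with 1s in a class's training examples.
function overlap(observation, label) {
  var total = 0;
  trainingExamples[label].forEach(function (example) {
    for (var i = 0; i < example.length; i++) {
      if (example[i] && observation[i]) {
        total++;
      }
    }
  });
  return total;
}

console.log(overlap([1,1,1,1,1,1], 'one')); // 5
console.log(overlap([1,1,1,1,1,1], 'two')); // 7

By that crude count, [1,1,1,1,1,1] overlaps class two more than class one (7 vs. 5), which is where my intuition comes from.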

Any clarification, especially if my understanding is flawed (which is likely), would be fantastic.

DrDub commented 7 years ago

Hi,

I reproduced your execution traces and results. The classifier state after training is:

{ classFeatures: { one: { '0': 3, '1': 2, '2': 3 }, two: { '3': 4, '4': 4, '5': 2 } }, classTotals: { one: 3, two: 4 }, totalExamples: 6, smoothing: 1 }
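That dump is just the trained classifier's own properties, so something along these lines prints it (plain object inspection, not a library-specific call):

// Serialize the trained classifier to see classFeatures, classTotals,
// totalExamples and smoothing.
console.log(JSON.stringify(bayes, null, 2));

Note that the counts appear to already include the smoothing term: "one" has 2 training examples but classTotals.one is 3, and "two" has 3 but classTotals.two is 4, which is why the denominators below are 3 and 4.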

Plugging those counts in for the observation [1,1,1,1,1,1] gives us:

"one": 3/3 x 2/3 x 3/3 x 1/3 x 1/3 x 1/3 = 1 x 0.66 x 1 x 0.33 x 0.33 x 0.33 = 0.023718420000000004
"two": 1/4 x 1/4 x 1/4 x 4/4 x 4/4 x 2/4 = 0.25 x 0.25 x 0.25 x 1 x 1 x 0.5 = 0.0078125

The probability is better for "one" here, but the priors favor "two", so the final numbers end up closer together:

prior("one") = 3/6 = 0.5 prior("two") = 4/6 = 0.66

final "one" = 0.023718420000000004 * 0.5 = 0.011859210000000002 <<- rounding error? final "two" = 0.005208333333333333

Now, I understand your thought process, but adding the extra training instance for "two" means a single 1 in column 6 for "two" is worth 1/3 (1/4 smoothed), while a single 1 in column 2 for "one" is worth 1/2 (1/3 smoothed).

Working without smoothing and ignoring the other columns, you then get:

one = 2/2 x 1/2 x 2/2 = 0.5
two = 3/3 x 3/3 x 1/3 = 0.33

(one is more likely)

and then adding the priors:

one: 0.5 x 0.4 = 0.2
two: 0.33 x 0.66 = 0.2178

two wins by very little.

This is of course not correct, because without any smoothing both probabilities are zero (the columns a class has never seen contribute zero factors)! As soon as you add any smoothing, that little difference in favor of two disappears.
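Concretely, redoing the same restricted comparison with the smoothed counts from the trace:

one = 3/3 x 2/3 x 3/3 = 0.66, times prior 0.5 = 0.33
two = 4/4 x 4/4 x 2/4 = 0.5, times prior 0.66 = 0.33

The restricted scores tie, and the remaining smoothed columns are worth (1/3)^3 for "one" versus (1/4)^3 for "two", which is what tips the full product toward "one".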

If you still feel this is an error, feel free to re-open this bug.

mamaral commented 7 years ago

I haven't used or looked at this for a few years now, so it's all completely gone from my brain. I have a feeling I was mistaken at the time with this case anyway, but thank you for getting back to me. :)