cbluth / aima-python

Automatically exported from code.google.com/p/aima-python
0 stars 0 forks source link

Errors in NearestNeighborLearner #21

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
Tried to use the learning.NearestNeighborLearner on the Sex Classification 
dataset from this Wikipedia article on Naive Bayes classifiers: 
http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_Classification

What is the expected output? What do you see instead?
Program wouldn't run due to bugs in the implementation of NNLearner

What version of the product are you using?
Bug exists in r30

Please provide any additional information below.
Here's my sample code:

import learning

examples = 
[[6,180,12,'male'],[5.92,190,11,'male'],[5.58,170,12,'male'],[5,100,6,'female'],
[5.5,150,8,'female'],[5.42,130,7,'female'],[5.75,150,9,'female']]

ds = learning.DataSet(examples)
nnl = learning.NearestNeighborLearner(2)
nnl.train(ds)
print nnl.predict([5.1,105,6.3])

And I would expect it to print 'female'.

I believe the following fixes should work:
old learning.py, lines 217 - 231

        else:
            ## Maintain a sorted list of (distance, example) pairs.
            ## For very large k, a PriorityQueue would be better
            best = [] 
            for e in examples:
                d = self.distance(e, example)
                if len(best) < k: 
                    e.append((d, e))
                elif d < best[-1][0]:
                    best[-1] = (d, e)
                    best.sort()
            return mode([e[self.dataset.target] for (d, e) in best])

    def distance(self, e1, e2):
        return mean_boolean_error(e1, e2)

new learning.py:

        else:
            ## Maintain a sorted list of (distance, example) pairs.
            ## For very large k, a PriorityQueue would be better
            best = [] 
            for e in self.dataset.examples:
                d = self.distance(e, example)
                if len(best) < self.k: 
                    best.append((d, e))
                elif d < best[-1][0]:
                    best[-1] = (d, e)
                    best.sort()
            return mode([e[self.dataset.target] for (d, e) in best])

    def distance(self, e1, e2):
        return mean_error(e1, e2)

Specifically:
1) changed 'examples' to self.dataset.examples. 
2) changed e.append((d,e)) to best.append((d, e))
3) and I could be wrong, but I believe you wanted mean_error, not 
mean_boolean_error in your distance function.

For the gender classification example, it seems to work great. Thanks!

Original issue reported on code.google.com by tblana...@gmail.com on 19 Oct 2010 at 5:21

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
Whether to use mean_error or mean_boolean_error depends on the dataset, I 
believe, and the code needs some larger change to be able to do either 
generically. For now, mean_boolean_error at least never blows up. The rest of 
this looks right -- thanks!

Original comment by wit...@gmail.com on 15 Sep 2011 at 2:40

GoogleCodeExporter commented 8 years ago
This issue was closed by revision r71.

Original comment by wit...@gmail.com on 15 Sep 2011 at 2:41