Open asfimport opened 16 years ago
Otis Gospodnetic (@otisg) (migrated from JIRA)
Skimmed this very quickly - looks nice and clean to me! Why is this not in contrib yet? I didn't spot any dependencies....are there any?
Karl Wettin (migrated from JIRA)
Otis Gospodnetic - 03/Dec/07 11:22 PM
> Skimmed this very quickly - looks nice and clean to me!
> Why is this not in contrib yet? I didn't spot any dependencies....are there any?
No dependencies, although I get a 5x-10x faster classifier using #1628 when trained with 15,000 small instances (documents).
One reason that this is not in contrib might be that it is based on an O'Reilly book. That book contains an example implementation in Python, but my code does not have much in common with it, except for the Greek kung fu found by a British priest 250 years ago.
IANAL, but according to what I've read in the preface there are no problems releasing this with ASL.
Talk to permissions@oreilly.com if you really want to make sure. I can supply you with the Python code example if you want to compare. The book is however worth the $40 if you want to understand what's going on in there.
Paul Elschot (migrated from JIRA)
Did you consider using Lucene's termvectors? Some of the feature extractions would be easier to do with termvectors, especially when the index contains many more docs than the ones on which the classifier is built. Classifying a document from its termvector is also quite natural.
Karl Wettin (migrated from JIRA)
> Did you consider using Lucene's termvectors? Some of the feature extractions would be easier to do with termvectors,
Not sure what you mean, they are already used when extracting features? Or do you speak of using the term vectors as training instance data when classifying? Bayesian classification can rely on class feature frequency alone.
> especially when the index contains many more docs than the ones on which the classifier is built.
The more documents in the index that are not used for classification, the more skewed the classification results will be, as Pr(feature|class) is based on docFreq and numDocs in this implementation.
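As a rough illustration of that skew, here is a minimal sketch in plain Java with no Lucene dependency. It assumes a weighted-probability estimate of the kind described in "Programming Collective Intelligence", with the index-wide document frequency of the feature standing in for the evidence count. The class name, method names, the weight of 1.0 and the assumed prior of 0.5 are illustrative choices and are not taken from the attached patch.

```java
final class WeightedProbabilitySketch {

    // Pr(feature | class) estimated from document counts alone:
    // documents of the class that contain the feature / documents of the class.
    static double basicProbability(int classFeatureDocs, int classDocs) {
        return classDocs == 0 ? 0.0 : (double) classFeatureDocs / classDocs;
    }

    // Blends the basic estimate with an assumed prior. featureDocFreq plays the
    // role of docFreq: how many documents in the whole index contain the feature.
    static double weightedProbability(double basic, int featureDocFreq,
                                      double weight, double assumedProbability) {
        return (weight * assumedProbability + featureDocFreq * basic)
                / (weight + featureDocFreq);
    }

    public static void main(String[] args) {
        double basic = basicProbability(3, 10);

        // Index holds only training documents: the feature occurs in 4 of them.
        double trainingOnly = weightedProbability(basic, 4, 1.0, 0.5);   // roughly 0.34

        // Same training data, but 36 unlabelled documents also contain the feature.
        double withUnlabelled = weightedProbability(basic, 40, 1.0, 0.5); // roughly 0.305

        System.out.println("training docs only:  " + trainingOnly);
        System.out.println("with unlabelled docs: " + withUnlabelled);
    }
}
```

Under these assumptions the class evidence is identical in both calls, yet the estimate moves because documents that carry no class at all are counted as evidence for the feature.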
Paul Elschot (migrated from JIRA)
I'll have a more thorough look at the code, but do I understand correctly that it is using a lucene index per class?
I'm just now building a Bayesian classifier using a single index with a field for the features (text terms) and a field for the classes. The feature field also has termvectors, and these make the implementation for training and classifying quite straightforward, after using some queries on the class field to get the doc ids for each class. Also, termvectors allow both a boolean and a strength implementation for the features. The strength is based on the frequency info in the term vectors that have the term frequency within a doc.
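The difference between a boolean and a strength feature can be sketched in a few lines of plain Java. This is only an illustration: the map below stands in for the per-document term frequencies that a term vector reports, the names are hypothetical, and using the raw frequency as the strength is an assumption rather than a description of the classifier mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

final class FeatureWeightSketch {

    // Counts how often each token occurs in one document, which is the
    // information a term vector holds for that document.
    static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : tokens) {
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }

    // Boolean feature: 1 if the term occurs in the document at all, else 0.
    static int booleanWeight(Map<String, Integer> frequencies, String term) {
        return frequencies.containsKey(term) ? 1 : 0;
    }

    // Strength feature: here simply the within-document term frequency.
    static int strengthWeight(Map<String, Integer> frequencies, String term) {
        return frequencies.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        String[] tokens = {"cheap", "pills", "cheap", "offer", "cheap"};
        Map<String, Integer> frequencies = termFrequencies(tokens);
        System.out.println("boolean(cheap)  = " + booleanWeight(frequencies, "cheap"));
        System.out.println("strength(cheap) = " + strengthWeight(frequencies, "cheap"));
        System.out.println("boolean(viagra) = " + booleanWeight(frequencies, "viagra"));
    }
}
```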
Karl Wettin (migrated from JIRA)
> do I understand correctly that it is using a lucene index per class?
One index per classifier. Each classifier can contain multiple classes. In the test case the field "class" is used to keep track of classes. Each document must only contain one token in the class field. Features can be stored in any number of fields.
Cuong Hoang (migrated from JIRA)
>>Each document must only contain one token in the class field
Does that mean each document in the training set can only belong to one class?
I tried to run the test case but got a NullPointerException:
TestClassifier
org.apache.lucene.classifier.TestClassifier
test(org.apache.lucene.classifier.TestClassifier)
java.lang.NullPointerException
    at org.apache.lucene.index.MultiTermDocs.doc(MultiReader.java:356)
    at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)
    at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:137)
    at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
    at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:72)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:70)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:62)
    at org.apache.lucene.classifier.TestClassifier.testClassifier(TestClassifier.java:110)
    at org.apache.lucene.classifier.TestClassifier.test(TestClassifier.java:101)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Karl Wettin (migrated from JIRA)
Cuong Hoang - 03/Apr/08 06:28 PM
>>Each document must only contain one token in the class field
>Does that mean each document in the training set can only belong to one class?
You can have multiple class fields, but you can only classify an instance to one class at a time. Currently the class field and the classes buffer are set in Instances. I think it should be possible to move that code to NaiveBayesClassifier to allow classification on multiple classes on the same Instances.
Instances.java:
private String classField;
private String[] classes;
>I try to run the test case but get NullPointerException:
> at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)
The tests pass here, did you perhaps alter the content in some way?
In BayesianClassifier.java, add the following on row 92:
classDocs.seek(new Term(instances.getClassField(), _class));
+ classDocs.next();
while (featureDocs.next()) {
Does that help?
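For readers without the patch at hand, the cursor contract behind that one-line fix can be sketched in plain Java. In the Lucene API of that era, TermDocs.doc() is documented as invalid until next() has been called for the first time, which is consistent with the NullPointerException in the stack trace. The Cursor class below is a hypothetical stand-in for TermDocs, and countShared is only a guess at the kind of work classFeatureFrequency does; neither is copied from the attached code.

```java
final class PostingIntersectionSketch {

    // Hypothetical stand-in for a postings cursor over ascending document ids.
    // As with TermDocs, doc() is only valid after next() has returned true.
    static final class Cursor {
        private final int[] docs;
        private int position = -1;

        Cursor(int... docs) {
            this.docs = docs;
        }

        boolean next() {
            position++;
            return position < docs.length;
        }

        int doc() {
            if (position < 0 || position >= docs.length) {
                throw new IllegalStateException("cursor is not positioned on a document");
            }
            return docs[position];
        }
    }

    // Counts the documents that appear in both cursors, i.e. documents of a
    // class that also contain a given feature.
    static int countShared(Cursor classDocs, Cursor featureDocs) {
        int count = 0;
        boolean hasClass = classDocs.next();     // position both cursors first
        boolean hasFeature = featureDocs.next();
        while (hasClass && hasFeature) {
            int classDoc = classDocs.doc();
            int featureDoc = featureDocs.doc();
            if (classDoc == featureDoc) {
                count++;
                hasClass = classDocs.next();
                hasFeature = featureDocs.next();
            } else if (classDoc < featureDoc) {
                hasClass = classDocs.next();
            } else {
                hasFeature = featureDocs.next();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int shared = countShared(new Cursor(1, 3, 5, 8), new Cursor(2, 3, 8, 9));
        System.out.println("shared documents: " + shared);
    }
}
```

Calling doc() on a fresh Cursor throws immediately, which mirrors what happens when a freshly sought cursor is read before next() has advanced it.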
Karl Wettin (migrated from JIRA)
I'm closing this issue due to uncertainty about intellectual property rights, pending an answer from Toby. I've tried to contact him several times via numerous media without response :(
Toby Segaran (migrated from JIRA)
I'm the author of "Programming Collective Intelligence". I see no issue with property rights, the algorithm itself is widely known and my book just explains it. The code Karl wrote is completely original.
Karl Wettin (migrated from JIRA)
What do you people think, should I commit this to Lucene or Mahout?
Vaijanath N. Rao (migrated from JIRA)
Hi Karl,
Can you tell me how to use this with FSDirectory rather than RAMDirectory? I am getting the following error:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.doc(MultiSegmentReader.java:552)
    at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:94)
    at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:139)
    at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
    at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:71)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:72)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:64)
This happens when I try to use FSDirectory. I created the instance index as per the test sample and closed it. Now, while doing a classification, I get the above error.
The way I create the directory is:
FSDirectory dir = FSDirectory.getDirectory(new File(indexPath));
IndexWriter iw = new IndexWriter(dir,instanceFactory.getAnalyzer(),create, MaxFieldLength.LIMITED);
iw.close();
The code for adding the instance is:
instances.addInstance(record.getText(), record.getClass());
instance.flush() and instance.close() all go fine.
While doing classification I open the directory again (with just create set to false) and the rest of the calls remain the same:
Instances instances = new Instances(dir, indexCreator.instanceFactory, "class");
classifier = new NaiveBayesClassifier();
return classifier.classify(instances, text)[0].getClassification();
Can you help me by pointing out what I am doing wrong?
--Thanks and Regards Vaijanath N. Rao
Karl Wettin (migrated from JIRA)
Vaijanath,
can you please post a small test case that demonstrates the problem?
Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and Fisher method algorithms as described by Toby Segaran in "Programming Collective Intelligence", ISBN 978-0-596-52932-1.
Have fun.
Poor java docs, but the TestCase shows how to use it:
Migrated from LUCENE-1039 by Karl Wettin, 1 vote, updated May 04 2010 Attachments: LUCENE-1039.txt