Bayesian classifiers using Lucene as data store [LUCENE-1039]

asfimport commented 16 years ago

Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and Fisher method algorithms as described by Toby Segaran in "Programming Collective Intelligence", ISBN 978-0-596-52932-1.
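For readers without the book at hand, here is a minimal sketch of the combining step in Fisher's method as the book describes it. It is not taken from the attached patch. The class and method names are hypothetical, and the per-feature probabilities are assumed to be computed elsewhere.

```java
// Sketch only: combines per-feature probabilities with Fisher's method.
// FisherSketch, inverseChiSquare and fisherScore are hypothetical names.
final class FisherSketch {

    // Inverse chi-square function for an even number of degrees of freedom.
    static double inverseChiSquare(double chi, int degreesOfFreedom) {
        double m = chi / 2.0;
        double term = Math.exp(-m);
        double sum = term;
        for (int i = 1; i < degreesOfFreedom / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // Fisher's method: -2 * sum(ln p) is fed to the inverse chi-square
    // function with 2 * n degrees of freedom, n being the feature count.
    static double fisherScore(double[] probabilities) {
        double logSum = 0.0;
        for (double p : probabilities) {
            logSum += Math.log(p);
        }
        return inverseChiSquare(-2.0 * logSum, 2 * probabilities.length);
    }

    public static void main(String[] args) {
        System.out.println(fisherScore(new double[] {0.9, 0.9}));
        System.out.println(fisherScore(new double[] {0.1, 0.1}));
    }
}
```

A set of high per-feature probabilities gives a score near 1 and a set of low ones gives a score near 0. A single feature passes through unchanged.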

Have fun.

Poor Javadocs, but the TestCase shows how to use it:

public class TestClassifier extends TestCase {

  public void test() throws Exception {

    InstanceFactory instanceFactory = new InstanceFactory() {

      public Document factory(String text, String _class) {
        Document doc = new Document();
        doc.add(new Field("class", _class, Field.Store.YES, Field.Index.NO_NORMS));

        doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, Field.TermVector.NO));

        doc.add(new Field("text/ngrams/start", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        doc.add(new Field("text/ngrams/end", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        return doc;
      }

      Analyzer analyzer = new Analyzer() {
        private int minGram = 2;
        private int maxGram = 3;

        public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream ts = new StandardTokenizer(reader);
          ts = new LowerCaseFilter(ts);
          if (fieldName.endsWith("/ngrams/start")) {
            ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
          } else if (fieldName.endsWith("/ngrams/inner")) {
            ts = new NGramTokenFilter(ts, minGram, maxGram);
          } else if (fieldName.endsWith("/ngrams/end")) {
            ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, minGram, maxGram);
          }
          return ts;
        }
      };

      public Analyzer getAnalyzer() {
        return analyzer;
      }
    };

    Directory dir = new RAMDirectory();
    new IndexWriter(dir, null, true).close();

    Instances instances = new Instances(dir, instanceFactory, "class");

    instances.addInstance("hello world", "en");
    instances.addInstance("hallå världen", "sv");

    instances.addInstance("this is london calling", "en");
    instances.addInstance("detta är london som ringer", "sv");

    instances.addInstance("john has a long mustache", "en");
    instances.addInstance("john har en lång mustache", "sv");

    instances.addInstance("all work and no play makes jack a dull boy", "en");
    instances.addInstance("att bara arbeta och aldrig leka gör jack en trist gosse", "sv");

    instances.addInstance("shrimp sandwich", "en");
    instances.addInstance("räksmörgås", "sv");

    instances.addInstance("it's now or never", "en");
    instances.addInstance("det är nu eller aldrig", "sv");

    instances.addInstance("to tie up at a landing-stage", "en");
    instances.addInstance("att angöra en brygga", "sv");

    instances.addInstance("it's now time for the children's television shows", "en");
    instances.addInstance("nu är det dags för barnprogram", "sv");

    instances.flush();

    testClassifier(instances, new NaiveBayesClassifier());
    testClassifier(instances, new FishersMethodClassifier());

    instances.close();
  }

  private void testClassifier(Instances instances, BayesianClassifier classifier) throws IOException {

    assertEquals("sv", classifier.classify(instances, "detta blir ett test")[0].getClassification());
    assertEquals("en", classifier.classify(instances, "this will be a test")[0].getClassification());

    // test training data instances. all ought to match!
    for (int documentNumber = 0; documentNumber < instances.getIndexReader().maxDoc(); documentNumber++) {
      if (!instances.getIndexReader().isDeleted(documentNumber)) {
        Map<Term, Double> features = instances.extractFeatures(instances.getIndexReader(), documentNumber, classifier.isNormalized());
        Document document = instances.getIndexReader().document(documentNumber);
        assertEquals(document.get("class"), classifier.classify(instances, features)[0].getClassification());
      }
    }
  }
}

Migrated from LUCENE-1039 by Karl Wettin, 1 vote, updated May 04 2010 Attachments: LUCENE-1039.txt

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Skimmed this very quickly - looks nice and clean to me! Why is this not in contrib yet? I didn't spot any dependencies....are there any?

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

Otis Gospodnetic - 03/Dec/07 11:22 PM
> Skimmed this very quickly - looks nice and clean to me!
> Why is this not in contrib yet? I didn't spot any dependencies....are there any?

No dependencies, although I get a 5x-10x faster classifier using #1628 when trained with 15,000 small instances (documents).

One reason that this is not in contrib might be that it is based on an O'Reilly book. That book contains an example implementation in Python, but my code does not have much in common with it, except for the Greek kung fu found by a British priest 250 years ago.

IANAL, but according to what I've read in the preface there are no problems releasing this with ASL.

Talk to permissions@oreilly.com if you really want to make sure. I can supply you with the Python code example if you want to compare. The book is, however, worth the $40 if you want to understand what's going on in there.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Did you consider using Lucene's term vectors? Some of the feature extractions would be easier to do with term vectors, especially when the index contains many more docs than the ones on which the classifier is built. Classifying a document from its term vector is also quite natural.

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

> Did you consider using lucene's termvectors? Some of the feature extractions would be easier to do with termvectors,

Not sure what you mean; they are already used when extracting features. Or do you mean using the term vectors as training instance data when classifying? Bayesian classification can rely on class feature frequency alone.

> especially when the index contains many more docs than the ones on which the classifier is built.

The more documents in the index that are not used for classification, the more skewed the classification results will be, as Pr(feature|class) is based on docFreq and numDocs in this implementation.
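A rough illustration of that skew, using the weighted probability from the book rather than the actual code in the patch. It assumes the feature's total count is taken from an index-wide docFreq, so that documents without a class still add to it. The class name, method name, and all counts are made up for the example.

```java
// Sketch only: shows how index-wide counts can shift a weighted
// Pr(feature|class). Names and numbers are hypothetical.
final class SkewSketch {

    // Book-style weighted probability: blends an assumed probability of 0.5
    // (with weight 1) with the observed ratio, weighted by how often the
    // feature has been seen in total.
    static double weightedProbability(int classFeatureDocs, int classDocs, int featureDocFreq) {
        double basic = classDocs == 0 ? 0.0 : (double) classFeatureDocs / classDocs;
        double weight = 1.0;
        double assumed = 0.5;
        return (weight * assumed + featureDocFreq * basic) / (weight + featureDocFreq);
    }

    public static void main(String[] args) {
        // Feature seen in 1 of 8 documents of the class, and nowhere else.
        System.out.println(weightedProbability(1, 8, 1));
        // Same class statistics, but 99 unclassified documents also contain the feature.
        System.out.println(weightedProbability(1, 8, 100));
    }
}
```

With these made-up numbers, the first call is pulled toward the assumed 0.5 while the second is dominated by the inflated document frequency, although nothing about the class itself has changed.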

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

I'll have a more thorough look at the code, but do I understand correctly that it is using a lucene index per class?

I'm just now building a Bayesian classifier using a single index with a field for the features (text terms) and a field for the classes. The feature field also has termvectors, and these make the implementation for training and classifying quite straightforward, after using some queries on the class field to get the doc ids for each class. Also, termvectors allow both a boolean and a strength implementation for the features. The strength is based on the frequency info in the term vectors that have the term frequency within a doc.
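The difference between the boolean and the strength variant can be sketched without any Lucene calls, starting from a term-to-frequency map such as one read from a term vector. This is only an illustration. The names are hypothetical, and normalizing the strength by the total term count is an assumption, not necessarily what either implementation does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: two ways of turning per-document term frequencies into features.
final class FeatureWeighting {

    // Boolean variant: a term is either present (1.0) or not in the map at all.
    static Map<String, Double> booleanFeatures(Map<String, Integer> termFreqs) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String term : termFreqs.keySet()) {
            features.put(term, 1.0);
        }
        return features;
    }

    // Strength variant: each term is weighted by its share of all term
    // occurrences in the document (assumed normalization).
    static Map<String, Double> strengthFeatures(Map<String, Integer> termFreqs) {
        int total = 0;
        for (int freq : termFreqs.values()) {
            total += freq;
        }
        Map<String, Double> features = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            features.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = new LinkedHashMap<>();
        termFreqs.put("london", 2);
        termFreqs.put("calling", 1);
        termFreqs.put("is", 1);
        System.out.println(booleanFeatures(termFreqs));
        System.out.println(strengthFeatures(termFreqs));
    }
}
```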

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

> do I understand correctly that it is using a lucene index per class?

One index per classifier. Each classifier can contain multiple classes. In the test case the field "class" is used to keep track of classes. Each document must only contain one token in the class field. Features can be stored in any number of fields.

asfimport commented 16 years ago

Cuong Hoang (migrated from JIRA)

>>Each document must only contain one token in the class field

Does that mean each document in the training set can only belong to one class?

I tried to run the test case but got a NullPointerException:

TestClassifier
org.apache.lucene.classifier.TestClassifier
test(org.apache.lucene.classifier.TestClassifier)
java.lang.NullPointerException
    at org.apache.lucene.index.MultiTermDocs.doc(MultiReader.java:356)
    at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)
    at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:137)
    at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
    at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:72)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:70)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:62)
    at org.apache.lucene.classifier.TestClassifier.testClassifier(TestClassifier.java:110)
    at org.apache.lucene.classifier.TestClassifier.test(TestClassifier.java:101)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

Cuong Hoang - 03/Apr/08 06:28 PM
>>Each document must only contain one token in the class field
>Does that mean each document in the training set can only belong to one class?

You can have multiple class fields, but you can only classify an instance against one class field at a time. Currently the class field and the classes buffer are set in Instances. I think it should be possible to move that code to NaiveBayesClassifier to allow classification on multiple class fields with the same Instances.

Instances.java:

  private String classField;
  private String[] classes;

>I try to run the test case but get NullPointerException:

> at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)

The tests pass here. Did you perhaps alter the content in some way?

In BayesianClassifier.java, add the following at line 92:

    classDocs.seek(new Term(instances.getClassField(), _class));
+    classDocs.next();
    while (featureDocs.next()) {

Does that help?
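To show why the extra next() call matters, here is a self-contained sketch of counting the documents two sorted postings lists have in common. It does not use the Lucene API. Cursor is a hypothetical stand-in whose doc() is only valid after next() has returned true, and the assumption that classFeatureFrequency does an intersection like this is only a guess from the method name.

```java
// Sketch only: intersecting two sorted doc-id lists with cursor-style iteration.
final class PostingIntersection {

    // Hypothetical stand-in for a postings cursor.
    static final class Cursor {
        private final int[] docs;
        private int pos = -1;

        Cursor(int... docs) {
            this.docs = docs;
        }

        boolean next() {
            return ++pos < docs.length;
        }

        int doc() {
            return docs[pos]; // fails if next() has not been called yet
        }
    }

    // Counts doc ids present in both cursors.
    static int countShared(Cursor classDocs, Cursor featureDocs) {
        // Prime the class cursor first; calling doc() before next() is invalid.
        if (!classDocs.next()) {
            return 0;
        }
        int count = 0;
        while (featureDocs.next()) {
            // Advance the class cursor until it catches up with the feature cursor.
            while (classDocs.doc() < featureDocs.doc()) {
                if (!classDocs.next()) {
                    return count;
                }
            }
            if (classDocs.doc() == featureDocs.doc()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countShared(new Cursor(0, 2, 4, 6), new Cursor(1, 2, 3, 6, 7)));
    }
}
```

Without the priming call, the first classDocs.doc() would read from an unpositioned cursor, which is the kind of failure the stack trace above points at.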

asfimport commented 16 years ago

Karl Wettin (migrated from JIRA)

I am closing this issue due to uncertainty about intellectual property rights, pending an answer from Toby. I've tried to contact him several times via numerous channels without response :(

asfimport commented 15 years ago

Toby Segaran (migrated from JIRA)

I'm the author of "Programming Collective Intelligence". I see no issue with property rights; the algorithm itself is widely known and my book just explains it. The code Karl wrote is completely original.

asfimport commented 15 years ago

Karl Wettin (migrated from JIRA)

What do you people think, should I commit this to Lucene or Mahout?

asfimport commented 15 years ago

Vaijanath N. Rao (migrated from JIRA)

Hi Karl,

Can you tell me how to use this with FSDirectory rather than RAMDirectory? I am getting the following error:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.doc(MultiSegmentReader.java:552)
    at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:94)
    at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:139)
    at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
    at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:71)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:72)
    at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:64)

This happens when I try to use FSDirectory. I created the instance index as in the test sample and closed it. Now, when doing a classification, I get the error above.

The way I create the directory is:

        FSDirectory dir = FSDirectory.getDirectory(new File(indexPath));
        IndexWriter iw = new IndexWriter(dir,instanceFactory.getAnalyzer(),create, MaxFieldLength.LIMITED);
        iw.close();

The code for adding an instance is:

        instances.addInstance(record.getText(), record.getClass());

instance.flush() and instance.close() all go fine.

For classification I open the directory again (with create set to false) and the rest of the calls remain the same:

        Instances instances = new Instances(dir, indexCreator.instanceFactory, "class");
        classifier = new NaiveBayesClassifier();
        return classifier.classify(instances, text)[0].getClassification();

Can you help me find out what I am doing wrong?

--Thanks and Regards Vaijanath N. Rao

asfimport commented 15 years ago

Karl Wettin (migrated from JIRA)

Vaijanath,

can you please post a small test case that demonstrates the problem?
