datumbox / datumbox-framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
http://www.datumbox.com/
Apache License 2.0

WordSequenceExtractor can not work with MultinomialNaiveBayes Training #27

Closed: jltchiu closed this issue 6 years ago

jltchiu commented 6 years ago

I have a classifier that works with NgramsExtractor and MultinomialNaiveBayes. However, when I change the text extractor to WordSequenceExtractor, training fails at the fitting stage with the error below (the same happens with UniqueWordSequenceExtractor):

6819 [main] INFO com.datumbox.framework.core.machinelearning.classification.MultinomialNaiveBayes - fit()
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.ClassCastException
    at com.datumbox.framework.common.concurrency.ThreadMethods.forkJoinExecution(ThreadMethods.java:116)
    at com.datumbox.framework.common.concurrency.ForkJoinStream.forEach(ForkJoinStream.java:56)
    at com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes._fit(AbstractNaiveBayes.java:278)
    at com.datumbox.framework.core.machinelearning.common.abstracts.AbstractTrainer.fit(AbstractTrainer.java:125)
    at com.datumbox.framework.core.machinelearning.modelselection.Validator.validate(Validator.java:67)
    at com.avrio.AVcgclassifier.Classification.main(Classification.java:131)
Caused by: java.util.concurrent.ExecutionException: java.lang.ClassCastException
    at java.base/java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:996)
    at com.datumbox.framework.common.concurrency.ThreadMethods.forkJoinExecution(ThreadMethods.java:112)
    ... 5 more
Caused by: java.lang.ClassCastException
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:590)
    ... 7 more
Caused by: java.lang.ClassCastException
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:590)
    at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:668)
    at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:726)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:430)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:594)
    at com.datumbox.framework.common.concurrency.ForkJoinStream.lambda$forEach$0(ForkJoinStream.java:55)
    at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1393)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:283)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1603)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.lang.ClassCastException: java.base/java.lang.String cannot be cast to java.base/java.lang.Number
    at com.datumbox.framework.common.dataobjects.TypeInference.toDouble(TypeInference.java:163)
    at com.datumbox.framework.core.machinelearning.common.abstracts.algorithms.AbstractNaiveBayes.lambda$_fit$1(AbstractNaiveBayes.java:284)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
    at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:747)
    ... 3 more

I assume these extractors produce their output in a different format, and that is what causes this issue?

datumbox commented 6 years ago

The Naive Bayes model requires bag-of-words extractors, not sequence-based ones. Other models, such as LDA, require sequences. Which extractor you need depends on the type of model you are using.
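
To make the difference concrete, here is a minimal sketch in plain Java. It does not use the Datumbox API: the class `FeatureShapes` and all of its methods are hypothetical names made up for illustration. It assumes that a bag-of-words extractor yields token -> count (numeric values), while a sequence extractor yields position -> token (String values). That assumption is consistent with the last cause in the stack trace above, where `TypeInference.toDouble` fails because a String cannot be cast to a Number.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Datumbox.
class FeatureShapes {

    // Bag-of-words shape: token -> number of occurrences (numeric values).
    static Map<String, Double> bagOfWords(String text) {
        Map<String, Double> counts = new LinkedHashMap<>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1.0, Double::sum);
            }
        }
        return counts;
    }

    // Sequence shape (assumed): position -> token (String values).
    static Map<Integer, String> wordSequence(String text) {
        Map<Integer, String> sequence = new LinkedHashMap<>();
        int position = 0;
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                sequence.put(position++, token);
            }
        }
        return sequence;
    }

    // Mimics a numeric conversion that expects every feature value to be a Number.
    static double toDouble(Object value) {
        return ((Number) value).doubleValue();
    }

    // Stand-in for a count-based trainer: sums all feature values.
    static double totalCount(Map<?, ?> features) {
        double sum = 0.0;
        for (Object value : features.values()) {
            sum += toDouble(value);
        }
        return sum;
    }

    // Collapses a sequence into counts, discarding word order.
    static Map<String, Double> sequenceToBag(Map<Integer, String> sequence) {
        Map<String, Double> counts = new LinkedHashMap<>();
        for (String token : sequence.values()) {
            counts.merge(token, 1.0, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the cat saw the dog";

        // Numeric values: the count-based step works.
        System.out.println(totalCount(bagOfWords(text)));

        // String values: the same step fails with a ClassCastException.
        try {
            totalCount(wordSequence(text));
        } catch (ClassCastException e) {
            System.out.println("Sequence features rejected: " + e.getMessage());
        }

        // Converting the sequence to counts makes it usable again.
        System.out.println(totalCount(sequenceToBag(wordSequence(text))));
    }
}
```

In this sketch the count-based step only works when every feature value is numeric. If word order is not needed, the practical fix is the one described above: keep using a bag-of-words extractor such as NgramsExtractor with Naive Bayes.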