GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

applyTopicModel/MalletLDA: exception when converting feature sequence #76

Closed johann-petrak closed 6 years ago

johann-petrak commented 6 years ago

The exception is:

java.lang.IndexOutOfBoundsException: Index -1 out-of-bounds for length 921
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
    at java.base/java.util.Objects.checkIndex(Objects.java:372)
    at java.base/java.util.ArrayList.get(ArrayList.java:440)
    at cc.mallet.types.Alphabet.lookupObject(Alphabet.java:154)
    at gate.plugin.learningframework.mallet.LFAlphabet.lookupObject(LFAlphabet.java:40)
    at cc.mallet.types.FeatureSequence.toString(FeatureSequence.java:102)
    at java.base/java.lang.String.valueOf(String.java:2788)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:135)
    at gate.plugin.learningframework.engines.EngineMBTopicsLDA.applyTopicModel(EngineMBTopicsLDA.java:196)
    at gate.plugin.learningframework.LF_ApplyTopicModel.process(LF_ApplyTopicModel.java:122)
    at gate.plugin.learningframework.AbstractDocumentProcessor.execute(AbstractDocumentProcessor.java:207)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:172)
    at gate.creole.SerialController.executeImpl(SerialController.java:157)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:225)
    at gate.creole.ConditionalSerialAnalyserController.execute(ConditionalSerialAnalyserController.java:132)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1777)
    at java.base/java.lang.Thread.run(Thread.java:844)

This happens when a feature sequence created from a new document, one that was not in the training set, is converted back to a string. During the conversion an index is looked up in the alphabet via lookupObject(idx), and that index (-1, according to the exception message) is not in the alphabet. So how did it get into the feature sequence in the first place?

johann-petrak commented 6 years ago

It turns out that Mallet's TokenSequence.toFeatureSequence(Alphabet) method adds an entry with index -1 to the feature sequence for every unknown token when the Alphabet is set to not grow. Any code that then converts the FeatureSequence back to a String fails with an IndexOutOfBoundsException.
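To make the failure mode concrete, here is a minimal sketch that reproduces the same pattern with plain Java collections. It does not use Mallet: `SimpleAlphabet`, `toFeatureSequence` and `featuresToString` are hypothetical stand-ins written for illustration, and they only mimic the behaviour described above (a frozen alphabet returns -1 for unknown entries, and that -1 is stored in the sequence).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrozenAlphabetDemo {

    // Hypothetical stand-in for an alphabet: maps entries to indices and back.
    static final class SimpleAlphabet {
        private final Map<String, Integer> indexOf = new HashMap<>();
        private final List<String> entries = new ArrayList<>();
        private boolean growing = true;

        void stopGrowth() {
            growing = false;
        }

        // Returns the index of the entry, or -1 if it is unknown and the
        // alphabet is no longer allowed to grow.
        int lookupIndex(String entry) {
            Integer idx = indexOf.get(entry);
            if (idx != null) {
                return idx;
            }
            if (!growing) {
                return -1;
            }
            indexOf.put(entry, entries.size());
            entries.add(entry);
            return entries.size() - 1;
        }

        // Backed by an ArrayList, so a negative index throws
        // IndexOutOfBoundsException.
        String lookupObject(int index) {
            return entries.get(index);
        }
    }

    // Stores whatever lookupIndex returns, including -1 for unknown tokens.
    static int[] toFeatureSequence(List<String> tokens, SimpleAlphabet alphabet) {
        int[] features = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            features[i] = alphabet.lookupIndex(tokens.get(i));
        }
        return features;
    }

    // Converts indices back to entries; fails as soon as it meets a -1.
    static String featuresToString(int[] features, SimpleAlphabet alphabet) {
        StringBuilder sb = new StringBuilder();
        for (int f : features) {
            sb.append(alphabet.lookupObject(f)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        SimpleAlphabet alphabet = new SimpleAlphabet();
        alphabet.lookupIndex("topic");
        alphabet.lookupIndex("model");
        alphabet.stopGrowth();

        int[] features = toFeatureSequence(
                Arrays.asList("topic", "unseen", "model"), alphabet);
        System.out.println(Arrays.toString(features));
        try {
            featuresToString(features, alphabet);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("conversion failed: " + e.getMessage());
        }
    }
}
```

The sequence itself is built without complaint; the problem only surfaces later, when something tries to map the stored -1 back to an entry.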

johann-petrak commented 6 years ago

Not sure how best to deal with this. Ideally there would be a way to add only the known tokens to the feature sequence. See https://github.com/mimno/Mallet/issues/138

For now I will just construct the feature sequence manually.
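A rough sketch of what constructing the sequence manually could look like, again with plain Java collections rather than Mallet types. The class `KnownTokensOnly`, the method `knownFeatureIndices` and the `vocabulary` map are hypothetical names used for illustration; this is not the code that went into the plugin. With Mallet itself, the analogous steps would presumably be a non-growing lookup such as `Alphabet.lookupIndex(entry, false)` followed by `FeatureSequence.add(int)` for non-negative results only, but treat that mapping as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KnownTokensOnly {

    // Builds the list of feature indices by hand, skipping every token that
    // is not in the (frozen) vocabulary, so no -1 entry is ever stored.
    static List<Integer> knownFeatureIndices(List<String> tokens,
                                             Map<String, Integer> vocabulary) {
        List<Integer> features = new ArrayList<>();
        for (String token : tokens) {
            Integer idx = vocabulary.get(token);
            if (idx != null) {
                features.add(idx);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary learned at training time.
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        vocabulary.put("topic", 0);
        vocabulary.put("model", 1);

        List<Integer> features = knownFeatureIndices(
                Arrays.asList("topic", "unseen", "model", "topic"), vocabulary);
        System.out.println(features);
    }
}
```

Dropping unknown tokens shortens the sequence relative to the original token list, so positions in the result no longer line up with token offsets in the document; whether that matters depends on what the downstream code does with the sequence.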