GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Bugs in cc.mallet.topics.ParallelTopicModel #115

clause closed 4 years ago

clause commented 5 years ago

I think I've found two issues in cc.mallet.topics.ParallelTopicModel.

The first is on line 245. The loop bound should be tokens.size() (or tokens.getLength()), not topics.length. If an instance has fewer tokens than the minimum capacity of a FeatureSequence (currently 2), the topics array is longer than the token count, so spurious topics will be added.
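To illustrate the first point, here is a minimal sketch that does not use Mallet itself. `MiniSequence` is a hypothetical stand-in that only mimics the reported property of FeatureSequence (a backing array that is never shorter than 2), and both method names are made up for this example.

```java
import java.util.Random;

final class LoopBoundSketch {

    // Hypothetical stand-in, NOT Mallet's FeatureSequence: the backing array
    // has a minimum capacity of 2, so it can be longer than the logical length.
    static final class MiniSequence {
        private final int[] features;
        private final int length;

        MiniSequence(int length) {
            this.features = new int[Math.max(length, 2)];
            this.length = length;
        }

        int getLength() { return length; }

        int[] getFeatures() { return features; }
    }

    // Reported behaviour: the loop is bounded by the backing array's length.
    static int assignByArrayLength(MiniSequence tokens, int numTopics, Random random) {
        int[] topics = tokens.getFeatures();
        int assigned = 0;
        for (int position = 0; position < topics.length; position++) {
            topics[position] = random.nextInt(numTopics);
            assigned++;
        }
        return assigned;
    }

    // Suggested behaviour: the loop is bounded by the number of tokens.
    static int assignByTokenCount(MiniSequence tokens, int numTopics, Random random) {
        int[] topics = tokens.getFeatures();
        int assigned = 0;
        for (int position = 0; position < tokens.getLength(); position++) {
            topics[position] = random.nextInt(numTopics);
            assigned++;
        }
        return assigned;
    }
}
```

For a one-token sequence, `assignByArrayLength` makes two topic assignments while `assignByTokenCount` makes one, which is the kind of spurious extra topic described above.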

The second is that the worker's docLengthCounts and topicDocCounts are always incremented but only cleared when alphaStatistics are collected. As a result, the counts are optimizeInterval/saveSampleInterval times larger than they should be. topicDocCounts should either be cleared on every loop or only be calculated when the alpha statistics are. docLengthCounts should only be calculated once (i.e., when the TopicAssignment objects are created). Caching this computation will save some time and memory.
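The inflation factor can be checked with a toy model. This is only a sketch of the accounting described above, not Mallet code: it assumes a count is incremented on every sweep where the iteration is a multiple of saveSampleInterval and is read once optimizeInterval iterations have passed. The class and method names are hypothetical.

```java
final class CountResetSketch {

    // Returns the value a single histogram cell (e.g. "documents of length L")
    // would hold after optimizeInterval iterations.
    // Assumption: the cell is incremented by docsWithThisLength on every sampled sweep.
    static int countAtOptimization(int docsWithThisLength,
                                   int optimizeInterval,
                                   int saveSampleInterval,
                                   boolean clearEachSample) {
        int count = 0;
        for (int iteration = 1; iteration <= optimizeInterval; iteration++) {
            if (iteration % saveSampleInterval == 0) {
                if (clearEachSample) {
                    count = 0; // proposed fix: reset before each sampled sweep
                }
                count += docsWithThisLength;
            }
        }
        return count;
    }
}
```

With 100 documents, optimizeInterval = 50 and saveSampleInterval = 10, the uncleared count under these assumptions is 500 instead of 100, i.e. 50/10 = 5 times too large.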

johann-petrak commented 4 years ago

Please report bugs in the cc.mallet package here: https://github.com/mimno/Mallet/issues