Move the methods for adding instances from the corpus representation to the engine

GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.

GNU Lesser General Public License v2.1

This should be an internal refactoring of the API. It is necessary if we want to let an engine know about, and decide, how best to store the instances. Currently we always store the instances as a Mallet InstanceList in memory first, then either use that directly or convert it to the format needed at training time. An engine could instead decide (based on settings such as scaling) that instances can be converted from the Mallet representation to some other representation immediately and, for example, stored externally; at training time an external program is then called on the stored data. Even if we have duplication, the engine could know to store one external file per duplicate and merge them into one file before running the trainer.
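A minimal sketch of what this could look like, to make the proposal concrete. All names here (`Engine`, `addInstance`, `prepareTraining`, `InMemoryEngine`, `ExternalFileEngine`, the `part-N.csv` and `train.csv` files) are hypothetical and are not the plugin's actual API; plain `double[]` features and a CSV row format stand in for the Mallet representation and for whatever format an external trainer would need.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Small demo entry point; every type in this file is a hypothetical illustration. */
final class EngineStorageSketch {
    public static void main(String[] args) throws IOException {
        Engine engine = new ExternalFileEngine(Files.createTempDirectory("lf-sketch"));
        engine.addInstance(0, new double[] {1.0, 2.0}, "a");
        engine.addInstance(1, new double[] {3.0, 4.0}, "b");
        System.out.println("instances ready for training: " + engine.prepareTraining());
    }
}

/** Hypothetical API: the engine, not the corpus representation, decides how to store instances. */
interface Engine {
    /** duplicateId identifies which duplicate of the processing resource produced the instance. */
    void addInstance(int duplicateId, double[] features, String label);

    /** Makes the stored instances available to the trainer and returns how many there are. */
    int prepareTraining();
}

/** Keeps all instances in memory, similar in spirit to the current behaviour. */
final class InMemoryEngine implements Engine {
    private final List<String> rows = new ArrayList<>();

    @Override
    public void addInstance(int duplicateId, double[] features, String label) {
        rows.add(Rows.format(features, label));
    }

    @Override
    public int prepareTraining() {
        return rows.size();
    }
}

/** Converts each instance immediately and appends it to one external file per duplicate. */
final class ExternalFileEngine implements Engine {
    private final Path dir;
    private final List<Path> parts = new ArrayList<>();

    ExternalFileEngine(Path dir) {
        this.dir = dir;
    }

    @Override
    public void addInstance(int duplicateId, double[] features, String label) {
        Path part = dir.resolve("part-" + duplicateId + ".csv");
        if (!parts.contains(part)) {
            parts.add(part);
        }
        try {
            // Append one row; Files.write adds the line separator after each element.
            Files.write(part, Collections.singletonList(Rows.format(features, label)),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int prepareTraining() {
        // Merge the per-duplicate files into a single file an external trainer could read.
        List<String> merged = new ArrayList<>();
        try {
            for (Path part : parts) {
                merged.addAll(Files.readAllLines(part));
            }
            Files.write(dir.resolve("train.csv"), merged);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return merged.size();
    }
}

/** Assumed row format: comma-separated feature values followed by the label. */
final class Rows {
    static String format(double[] features, String label) {
        StringBuilder sb = new StringBuilder();
        for (double f : features) {
            sb.append(f).append(',');
        }
        return sb.append(label).toString();
    }
}
```

With an interface along these lines, the code that extracts instances would only call `addInstance`, and the choice between in-memory storage and external files would be made by the engine. The sketch ignores thread safety and feature scaling, both of which a real implementation would have to handle.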

Move the methods for adding instances from the corpus representation to the engine #6