GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Move the methods for adding instances from the corpus representation to the engine #6

Closed johann-petrak closed 6 years ago

johann-petrak commented 8 years ago

This should be an internal refactoring of the API. It is necessary if we want to allow an engine to know and decide how best to store the instances. Currently we always store the instances as a Mallet InstanceList in memory first, then either use that directly or convert it to the format needed at training time. An engine could decide (based on settings like scaling) that instances can be converted from the Mallet representation to some other representation immediately and, for example, stored externally, so that at training time some external program is called on the stored data. Even if we have duplication, the engine could know to store one external file per duplicate and merge them into one file before running the trainer.
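To make the proposal concrete, here is a minimal sketch of what "the engine receives the instances" could look like. All names (`SketchInstance`, `SketchEngine`, `InMemorySketchEngine`, `ExternalSketchEngine`) are hypothetical and are not the plugin's actual API. The LibSVM-style `index:value` text format is likewise only an assumed example of an external representation.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a training instance; not the plugin's real type.
final class SketchInstance {
  final double[] features;
  final String label;

  SketchInstance(double[] features, String label) {
    this.features = features;
    this.label = label;
  }
}

// Hypothetical engine API: the engine itself receives each instance
// and decides how to store it.
interface SketchEngine {
  void addInstance(SketchInstance instance);
  int instanceCount();
}

// Keeps all instances in memory until training time.
final class InMemorySketchEngine implements SketchEngine {
  private final List<SketchInstance> instances = new ArrayList<>();

  @Override public void addInstance(SketchInstance instance) {
    instances.add(instance);
  }

  @Override public int instanceCount() {
    return instances.size();
  }
}

// Converts each instance immediately and writes it out, so an external
// trainer could later read the data (the writer would typically wrap a file).
final class ExternalSketchEngine implements SketchEngine {
  private final PrintWriter out;
  private int count = 0;

  ExternalSketchEngine(PrintWriter out) {
    this.out = out;
  }

  @Override public void addInstance(SketchInstance instance) {
    StringBuilder line = new StringBuilder(instance.label);
    for (int i = 0; i < instance.features.length; i++) {
      // 1-based "index:value" pairs (assumed LibSVM-style text format)
      line.append(' ').append(i + 1).append(':').append(instance.features[i]);
    }
    line.append('\n');
    out.print(line);
    out.flush();
    count++;
  }

  @Override public int instanceCount() {
    return count;
  }
}
```

The caller only ever sees `addInstance`, so whether the data ends up in a list or in an external file is the engine's decision.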

johann-petrak commented 6 years ago

We have started to solve this issue in a different way: the engine knows which corpus representation it uses, and the corpus representation decides how to convert and store the instances. See #44 for some comments related to this.
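As a rough illustration of this alternative, the sketch below lets each engine name its corpus representation while the representation owns conversion and storage. The types `CorpusRepr`, `DenseListRepr`, `TextLinesRepr`, `ReprEngine`, `InProcessEngine` and `WrappedTrainerEngine` are hypothetical names chosen for this example and do not reflect the plugin's real classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: the corpus representation owns conversion and storage.
interface CorpusRepr {
  void add(double[] features, String label);
  int size();
}

// Stores instances unchanged in memory.
final class DenseListRepr implements CorpusRepr {
  private final List<double[]> rows = new ArrayList<>();
  private final List<String> labels = new ArrayList<>();

  @Override public void add(double[] features, String label) {
    rows.add(features.clone());
    labels.add(label);
  }

  @Override public int size() {
    return rows.size();
  }
}

// Converts each instance to a text line on arrival (assumed format:
// label, a tab, then comma-separated feature values).
final class TextLinesRepr implements CorpusRepr {
  private final List<String> lines = new ArrayList<>();

  @Override public void add(double[] features, String label) {
    StringBuilder sb = new StringBuilder(label).append('\t');
    for (int i = 0; i < features.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(features[i]);
    }
    lines.add(sb.toString());
  }

  @Override public int size() {
    return lines.size();
  }

  List<String> lines() {
    return lines;
  }
}

// Hypothetical: each engine only knows which representation it uses.
abstract class ReprEngine {
  abstract CorpusRepr createCorpusRepresentation();
}

final class InProcessEngine extends ReprEngine {
  @Override CorpusRepr createCorpusRepresentation() {
    return new DenseListRepr();
  }
}

final class WrappedTrainerEngine extends ReprEngine {
  @Override CorpusRepr createCorpusRepresentation() {
    return new TextLinesRepr();
  }
}
```

Compared with putting the add methods on the engine, this keeps the storage logic in one place per format, and several engines can share the same representation.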