GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Training set caching / corpus representation caching #86

Open johann-petrak opened 5 years ago

johann-petrak commented 5 years ago

Add a parameter (maybe just something to be used as an "algorithmParameter") to enable training set caching: whatever corpus representation the chosen algorithm uses gets saved to the data directory (under a name specific to the type of representation) once the training set is complete, but before the training set is finalized and training itself starts.
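
A minimal sketch of what the save step could look like for a Mallet-backed representation, using plain Java serialization of the `InstanceList`. The class name, method name and cache file name (`CorpusCacheWriter`, `saveCache`, `corpusCache.mallet.ser`) are illustrative assumptions, not the plugin's actual API:

```java
// Hypothetical sketch: persist the in-memory Mallet corpus representation to the
// data directory once all documents have been processed, before finalization/training.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import cc.mallet.types.InstanceList;

public class CorpusCacheWriter {

  /** Serialize the instance list (including its alphabets/pipe) to a cache file. */
  public static void saveCache(InstanceList instances, File dataDirectory) throws IOException {
    // One file name per representation type so different representations do not clash.
    File cacheFile = new File(dataDirectory, "corpusCache.mallet.ser");
    try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(cacheFile))) {
      oos.writeObject(instances);
    }
  }
}
```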

If caching is enabled and a cache file already exists, it should be read in before document processing starts in order to initialize the instance list; the documents are then added to that instance list. Caching should probably also save the feature information and compare it to the feature info used for the new documents, throwing an error if there is a mismatch. A hedged sketch of this load/validation side is shown below.
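
The sketch below deserializes the cache file if it exists and runs a deliberately simplified check of the cached feature space against the current configuration. Again, all names (`CorpusCacheReader`, `loadCache`, `checkFeatureInfo`) are hypothetical:

```java
// Hypothetical sketch of the load path: read the serialized InstanceList before any
// documents are processed, and fail early if the feature information does not match.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

import cc.mallet.types.Alphabet;
import cc.mallet.types.InstanceList;

public class CorpusCacheReader {

  /** Load a cached instance list, or return null if no cache exists yet. */
  public static InstanceList loadCache(File dataDirectory)
      throws IOException, ClassNotFoundException {
    File cacheFile = new File(dataDirectory, "corpusCache.mallet.ser");
    if (!cacheFile.exists()) {
      return null; // no cache yet: start with an empty instance list as usual
    }
    try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(cacheFile))) {
      return (InstanceList) ois.readObject();
    }
  }

  /**
   * Fail early if the cached feature space appears to differ from the current configuration.
   * A real implementation would compare the saved feature information in detail; comparing
   * alphabet sizes here is only a simplified placeholder check.
   */
  public static void checkFeatureInfo(InstanceList cached, Alphabet currentDataAlphabet) {
    if (cached.getDataAlphabet().size() != currentDataAlphabet.size()) {
      throw new IllegalStateException(
          "Cached corpus representation does not match the current feature configuration");
    }
  }
}
```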

Rationale: this helps in at least two situations:

Note: caching does not make sense with out-of-memory corpus representations such as the dense JSON representation used for Pytorch/Keras, since the saved corpus representation already is a kind of cache. The difference/problem is the metadata: in some cases, adding to the corpus would require updating the metadata, so to support this properly we need two functions in the Python data backend: