GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Refactor JSON exporter for sequences #53

Open johann-petrak opened 7 years ago

johann-petrak commented 7 years ago

Refactor the CorpusExporterJsonSeq methods so that instead of directly exporting the whole list of sequences in the export method, we split the logic into:

However, we also eventually want to be able to write lines (for each sequence) incrementally whenever the corpus representation instance "adds" a new sequence to its "list". If we implement a different corpus representation where "adding" really means writing to a previously opened file handle in some format, then each exporter instance should allow for the following actions:

Eventually, the Exporter instance, or parts of it, should be shared between duplicates of the PR so that all duplicates can write to the same file in a synchronized way. In that design, the Mallet feature vector and the Mallet feature vector sequence should be local to each duplicate, and the conversion from feature vector or sequence to string should also be local so that it can run in parallel. The alphabets, and maybe the LFPipe instance, should be shared, and writing the final string to the shared file handle should be synchronized.
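The split between local conversion and synchronized writing could be sketched roughly as follows. This is only an illustration under stated assumptions: `SharedLineSink`, `DuplicateExporter` and `SharedSinkDemo` are hypothetical names, not classes of the plugin, a plain list of strings stands in for a Mallet feature vector sequence, and a `StringWriter` stands in for the opened file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

/** Hypothetical shared sink: a single instance is shared by all duplicates. */
final class SharedLineSink {
    private final Writer out; // stands in for the previously opened file handle

    SharedLineSink(Writer out) { this.out = out; }

    /** Only this short critical section is serialized across threads. */
    synchronized void writeLine(String line) {
        try {
            out.write(line);
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical per-duplicate exporter: conversion is local, writing is shared. */
final class DuplicateExporter {
    private final SharedLineSink sink;

    DuplicateExporter(SharedLineSink sink) { this.sink = sink; }

    /** Stand-in for converting one sequence to a single JSON line. */
    static String toJsonLine(List<String> sequence) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < sequence.size(); i++) {
            if (i > 0) sb.append(',');
            String escaped = sequence.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append(']').toString();
    }

    void export(List<String> sequence) {
        String line = toJsonLine(sequence); // runs in parallel, no lock held
        sink.writeLine(line);               // synchronized on the shared sink
    }
}

final class SharedSinkDemo {
    /** Runs several duplicates in parallel and returns everything they wrote. */
    static String run(int duplicates, int sequencesEach) {
        StringWriter buffer = new StringWriter();
        SharedLineSink sink = new SharedLineSink(buffer);
        Thread[] threads = new Thread[duplicates];
        for (int d = 0; d < duplicates; d++) {
            final int id = d;
            threads[d] = new Thread(() -> {
                DuplicateExporter exporter = new DuplicateExporter(sink);
                for (int s = 0; s < sequencesEach; s++) {
                    exporter.export(Arrays.asList("dup" + id, "seq" + s));
                }
            });
            threads[d].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return buffer.toString();
    }
}
```

Because each line is fully built before the lock is taken, lines from different duplicates may appear in any order but should never interleave within a line.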

This should be in line with making in-memory corpus representations compatible with multi-threading: again, only the alphabets, the LFPipe and the instance list should be shared, and adding an instance/sequence to the list has to be synchronized.
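For the in-memory case, the same idea might look like the sketch below. It makes several assumptions: `SharedAlphabet` is a simplified stand-in for a Mallet alphabet (a synchronized feature-to-index map), `InMemoryCorpus` is a hypothetical name, and an `int[]` of feature indices stands in for a real instance. The LFPipe is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a shared alphabet: feature name -> index. */
final class SharedAlphabet {
    private final Map<String, Integer> indices = new HashMap<>();

    /** Returns the index of the feature, assigning the next free index if it is new. */
    synchronized int lookupIndex(String feature) {
        Integer idx = indices.get(feature);
        if (idx == null) {
            idx = indices.size();
            indices.put(feature, idx);
        }
        return idx;
    }

    synchronized int size() { return indices.size(); }
}

/** Hypothetical in-memory corpus shared by all duplicates. */
final class InMemoryCorpus {
    private final SharedAlphabet alphabet = new SharedAlphabet();
    private final List<int[]> instances = new ArrayList<>();

    /** Called by each duplicate; only the alphabet lookups and the list add take a lock. */
    void add(List<String> features) {
        int[] indexed = new int[features.size()]; // local to the calling thread
        for (int i = 0; i < indexed.length; i++) {
            indexed[i] = alphabet.lookupIndex(features.get(i));
        }
        synchronized (instances) {
            instances.add(indexed);
        }
    }

    int size() {
        synchronized (instances) {
            return instances.size();
        }
    }

    int alphabetSize() { return alphabet.size(); }

    /** Fills a corpus from several threads, each acting as one duplicate. */
    static InMemoryCorpus fillInParallel(int threads, int perThread) {
        InMemoryCorpus corpus = new InMemoryCorpus();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int s = 0; s < perThread; s++) {
                    corpus.add(Arrays.asList("f" + (s % 10), "shared"));
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return corpus;
    }
}
```

A concurrent collection could replace the explicit locks. The explicit form is used here only to make the shared state visible.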

johann-petrak commented 7 years ago

Commit e69c5cb0ece9049f022c7c8311503070c13fe8cb now implements the first steps of the refactoring, including the exporter for sequences, classification, and regression.