A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.
Refactor the CorpusExporterJsonSeq methods so that the export method no longer exports the whole list of sequences directly. Instead, split the logic into three layers:
create the string for one feature vector
using that, create one output line string from one sequence
using that, reimplement the method which takes all sequences and writes them to a file
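A minimal sketch of the three layers, to make the split concrete. This is not the plugin's code: `JsonSeqExportSketch` and its method names are hypothetical, plain `double[]` arrays stand in for Mallet feature vectors, and the JSON layout (one array of arrays per line) is an assumption rather than the format CorpusExporterJsonSeq actually writes.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

// Hypothetical sketch: double[] stands in for a Mallet feature vector.
class JsonSeqExportSketch {

    // Layer 1: one feature vector -> one JSON array string, e.g. [1.0,0.5].
    // Note: NaN or infinite values would not produce valid JSON here.
    public static String featureVectorToString(double[] fv) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < fv.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(fv[i]);
        }
        return sb.append(']').toString();
    }

    // Layer 2: one sequence -> one output line (an array of layer-1 arrays).
    public static String sequenceToLine(List<double[]> sequence) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < sequence.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(featureVectorToString(sequence.get(i)));
        }
        return sb.append(']').toString();
    }

    // Layer 3: all sequences -> one line per sequence on the given writer.
    public static void exportAll(List<List<double[]>> sequences, Writer out)
            throws IOException {
        for (List<double[]> seq : sequences) {
            out.write(sequenceToLine(seq));
            out.write('\n');
        }
    }
}
```

Because layer 2 returns a complete line, the same method could later be called once per sequence for incremental writing, without going through layer 3.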
However, we also eventually want to be able to write lines (for each sequence) incrementally whenever the corpus representation instance "adds" a new sequence to its "list". If we implement a different corpus representation where "adding" really means writing to a previously opened file handle in some format, then each exporter instance should allow for the following actions:
initialize, may open the file for writing
return its own instance of a corpus representation (similar to what Engines do now)
export an instance (non-seq) or sequence (seq)
finish/close
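The four actions above could be captured in an interface roughly like the following. All names here (`IncrementalExporter`, `CorpusSink`, `LineExporter`) are hypothetical and only illustrate the lifecycle. The sketch writes to a `Writer` passed in by the caller so that it runs without touching the file system, whereas a real exporter would probably open its file in `init()`.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

// Hypothetical corpus representation: "adding" an item forwards it to the exporter.
interface CorpusSink<T> {
    void add(T item) throws IOException;
}

// Hypothetical exporter lifecycle: initialize, hand out a corpus
// representation, export items one by one, then finish/close.
interface IncrementalExporter<T> extends AutoCloseable {
    void init() throws IOException;
    CorpusSink<T> getCorpusRepresentation();
    void export(T item) throws IOException;
    @Override
    void close() throws IOException;
}

// Toy implementation: a sequence is a list of tokens, written as one line.
class LineExporter implements IncrementalExporter<List<String>> {
    private final Writer target;
    private boolean open = false;

    LineExporter(Writer target) {
        this.target = target;
    }

    @Override
    public void init() {
        open = true; // a file-based exporter would open its file here
    }

    @Override
    public CorpusSink<List<String>> getCorpusRepresentation() {
        return this::export; // every add() becomes an immediate write
    }

    @Override
    public void export(List<String> sequence) throws IOException {
        if (!open) throw new IllegalStateException("init() has not been called");
        target.write(String.join(" ", sequence));
        target.write('\n');
    }

    @Override
    public void close() throws IOException {
        open = false;
        target.close();
    }
}
```

With this shape the PR only ever talks to the corpus representation, and whether `add` stores the item in memory or writes it out is decided by the exporter that created it.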
Eventually, the Exporter instance, or parts of it, should be shared between duplicates of the PR so that all duplicates can write to the same file in a synchronized way. In that setup, the Mallet feature vector and the Mallet feature vector sequence should be local to each duplicate, and the conversion from feature vector or sequence to string should happen locally and in parallel. The alphabets, and possibly the LFPipe instance, should be shared, and writing the final string to the file handle should be shared and synchronized.
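One way the local/shared split could look, as a rough sketch only. `SharedLineWriter` and `DuplicateSketch` are hypothetical names, plain threads stand in for PR duplicates, and the "conversion" is a trivial string build. The point is just that each thread builds its line outside the lock and only the final write is synchronized.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical shared part of the exporter: one instance for all duplicates.
class SharedLineWriter {
    private final Writer out;
    private int linesWritten = 0;

    SharedLineWriter(Writer out) {
        this.out = out;
    }

    // Only the final write is serialized, so lines never interleave.
    synchronized void writeLine(String line) throws IOException {
        out.write(line);
        out.write('\n');
        linesWritten++;
    }

    synchronized int linesWritten() {
        return linesWritten;
    }
}

class DuplicateSketch {
    // Stand-in for the per-duplicate conversion; runs without any lock.
    static String toLine(int duplicateId, int seqNo) {
        return "dup" + duplicateId + "-seq" + seqNo;
    }

    // Starts nDuplicates threads that each export perDuplicate lines,
    // waits for them, and returns the total number of lines written.
    static int runDuplicates(SharedLineWriter shared, int nDuplicates, int perDuplicate)
            throws InterruptedException {
        Thread[] threads = new Thread[nDuplicates];
        for (int d = 0; d < nDuplicates; d++) {
            final int id = d;
            threads[d] = new Thread(() -> {
                for (int s = 0; s < perDuplicate; s++) {
                    String line = toLine(id, s); // local, parallel
                    try {
                        shared.writeLine(line);  // shared, synchronized
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            });
            threads[d].start();
        }
        for (Thread t : threads) t.join();
        return shared.linesWritten();
    }
}
```

The order of lines in the output depends on thread scheduling, so this only suits formats where the order of sequences in the file does not matter.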
This should be in line with making the in-memory corpus representations compatible with multi-threading: again, only the alphabets, the LFPipe and the instance list should be shared, and adding an instance or sequence to the list has to be synchronized.
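For the in-memory case, the same idea might look like this. It is a sketch under assumptions: `SharedAlphabet` is a hypothetical stand-in for a Mallet alphabet (not the Mallet class itself), instances are reduced to `int[]` index arrays, and the LFPipe is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a shared alphabet: maps feature names to
// indices and grows on demand; lookups are synchronized.
class SharedAlphabet {
    private final Map<String, Integer> index = new HashMap<>();

    synchronized int lookupIndex(String feature) {
        Integer i = index.get(feature);
        if (i == null) {
            i = index.size();
            index.put(feature, i);
        }
        return i;
    }

    synchronized int size() {
        return index.size();
    }
}

// Hypothetical in-memory corpus shared by all duplicates.
class SharedCorpus {
    private final SharedAlphabet alphabet = new SharedAlphabet();
    private final List<int[]> instances = new ArrayList<>();

    // Called by each duplicate: the array is built locally, only the
    // individual alphabet lookups touch shared state.
    int[] toIndices(List<String> features) {
        int[] idx = new int[features.size()];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = alphabet.lookupIndex(features.get(i));
        }
        return idx;
    }

    // Adding to the shared instance list is the synchronized step.
    synchronized void add(int[] instance) {
        instances.add(instance);
    }

    synchronized int size() {
        return instances.size();
    }

    int alphabetSize() {
        return alphabet.size();
    }
}
```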
Commit e69c5cb0ece9049f022c7c8311503070c13fe8cb now implements the first steps of this refactoring, along with the exporter for sequences, classification, and regression.