GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Add a way to randomly shuffle the corpus / data file #82

Open johann-petrak opened 6 years ago

johann-petrak commented 6 years ago

It can happen that a corpus contains training instances grouped by class, which is very bad for training. In such cases there should be a way either to shuffle the corpus before running the pipeline with the training PR on it, or to shuffle the generated data file before using it (and before splitting off the validation instances).

Doing it inside GATE by providing a menu entry for shuffling on a corpus:

Shuffling the data file:

johann-petrak commented 6 years ago

Implement a Python utility function in the gate-lf-python-data library for doing this, either by loading all lines directly into memory, if possible, or by creating a list of starting offsets (and maybe lengths) and using seek.

Roughly:

idx.append((curoffset, curlinelength))  # one (offset, length) tuple per line
curoffset += curlinelength
...
# shuffle idx, then go through it and ...
thefile.seek(offset)
line = thefile.readline()  # in that case there is no need to store the length, but maybe a lower-level read using the length is faster?
# write line to shuffled file
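
A fuller version of the rough outline above might look like the following minimal sketch. The function name `shuffle_lines`, the helper `_with_newline`, and the parameters `seed` and `max_in_memory_bytes` are hypothetical and not part of gate-lf-python-data. The sketch assumes one instance per line and opens the file in binary mode so that offsets and lengths are byte counts.

```python
import os
import random


def _with_newline(line):
    # The last line of a file may lack a trailing newline; add one so it
    # does not fuse with whatever line follows it after shuffling.
    return line if line.endswith(b"\n") else line + b"\n"


def shuffle_lines(infile, outfile, seed=None, max_in_memory_bytes=100 * 1024 * 1024):
    """Write the lines of infile to outfile in random order (hypothetical helper)."""
    rng = random.Random(seed)
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        if os.path.getsize(infile) <= max_in_memory_bytes:
            # Small file: load all lines into memory and shuffle them directly.
            lines = src.readlines()
            rng.shuffle(lines)
            for line in lines:
                dst.write(_with_newline(line))
            return
        # Large file: one pass to record (offset, length) for every line.
        idx = []
        curoffset = 0
        for line in src:
            idx.append((curoffset, len(line)))
            curoffset += len(line)
        # Shuffle the index, then copy each line by seeking to its offset.
        rng.shuffle(idx)
        for offset, length in idx:
            src.seek(offset)
            dst.write(_with_newline(src.read(length)))
```

Calling `shuffle_lines("data.txt", "data.shuffled.txt", seed=42)` would produce a reproducible shuffle, and setting `max_in_memory_bytes=0` forces the offset-based path. Storing the length lets the copy step use `read(length)` instead of `readline()`. Whether that is actually faster would need measuring.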