Open johann-petrak opened 6 years ago
Implement a Python utility function in the gate-lf-python-data library for doing this, either by loading all lines directly into memory, if possible, or by creating a list of line starting offsets (and maybe lengths) and using seek.
Roughly:
idx.append((curoffset, curlinelength))
curoffset += curlinelength
...
# shuffle idx, then go through it and ...
thefile.seek(offset)
line=thefile.readline() # in that case no need to store length, but maybe using lower level read giving length is faster?
# write line to shuffled file
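The outline above could be turned into a function along the following lines. This is only a minimal sketch, not existing gate-lf-python-data code: the name `shuffle_lines_by_offset` and its parameters are hypothetical, and it assumes a data file with one instance per line. It stores only the starting offsets and relies on `readline()` after each `seek()`, so line lengths are not needed.

```python
import random


def shuffle_lines_by_offset(in_path, out_path, seed=None):
    """Write the lines of in_path to out_path in random order.

    Hypothetical helper: only the byte offsets of the line starts are
    kept in memory, not the lines themselves.
    """
    offsets = []
    # Binary mode keeps tell()/seek() positions as plain byte offsets.
    with open(in_path, "rb") as infile:
        # Pass 1: record the byte offset at which each line starts.
        while True:
            pos = infile.tell()
            line = infile.readline()
            if not line:
                break
            offsets.append(pos)

        # A fixed seed makes the permutation repeatable.
        random.Random(seed).shuffle(offsets)

        # Pass 2: visit the offsets in shuffled order and copy each line.
        with open(out_path, "wb") as outfile:
            for pos in offsets:
                infile.seek(pos)
                line = infile.readline()
                # A last line without a newline must not merge with its successor.
                if not line.endswith(b"\n"):
                    line += b"\n"
                outfile.write(line)
    return len(offsets)
```

Memory use grows with the number of lines rather than with the file size. The second pass does one random seek per line, which may be slow for very large files on some storage, so this would need measuring.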
It can happen that a corpus contains training instances grouped by class, which is very bad for training. In such cases there should be a way to either shuffle the corpus before running the pipeline with the training PR on it, or to shuffle the generated data file before using it (and before splitting off the validation instances).
Doing it inside GATE by providing a menu entry for shuffling a corpus:
Collections.shuffle(list,random)
given that a corpus is a list
Shuffling the data file:
shuf
on Linux works well, but: no easy way to provide repeatable randomness through a seed, unknown how well it scales beyond available memory
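For data files that do fit into memory, a seeded shuffle is short to write in Python and gives the repeatability that is awkward to get from shuf. The following is a rough sketch under that assumption; `shuffle_lines_in_memory` is a hypothetical name, not an existing function of any library.

```python
import random


def shuffle_lines_in_memory(in_path, out_path, seed=None):
    """Shuffle all lines of in_path into out_path; the whole file is held in memory."""
    with open(in_path, "rb") as infile:
        lines = infile.readlines()
    # Make sure the last line ends with a newline so lines cannot merge after shuffling.
    if lines and not lines[-1].endswith(b"\n"):
        lines[-1] += b"\n"
    # The same seed always yields the same order for the same input.
    random.Random(seed).shuffle(lines)
    with open(out_path, "wb") as outfile:
        outfile.writelines(lines)
    return len(lines)
```

Running this on the generated data file before splitting off the validation instances would give the same split on every run with the same seed.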