[ ] Move the data out of the code directory. Because we work with all languages and all algorithms, these directories have grown to several hundred GB. We would like to keep them on a spinning drive and apply Windows "compact" to them to speed up retrieval and save space (they compress to about 10-15% of the original size).
[x] The <repodir>/data dir will stay here as it is part of the code
[ ] The experiments and results directories will move to another location, pointed to by the conf.py file
[ ] Add a delta_upgrade.py script. If you have, e.g., the full v18.0 dataset file(s), you can use the v19.0 delta release (usually much smaller than the full version) to create the full v19.0 files. This needs the following steps:
[ ] Expand full v18.0 metadata, expand delta v19.0 metadata
[ ] Merge the v19.0 validated and invalidated files and save the result as full v19.0.
[ ] Here "other" needs special attention: some records might have moved into the new validated or invalidated sets between versions, and new records might have been added.
[ ] Use Corpora Creator "s1" algorithm to create default splits
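
The merge steps above could be sketched roughly as follows. This is only an illustration, not the actual delta_upgrade.py: the function name `merge_delta`, the in-memory layout (each bucket as a list of dicts), and the use of a `path` column as the unique record key are all assumptions.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Buckets = Dict[str, List[Record]]

BUCKETS = ("validated", "invalidated", "other")


def merge_delta(full_prev: Buckets, delta: Buckets, key: str = "path") -> Buckets:
    """Combine a previous full release with a delta release (hypothetical helper).

    Each record ends up in exactly one bucket. If the delta places a record
    in a different bucket than the previous release did (e.g. "other" ->
    "validated"), the delta placement wins.
    """
    # Map record key -> (bucket, record). The previous release is read first,
    # so entries from the delta overwrite older placements.
    placement: Dict[str, Tuple[str, Record]] = {}
    for source in (full_prev, delta):
        for bucket in BUCKETS:
            for rec in source.get(bucket, []):
                placement[rec[key]] = (bucket, rec)

    merged: Buckets = {bucket: [] for bucket in BUCKETS}
    for bucket, rec in placement.values():
        merged[bucket].append(rec)
    return merged


if __name__ == "__main__":
    # Assumed toy data: "c.mp3" was in "other" and gets validated in the delta.
    prev = {
        "validated": [{"path": "a.mp3"}],
        "invalidated": [{"path": "b.mp3"}],
        "other": [{"path": "c.mp3"}, {"path": "d.mp3"}],
    }
    delta = {
        "validated": [{"path": "c.mp3"}],
        "invalidated": [],
        "other": [{"path": "e.mp3"}],
    }
    for name, records in merge_delta(prev, delta).items():
        print(name, [r["path"] for r in records])
```

Keying every record once and letting the newer release overwrite the older placement handles the "other" case without a separate pass: records that left "other" disappear from it automatically, and new "other" records are appended.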
This PR is a work in progress (WIP) and includes the items listed above.