johann-petrak opened 7 years ago
To implement this properly and cleanly would require a lot of refactoring. To implement this with minimal effort we could do this:
One problem currently is that in order to export into a format usable for NNs, some of the features may need a reverse lookup: finding the original string given the numeric feature value. Not encoding as a number in the first place may be the better solution there, but that would require a lot of changes in the code. But maybe we can adapt all uses of NNs to work with vectorized numeric representations directly?
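The reverse lookup could be kept alongside the forward mapping in a bidirectional alphabet. A minimal sketch (class and method names are hypothetical, not the actual plugin code), assuming feature strings are interned to integer ids as instances are built:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bidirectional feature alphabet: interns feature strings to
// integer ids and supports the reverse lookup needed when exporting for NNs.
class FeatureAlphabet {
    private final Map<String, Integer> str2id = new HashMap<>();
    private final List<String> id2str = new ArrayList<>();

    // Return the id for a feature string, assigning a new one if unseen.
    public int lookupOrAdd(String feature) {
        Integer id = str2id.get(feature);
        if (id == null) {
            id = id2str.size();
            str2id.put(feature, id);
            id2str.add(feature);
        }
        return id;
    }

    // Reverse lookup: recover the original feature string from its id.
    public String getFeature(int id) {
        return id2str.get(id);
    }

    public int size() {
        return id2str.size();
    }
}
```

Keeping the `id2str` list in step with the map makes the reverse direction an O(1) array access, so exporting does not need to invert the map afterwards.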
Currently Engines are tightly coupled to Mallet corpus representations. Whenever an engine is created from what is saved in the data directory, a Mallet corpus representation is loaded back in order to get the pipe and other data (which know how to convert annotations to features at application time).
This means that if we want to use a different, non-Mallet representation for some algorithm, we are currently out of luck: too much would need to change.
However, for a number of algorithms it would be good to use a different representation, not just to switch from in-memory to OOM.
Ideally, the Engine would know which representation to use. The PR (processing resource) creates the engine based on the selected algorithm and any parameters; the engine then returns the corpus representation, and the PR uses the add method of that representation to generate instances. The corpus representation converts each instance and either stores it in memory or saves it to a file. At application time, when the engine is restored, the engine also knows how to re-create the corpus representation needed. This would be done by a new instance method recreateCorpusRepresentation(datadir) instead of loadMalletCorpusRepresentation(datadir).
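The shape of that refactoring might look like the following sketch. All class names and the in-memory implementation are hypothetical illustrations (only recreateCorpusRepresentation is the method name proposed above); this is not the actual plugin code:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the engine owns its corpus representation, so the PR
// never needs to know whether instances end up in memory or on disk.
abstract class CorpusRepresentation {
    // Convert an annotation-derived instance and store it.
    abstract void add(Object instance);
}

class InMemoryRepresentation extends CorpusRepresentation {
    final List<Object> instances = new ArrayList<>();

    @Override
    void add(Object instance) {
        instances.add(instance);
    }
}

abstract class Engine {
    protected CorpusRepresentation corpusRepresentation;

    // The PR only ever talks to this, via add().
    CorpusRepresentation getCorpusRepresentation() {
        return corpusRepresentation;
    }

    // At application time, rebuild whatever representation this engine
    // needs from the saved data directory (replacing the old
    // loadMalletCorpusRepresentation).
    abstract void recreateCorpusRepresentation(File dataDir);
}

class SimpleInMemoryEngine extends Engine {
    SimpleInMemoryEngine() {
        corpusRepresentation = new InMemoryRepresentation();
    }

    @Override
    void recreateCorpusRepresentation(File dataDir) {
        // A real engine would restore pipes and metadata from dataDir here.
        corpusRepresentation = new InMemoryRepresentation();
    }
}
```

With this shape, adding a non-Mallet or file-backed representation means writing one new CorpusRepresentation subclass and one engine that instantiates it; the PR code stays untouched.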
If the engine is always responsible for creating its corpus representation, then it can also parametrize the corpus representation to work OOM, if necessary. This could depend on the algorithm, on a variation of the algorithm, or even on an algorithm parameter.
Changed the title of this issue to better reflect that it is now about supporting different kinds of corpus representation per algorithm/engine, part of which is about OOM representation.
created branch issue44 for this
b0f84ac9d481baaf0f9fa9a4a1528f4b23bb3fe2 Step 1 completed: refactored to allow non-Mallet corpus representations and the Engine to decide which CR to use
Important: the out-of-core representation option is even more important when exporting: the whole point of exporting may be that there is simply too much data to keep everything in core!
Merged with master so we can work on other issues. Leaving this open until we actually implement a proper OOC representation or exporter.
We must at least support out-of-core exporting, and ideally also OOC training for some engines or algorithms, e.g. for wrapping https://github.com/JohnLangford/vowpal_wabbit, for neural networks, and for simple online learning (AROW).
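Out-of-core exporting could be as simple as writing each instance the moment it is produced and forgetting it, so memory use stays constant regardless of corpus size. A minimal sketch (class name and format hypothetical, the line layout loosely modelled on Vowpal Wabbit's `label | feature:value` input lines):

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.Map;

// Hypothetical streaming exporter: each call to export() immediately writes
// one instance to the sink, so the whole corpus never has to fit in core.
class StreamingInstanceExporter implements AutoCloseable {
    private final PrintWriter out;

    StreamingInstanceExporter(Writer sink) {
        this.out = new PrintWriter(sink);
    }

    // Write one instance as: label | feat1:val1 feat2:val2 ...
    void export(String label, Map<String, Double> features) {
        StringBuilder sb = new StringBuilder(label).append(" |");
        for (Map.Entry<String, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        out.println(sb);
    }

    @Override
    public void close() {
        out.close();
    }
}
```

The same one-instance-at-a-time interface would also suit online learners such as AROW, which consume instances sequentially anyway.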
NOTE: see also issue #55