johann-petrak opened 7 years ago
To implement this properly and cleanly would require a lot of refactoring. To implement this with minimal effort we could do this:
One problem currently is that in order to export into a format usable for NNs, some of the features may need a reverse lookup: finding the original string given the numeric feature value. Not encoding as a number in the first place may be the better solution there, but that would require a lot of changes in the code. But maybe we can adapt all uses of NNs to work with vectorized numeric representations directly?
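The reverse lookup could be kept alongside the forward mapping in a bidirectional alphabet. A minimal sketch (class and method names are hypothetical, not the actual plugin code), assuming feature strings are interned to integer ids as instances are built:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bidirectional feature alphabet: interns feature strings to
// integer ids and supports the reverse lookup needed when exporting for NNs.
class FeatureAlphabet {
    private final Map<String, Integer> str2id = new HashMap<>();
    private final List<String> id2str = new ArrayList<>();

    // Return the id for a feature string, assigning a new one if unseen.
    public int lookupOrAdd(String feature) {
        Integer id = str2id.get(feature);
        if (id == null) {
            id = id2str.size();
            str2id.put(feature, id);
            id2str.add(feature);
        }
        return id;
    }

    // Reverse lookup: recover the original feature string from its id.
    public String getFeature(int id) {
        return id2str.get(id);
    }

    public int size() {
        return id2str.size();
    }
}
```

Keeping the `id2str` list in step with the map makes the reverse direction an O(1) array access, so exporting does not need to invert the map afterwards.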
Currently Engines are tightly coupled to Mallet corpus representations. Whenever an engine is created from what is saved in the data directory, a Mallet corpus representation is loaded back in order to get the pipe and other data (which know how to convert annotations to features at application time).
This means that if we want to use a different, non-Mallet representation for some algorithm, we are currently out of luck: too much would need to change.
However, for a number of algorithms it would be good to use a different representation, not just to switch from in-memory to OOM.
Ideally, the Engine would know which representation to use. The PR (processing resource) creates the engine based on the selected algorithm and any parameters; the engine then returns the corpus representation, and the PR uses the add method of that representation to generate instances. The corpus representation converts each instance and either stores it in memory or saves it to a file. At application time, when the engine is restored, the engine also knows how to re-create the corpus representation needed. This would be done by a new instance method recreateCorpusRepresentation(datadir) instead of loadMalletCorpusRepresentation(datadir).
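The shape of that refactoring might look like the following sketch. All class names and the in-memory implementation are hypothetical illustrations (only recreateCorpusRepresentation is the method name proposed above); this is not the actual plugin code:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the engine owns its corpus representation, so the PR
// never needs to know whether instances end up in memory or on disk.
abstract class CorpusRepresentation {
    // Convert an annotation-derived instance and store it.
    abstract void add(Object instance);
}

class InMemoryRepresentation extends CorpusRepresentation {
    final List<Object> instances = new ArrayList<>();

    @Override
    void add(Object instance) {
        instances.add(instance);
    }
}

abstract class Engine {
    protected CorpusRepresentation corpusRepresentation;

    // The PR only ever talks to this, via add().
    CorpusRepresentation getCorpusRepresentation() {
        return corpusRepresentation;
    }

    // At application time, rebuild whatever representation this engine
    // needs from the saved data directory (replacing the old
    // loadMalletCorpusRepresentation).
    abstract void recreateCorpusRepresentation(File dataDir);
}

class SimpleInMemoryEngine extends Engine {
    SimpleInMemoryEngine() {
        corpusRepresentation = new InMemoryRepresentation();
    }

    @Override
    void recreateCorpusRepresentation(File dataDir) {
        // A real engine would restore pipes and metadata from dataDir here.
        corpusRepresentation = new InMemoryRepresentation();
    }
}
```

With this shape, adding a non-Mallet or file-backed representation means writing one new CorpusRepresentation subclass and one engine that instantiates it; the PR code stays untouched.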
If the engine is always responsible for creating its corpus representation, then it can also parametrize the corpus representation to work OOM, if necessary. This could depend on the algorithm, on a variation of the algorithm, or even on an algorithm parameter.
Changed the title of this issue to better reflect that it is now about supporting different kinds of corpus representation per algorithm/engine, part of which is about OOM representation.
created branch issue44 for this
b0f84ac9d481baaf0f9fa9a4a1528f4b23bb3fe2 Step 1 completed: refactored to allow non-Mallet corpus representations and the Engine to decide which CR to use
Important: the out-of-core representation option is even more important when exporting: the whole point of exporting may be that there is simply too much data to keep everything in core!
Merged with master so we can work on other issues. Leaving this open until we actually implement a proper OOC representation or exporter.
We must at least support out-of-core exporting, and ideally also OOC training for some engines or algorithms, e.g. for wrapping https://github.com/JohnLangford/vowpal_wabbit, for neural networks, and for simple online learning (AROW).
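Out-of-core exporting could be as simple as writing each instance the moment it is produced and forgetting it, so memory use stays constant regardless of corpus size. A minimal sketch (class name and format hypothetical, the line layout loosely modelled on Vowpal Wabbit's `label | feature:value` input lines):

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.Map;

// Hypothetical streaming exporter: each call to export() immediately writes
// one instance to the sink, so the whole corpus never has to fit in core.
class StreamingInstanceExporter implements AutoCloseable {
    private final PrintWriter out;

    StreamingInstanceExporter(Writer sink) {
        this.out = new PrintWriter(sink);
    }

    // Write one instance as: label | feat1:val1 feat2:val2 ...
    void export(String label, Map<String, Double> features) {
        StringBuilder sb = new StringBuilder(label).append(" |");
        for (Map.Entry<String, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        out.println(sb);
    }

    @Override
    public void close() {
        out.close();
    }
}
```

The same one-instance-at-a-time interface would also suit online learners such as AROW, which consume instances sequentially anyway.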
NOTE: see also issue #55