For training and exporting, just doing everything in parallel except what has to be done in a synchronized way may be the better option: maybe we can somehow make sure that the data which must be common to all duplicates gets shared and that all accesses to it are synchronized.
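As a concrete illustration of that idea, here is a minimal sketch of how a PR could hold data shared by all duplicates via GATE's @Sharable property mechanism, with every access synchronized on the shared object. The SharedData class, the field names, and the counter are made up for the example; this is not the plugin's actual code.

```java
import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ExecutionException;
import gate.creole.metadata.Sharable;

public class SketchSharedStatePR extends AbstractLanguageAnalyser {

  // Made-up container for whatever data must be common to all duplicates.
  public static class SharedData {
    int documentsSeen = 0;
  }

  protected SharedData sharedData = new SharedData();

  // GATE copies @Sharable properties from the original PR to its duplicates,
  // so every duplicate ends up holding the same SharedData instance.
  @Sharable
  public void setSharedData(SharedData data) { this.sharedData = data; }
  public SharedData getSharedData() { return sharedData; }

  @Override
  public void execute() throws ExecutionException {
    // Every access to the shared object is synchronized on that object.
    synchronized (sharedData) {
      sharedData.documentsSeen++;
    }
  }
}
```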
For application, if this is a model for an algorithm running inside the Java VM, a proper implementation of the model may be sharable, but the LibSVM or Mallet models may not be sharable, or it may not be easy to find out whether they are. As long as the models are not too big, loading several copies may be doable.
For wrapped algorithms, each duplicate currently starts its own process which loads its own model; this may be good for speed but bad for memory use.
For models running in a server, it is up to the server how parallelism is handled.
We also need to make sure that anything that has to run once when we start processing a corpus, and once when we have finished processing it, gets properly invoked. Maybe a shared flag indicating whether the run-once-for-all-duplicates job has already been carried out, together with a synchronized approach for updating and checking that flag in each duplicate's started/finished callbacks, is enough; see the sketch below.
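A minimal sketch of that flag-based approach, assuming the PR implements GATE's ControllerAwarePR callbacks and that the RunOnceFlag object is the same instance in all duplicates (for example shared via a sharable property as sketched above). The globalCorpusSetup/globalCorpusTeardown/perDuplicateSetup methods are placeholders, not the plugin's real methods:

```java
import gate.Controller;
import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ControllerAwarePR;
import gate.creole.ExecutionException;

public abstract class SketchRunOncePR extends AbstractLanguageAnalyser
    implements ControllerAwarePR {

  // Made-up flag object; must be the same instance in all duplicates.
  public static class RunOnceFlag {
    boolean setupDone = false;
    boolean teardownDone = false;
  }

  protected RunOnceFlag runOnce = new RunOnceFlag();

  @Override
  public void controllerExecutionStarted(Controller c) throws ExecutionException {
    synchronized (runOnce) {
      if (!runOnce.setupDone) {
        runOnce.setupDone = true;
        globalCorpusSetup();     // placeholder: run exactly once per corpus run
      }
    }
    perDuplicateSetup();         // placeholder: run in every duplicate
  }

  @Override
  public void controllerExecutionFinished(Controller c) throws ExecutionException {
    synchronized (runOnce) {
      if (!runOnce.teardownDone) {
        runOnce.teardownDone = true;
        globalCorpusTeardown();  // placeholder: run exactly once per corpus run
      }
    }
  }

  @Override
  public void controllerExecutionAborted(Controller c, Throwable t) throws ExecutionException {
    controllerExecutionFinished(c);
  }

  protected abstract void globalCorpusSetup();
  protected abstract void globalCorpusTeardown();
  protected abstract void perDuplicateSetup();
}
```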
On closer inspection, this looks extremely hard, if not impossible, to achieve when the Mallet representation is used. It is probably safer for now not to allow training with any kind of duplication going on if a Mallet-based algorithm is used.
For the dense corpus, CorpusRepresentationVolatileDense2JsonStream, this may work if:
Current plan:
OK, this turns out to be more complicated, because we also really need to take care of when to do the (global and per-duplicate) initialisations before running on the corpus. Currently a special method gets invoked right before the first document is processed by any duplicate. If possible, this should instead be moved into the controller started callback. There are/were reasons why this was not possible, related to modular pipelines, which in turn were related to how controllers and GCP worked in the past, but this may have been fixed by now or be fixable. Investigate. See also https://github.com/GateNLP/gateplugin-ModularPipelines/issues/7
OK, https://github.com/GateNLP/gateplugin-ModularPipelines/issues/7 got a fix, so let's proceed on the assumption that this works correctly and change things so that everything happens in the controller started callback, but things that should be done globally for all duplicates only happen in the controller started callback of the first instance (duplicate 0).
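A rough sketch of that split between global and per-duplicate work in the controller started callback. How the duplicate id gets assigned is deliberately left out (here it is just a constructor parameter); the class and comments are illustrative, not the plugin's actual code:

```java
import gate.Controller;
import gate.creole.ExecutionException;

public class SketchDuplicateZero {

  // Assigned when the duplicate is created (mechanism not shown here);
  // the original instance is duplicate 0.
  protected final int duplicateId;

  public SketchDuplicateZero(int duplicateId) {
    this.duplicateId = duplicateId;
  }

  public void controllerExecutionStarted(Controller c) throws ExecutionException {
    if (duplicateId == 0) {
      // Global initialisation, done only once and only by duplicate 0,
      // e.g. creating the shared corpus representation or opening the export stream.
    }
    // Per-duplicate initialisation, done by every duplicate,
    // e.g. creating this duplicate's own engine instance.
  }
}
```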
First step towards this: 7c6c10383006c36a32b29ab30fc90f5ac2432c93
Make it use different engine instances for each duplicate when applying a model: b24d83e6e701e876210e5b396a03a705fb17828f
We need to figure out if and when we can use separate featureSpecification instances for each duplicate.
OK, our idea to subclass Alphabet and LabelAlphabet to make them synchronized does not work, because the Mallet Classifier class explicitly checks whether the class of the pipe's target alphabet is assignable from LabelAlphabet, and that check only succeeds for LabelAlphabet itself or its superclasses, never for subclasses.
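For reference, this is roughly why the subclassing trick fails; the exact form of Mallet's assertion may differ slightly, but the direction of Class.isAssignableFrom is what matters:

```java
import cc.mallet.types.Alphabet;
import cc.mallet.types.LabelAlphabet;

public class AssignabilityDemo {

  // The kind of synchronized subclass we would have liked to use.
  static class SynchronizedLabelAlphabet extends LabelAlphabet { }

  public static void main(String[] args) {
    Alphabet viaSubclass = new SynchronizedLabelAlphabet();
    Alphabet viaOriginal = new LabelAlphabet();

    // Mallet checks (roughly) targetAlphabet.getClass().isAssignableFrom(LabelAlphabet.class),
    // which is true only if the target alphabet's class is LabelAlphabet or a superclass of it.
    System.out.println(viaSubclass.getClass().isAssignableFrom(LabelAlphabet.class)); // false
    System.out.println(viaOriginal.getClass().isAssignableFrom(LabelAlphabet.class)); // true
  }
}
```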
Not sure why this assertion exists in Mallet in the first place, created an issue: https://github.com/mimno/Mallet/issues/132
OK, the only choice we have is to work around this by using the original LabelAlphabet class and trying to synchronize all the places where it gets used explicitly or implicitly.
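A minimal sketch of that workaround: keep the stock LabelAlphabet instance and make sure every place our own code touches it locks on the alphabet itself. The helper class below is purely illustrative; only the Mallet API calls are real:

```java
import cc.mallet.types.LabelAlphabet;

public class LabelAlphabetGuard {

  private final LabelAlphabet targetAlphabet;

  public LabelAlphabetGuard(LabelAlphabet targetAlphabet) {
    this.targetAlphabet = targetAlphabet;
  }

  // Every duplicate funnels label lookups through this method (or at least
  // locks on the same alphabet instance before touching it).
  public int labelIndex(String label) {
    synchronized (targetAlphabet) {
      return targetAlphabet.lookupIndex(label, true);
    }
  }

  // Reads are guarded too, because lookups by other duplicates may grow the alphabet.
  public int size() {
    synchronized (targetAlphabet) {
      return targetAlphabet.size();
    }
  }
}
```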
Training with concurrent duplicates of the PR and the Mallet corpus representation seems to work now, commit 29e153fedfc0f2e306121a0ca8f26e3f02ecd512
Exporting using Mallet corpus representation seems to work now, commit 753b3e3ca48d306c5af2d0c99278ae3da10b8e74
Exporting to the dense JSON format also seems to work, which means training based on dense JSON should work as well.
Looks good.
For training, we may want to create n separate sets and merge them before running the training step after the last document (this should work for both the in-memory and the on-disk approaches). The training itself will then run single-threaded, unless the library supports a multithreaded way of training, in which case we use several threads based on our own algorithmParameter or, by default, the same number of threads as we had duplicates. For application, just use a separate model instance for each duplicate.
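A sketch of the merge-then-train idea for the Mallet corpus representation, assuming each duplicate has filled its own InstanceList over a shared Pipe. Only the Mallet API calls are real; the surrounding class is made up, and MaxEntTrainer just stands in for whatever trainer is actually configured:

```java
import java.util.List;

import cc.mallet.classify.Classifier;
import cc.mallet.classify.ClassifierTrainer;
import cc.mallet.classify.MaxEntTrainer;
import cc.mallet.pipe.Pipe;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class MergeAndTrain {

  public static Classifier trainMerged(Pipe sharedPipe, List<InstanceList> perDuplicate) {
    // Merge: all per-duplicate lists were built over the same pipe (and thus the
    // same alphabets), so their already-piped instances can simply be appended.
    InstanceList merged = new InstanceList(sharedPipe);
    for (InstanceList part : perDuplicate) {
      for (Instance inst : part) {
        merged.add(inst);
      }
    }
    // Train once, single-threaded, on the merged data.
    ClassifierTrainer<? extends Classifier> trainer = new MaxEntTrainer();
    return trainer.train(merged);
  }
}
```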