For training and exporting, just doing everything in parallel except what has to be done in a synchronized way may be the better option: maybe we can somehow make sure that the data which must be common to all duplicates gets shared and that all accesses to it are synchronized.
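As a concrete illustration of that idea, here is a minimal sketch of how a PR could hold data shared by all duplicates via GATE's @Sharable property mechanism, with every access synchronized on the shared object. The SharedData class, the field names, and the counter are made up for the example; this is not the plugin's actual code.

```java
import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ExecutionException;
import gate.creole.metadata.Sharable;

public class SketchSharedStatePR extends AbstractLanguageAnalyser {

  // Made-up container for whatever data must be common to all duplicates.
  public static class SharedData {
    int documentsSeen = 0;
  }

  protected SharedData sharedData = new SharedData();

  // GATE copies @Sharable properties from the original PR to its duplicates,
  // so every duplicate ends up holding the same SharedData instance.
  @Sharable
  public void setSharedData(SharedData data) { this.sharedData = data; }
  public SharedData getSharedData() { return sharedData; }

  @Override
  public void execute() throws ExecutionException {
    // Every access to the shared object is synchronized on that object.
    synchronized (sharedData) {
      sharedData.documentsSeen++;
    }
  }
}
```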
For application, if this is a model for an algorithm running inside the Java VM, a proper implementation of the model may be sharable, but the LibSVM or Mallet models may not be sharable, or it may not be easy to find out whether they are. As long as the models are not too big, loading several copies may be doable.
For wrapped algorithms, each duplicate currently starts its own process which loads its own model; this may be good for speed but bad for memory use.
For models running in a server, it is up to the server how parallelism is handled.
We also need to make sure that anything that has to run once when we start processing a corpus, and once when we have finished processing it, gets properly invoked. Maybe a shared flag indicating whether the run-once-for-all-duplicates job has already been carried out, together with a synchronized approach for updating and checking that flag in each duplicate's started/finished callbacks, is enough; see the sketch below.
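A minimal sketch of that flag-based approach, assuming the PR implements GATE's ControllerAwarePR callbacks and that the RunOnceFlag object is the same instance in all duplicates (for example shared via a sharable property as sketched above). The globalCorpusSetup/globalCorpusTeardown/perDuplicateSetup methods are placeholders, not the plugin's real methods:

```java
import gate.Controller;
import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ControllerAwarePR;
import gate.creole.ExecutionException;

public abstract class SketchRunOncePR extends AbstractLanguageAnalyser
    implements ControllerAwarePR {

  // Made-up flag object; must be the same instance in all duplicates.
  public static class RunOnceFlag {
    boolean setupDone = false;
    boolean teardownDone = false;
  }

  protected RunOnceFlag runOnce = new RunOnceFlag();

  @Override
  public void controllerExecutionStarted(Controller c) throws ExecutionException {
    synchronized (runOnce) {
      if (!runOnce.setupDone) {
        runOnce.setupDone = true;
        globalCorpusSetup();     // placeholder: run exactly once per corpus run
      }
    }
    perDuplicateSetup();         // placeholder: run in every duplicate
  }

  @Override
  public void controllerExecutionFinished(Controller c) throws ExecutionException {
    synchronized (runOnce) {
      if (!runOnce.teardownDone) {
        runOnce.teardownDone = true;
        globalCorpusTeardown();  // placeholder: run exactly once per corpus run
      }
    }
  }

  @Override
  public void controllerExecutionAborted(Controller c, Throwable t) throws ExecutionException {
    controllerExecutionFinished(c);
  }

  protected abstract void globalCorpusSetup();
  protected abstract void globalCorpusTeardown();
  protected abstract void perDuplicateSetup();
}
```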
On closer inspection, this looks extremely hard, if not impossible, to achieve when the Mallet representation is used. It is probably safer for now not to allow training with any kind of duplication going on if a Mallet-based algorithm is used.
For the dense corpus, CorpusRepresentationVolatileDense2JsonStream, this may work if:
Current plan:
OK, this turns out to be more complicated, because we also really need to take care of when to do the (global and per-duplicate) initialisations before running on the corpus. Currently a special method gets invoked right before the first document is processed by any duplicate. If possible, this should instead be moved into the controller started callback. There are/were reasons why this was not possible, related to modular pipelines, which in turn were related to how controllers and GCP worked in the past, but this may have been fixed by now or be fixable. Investigate. See also https://github.com/GateNLP/gateplugin-ModularPipelines/issues/7
OK, https://github.com/GateNLP/gateplugin-ModularPipelines/issues/7 got a fix, so let's proceed on the assumption that this works correctly and change things so that everything happens in the controller started callback, but things that should be done globally for all duplicates only happen in the controller started callback of the first instance (duplicate 0).
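A rough sketch of that split between global and per-duplicate work in the controller started callback. How the duplicate id gets assigned is deliberately left out (here it is just a constructor parameter); the class and comments are illustrative, not the plugin's actual code:

```java
import gate.Controller;
import gate.creole.ExecutionException;

public class SketchDuplicateZero {

  // Assigned when the duplicate is created (mechanism not shown here);
  // the original instance is duplicate 0.
  protected final int duplicateId;

  public SketchDuplicateZero(int duplicateId) {
    this.duplicateId = duplicateId;
  }

  public void controllerExecutionStarted(Controller c) throws ExecutionException {
    if (duplicateId == 0) {
      // Global initialisation, done only once and only by duplicate 0,
      // e.g. creating the shared corpus representation or opening the export stream.
    }
    // Per-duplicate initialisation, done by every duplicate,
    // e.g. creating this duplicate's own engine instance.
  }
}
```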
First step towards this: 7c6c10383006c36a32b29ab30fc90f5ac2432c93
Make it use different engine instances for each duplicate when applying a model: b24d83e6e701e876210e5b396a03a705fb17828f
We need to figure out if and when we can use separate featureSpecification instances for each duplicate.
OK, our idea to subclass Alphabet and LabelAlphabet to make them synchronized does not work, because the Mallet Classifier class explicitly checks whether the class of the pipe's target alphabet is assignable from LabelAlphabet, and that check only succeeds for LabelAlphabet itself or its superclasses, never for subclasses.
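For reference, this is roughly why the subclassing trick fails; the exact form of Mallet's assertion may differ slightly, but the direction of Class.isAssignableFrom is what matters:

```java
import cc.mallet.types.Alphabet;
import cc.mallet.types.LabelAlphabet;

public class AssignabilityDemo {

  // The kind of synchronized subclass we would have liked to use.
  static class SynchronizedLabelAlphabet extends LabelAlphabet { }

  public static void main(String[] args) {
    Alphabet viaSubclass = new SynchronizedLabelAlphabet();
    Alphabet viaOriginal = new LabelAlphabet();

    // Mallet checks (roughly) targetAlphabet.getClass().isAssignableFrom(LabelAlphabet.class),
    // which is true only if the target alphabet's class is LabelAlphabet or a superclass of it.
    System.out.println(viaSubclass.getClass().isAssignableFrom(LabelAlphabet.class)); // false
    System.out.println(viaOriginal.getClass().isAssignableFrom(LabelAlphabet.class)); // true
  }
}
```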
Not sure why this assertion exists in Mallet in the first place, created an issue: https://github.com/mimno/Mallet/issues/132
OK, the only choice we have is to work around this by using the original LabelAlphabet class and trying to synchronize all the places where it gets used explicitly or implicitly.
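A minimal sketch of that workaround: keep the stock LabelAlphabet instance and make sure every place our own code touches it locks on the alphabet itself. The helper class below is purely illustrative; only the Mallet API calls are real:

```java
import cc.mallet.types.LabelAlphabet;

public class LabelAlphabetGuard {

  private final LabelAlphabet targetAlphabet;

  public LabelAlphabetGuard(LabelAlphabet targetAlphabet) {
    this.targetAlphabet = targetAlphabet;
  }

  // Every duplicate funnels label lookups through this method (or at least
  // locks on the same alphabet instance before touching it).
  public int labelIndex(String label) {
    synchronized (targetAlphabet) {
      return targetAlphabet.lookupIndex(label, true);
    }
  }

  // Reads are guarded too, because lookups by other duplicates may grow the alphabet.
  public int size() {
    synchronized (targetAlphabet) {
      return targetAlphabet.size();
    }
  }
}
```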
Training with concurrent duplicates of the PR and the Mallet corpus representation seems to work now, commit 29e153fedfc0f2e306121a0ca8f26e3f02ecd512
Exporting using Mallet corpus representation seems to work now, commit 753b3e3ca48d306c5af2d0c99278ae3da10b8e74
Exporting to the dense JSON format also seems to work, which means training based on dense JSON should work as well.
Looks good.
For training, we may want to create n separate sets and merge them before running the training step after the last document (this should work for both the in-memory and the on-disk approaches). The training itself will then run single-threaded, unless the library supports a multithreaded way of training, in which case we use several threads based on our own algorithmParameter or, by default, the same number of threads as we had duplicates. For application, just use a separate model instance for each duplicate.
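A sketch of the merge-then-train idea for the Mallet corpus representation, assuming each duplicate has filled its own InstanceList over a shared Pipe. Only the Mallet API calls are real; the surrounding class is made up, and MaxEntTrainer just stands in for whatever trainer is actually configured:

```java
import java.util.List;

import cc.mallet.classify.Classifier;
import cc.mallet.classify.ClassifierTrainer;
import cc.mallet.classify.MaxEntTrainer;
import cc.mallet.pipe.Pipe;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class MergeAndTrain {

  public static Classifier trainMerged(Pipe sharedPipe, List<InstanceList> perDuplicate) {
    // Merge: all per-duplicate lists were built over the same pipe (and thus the
    // same alphabets), so their already-piped instances can simply be appended.
    InstanceList merged = new InstanceList(sharedPipe);
    for (InstanceList part : perDuplicate) {
      for (Instance inst : part) {
        merged.add(inst);
      }
    }
    // Train once, single-threaded, on the merged data.
    ClassifierTrainer<? extends Classifier> trainer = new MaxEntTrainer();
    return trainer.train(merged);
  }
}
```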