Make it possible to use the original format data

GateNLP / gate-lf-pytorch-json

PyTorch wrapper for the LearningFramework GATE plugin

Apache License 2.0

Make it possible to use the original format data #45

Closed: johann-petrak closed this issue 5 years ago

johann-petrak commented 5 years ago

Make it possible to use the original format data so that the model can make use of the actual text and use character embeddings or similar. This will be based on Xingyi's code from https://github.com/GateNLP/AbuseDetection. Created branch 1905useorig for this.

johann-petrak commented 5 years ago

TODO: initially there will be a dependency on allennlp for use with the ELMo models; figure out a way to remove that dependency. Either make modelwrapperdefault depend on it only if an ELMo model is actually used, or have a look to see if we can just merge or reimplement the code needed. That way, only an actual ELMo-based model would depend on that library.
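A minimal sketch of the first option (import allennlp only when an ELMo model is configured). The config dict and the helper names `uses_elmo`, `import_optional` and `load_elmo_backend` are assumptions for illustration, not code from this repository:

```python
import importlib


def uses_elmo(config):
    """Return True if the (hypothetical) config dict asks for an ELMo model."""
    return bool(config.get("elmo"))


def import_optional(module_name, purpose):
    """Import a module on demand and fail with a readable message if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for {purpose} but is not installed"
        ) from exc


def load_elmo_backend(config):
    """Import allennlp only when an ELMo model is actually configured."""
    if not uses_elmo(config):
        return None  # non-ELMo models never touch allennlp
    return import_optional("allennlp", "ELMo-based models")
```

With something like this, the wrapper itself has no top-level allennlp import, so installations that never use an ELMo module would not need the library.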

johann-petrak commented 5 years ago

Probably best to delegate this to the module somehow: if the config parameters elmo or orig are set, it means that we use a module that needs the original data. In that case we will expect the module to have the methods that the default wrapper needs in _apply_model.
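Roughly what that delegation check could look like. This is only a sketch: the method name `predict_orig` is a placeholder assumption (the actual methods the wrapper needs are not listed here), and the config is assumed to be a plain dict:

```python
# Hypothetical names of the methods a module must offer to work on original data.
REQUIRED_ORIG_METHODS = ("predict_orig",)


def needs_original_data(config):
    """True if either the elmo or the orig config parameter is set."""
    return bool(config.get("elmo") or config.get("orig"))


def check_module_for_orig(module, required=REQUIRED_ORIG_METHODS):
    """Raise early if the module lacks a method the wrapper would call later."""
    missing = [name for name in required
               if not callable(getattr(module, name, None))]
    if missing:
        raise TypeError(
            f"{type(module).__name__} is missing required methods: {missing}"
        )
    return module
```

The point of checking up front is to fail when the module is loaded rather than in the middle of applying the model.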

johann-petrak commented 5 years ago

The code by XS for training an elmo model uses a new method to split the validation set, but maybe we can make use of the batch generator code instead?

johann-petrak commented 5 years ago

After merging the fix for #34 the default doc classification results stay the same.

Problem when testing the module TextClassCnnSingleElmo: no proper error handling if the elmo model file is missing / not specified.
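One possible shape for that error handling, as a sketch. It assumes the elmo config parameter holds the path to the model file, which may not match how TextClassCnnSingleElmo actually receives it; `resolve_elmo_path` is a made-up helper name:

```python
import os


def resolve_elmo_path(config):
    """Return the configured elmo model path or raise a descriptive error."""
    path = config.get("elmo")  # assumption: the parameter value is a file path
    if not path:
        raise ValueError("no elmo model file specified (config parameter 'elmo')")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"elmo model file not found: {path}")
    return path
```

This separates the "not specified" case from the "file missing" case, so the user sees which of the two went wrong.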

johann-petrak commented 5 years ago

TODO 1: once we know we do not need the converted data, avoid generating it. TODO 2: see if we can move the code to generate the target indices from the literal target strings into the module for all training situations.
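For reference, generating target indices from literal target strings amounts to something like the following sketch. The function names are illustrative, and sorting the labels to get a stable order is an assumption, not necessarily what the existing code does:

```python
def build_target_index(targets):
    """Map each distinct target string to an integer index (sorted for stability)."""
    return {label: idx for idx, label in enumerate(sorted(set(targets)))}


def targets_to_indices(targets, label2idx):
    """Convert literal target strings to their integer indices."""
    return [label2idx[label] for label in targets]


labels = ["pos", "neg", "pos"]
label2idx = build_target_index(labels)
indices = targets_to_indices(labels, label2idx)
```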

johann-petrak commented 5 years ago

TODO 2: for now we do this in the wrapper, since the proper way to move this into the module is to move the whole evaluation code there (either inherit a proper default or override it in a specific way, e.g. for CNNs).

johann-petrak commented 5 years ago

OK, TODO 1 works: if we have elmo/orig, no conversion is done and therefore no converted dataset is kept around.
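The behaviour described here corresponds roughly to the sketch below. The names `prepare_data` and `convert` and the returned dict layout are assumptions for illustration, not the wrapper's real interface:

```python
def prepare_data(config, orig_data, convert):
    """Keep only the original data when elmo/orig is set; otherwise also convert."""
    if config.get("elmo") or config.get("orig"):
        # The module works on the original data, so the converter is never called.
        return {"orig": orig_data, "converted": None}
    return {"orig": orig_data, "converted": convert(orig_data)}
```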

johann-petrak commented 5 years ago

Quick test of application: seems to work. This can get merged now.