GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Add START and STOP symbol representations to sequence learning problems. #49

Open johann-petrak opened 7 years ago

johann-petrak commented 7 years ago

This should be doable independently of the algorithm used, i.e. for both classifiers and sequence learners.

johann-petrak commented 7 years ago

cfa27377f4c48962ebf5a60a2951893073c55e9f adds START and STOP features for attribute lists if the within annotation type is specified. In this case an additional feature is generated if a list element starts where the within annotation starts or ends where the within annotation ends.

johann-petrak commented 7 years ago

8d5d0ecd6789975ed25728c4d1e30e0ded49da91 adds START and STOP features for normal attributes if a within type is specified.
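The boundary check described in these two commits can be illustrated with a minimal Python sketch. It is not the plugin's Java code: `Span`, `boundary_features` and the plain `START`/`STOP` keys are hypothetical names chosen for illustration, and the offsets are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation with character offsets."""
    start: int
    end: int


def boundary_features(instance: Span, within: Span) -> dict:
    """Return extra features for an instance that touches the edges of `within`."""
    feats = {}
    if instance.start == within.start:
        feats["START"] = 1.0  # instance begins where the within annotation begins
    if instance.end == within.end:
        feats["STOP"] = 1.0   # instance ends where the within annotation ends
    return feats


# A "sentence" containing three "tokens" (offsets are invented for the example).
sentence = Span(0, 20)
tokens = [Span(0, 5), Span(6, 12), Span(13, 20)]
features = [boundary_features(t, sentence) for t in tokens]
```

Only the first and last token receive an extra feature; tokens in the middle are left unchanged.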

johann-petrak commented 7 years ago

Note that START/STOP features on instances are different from START/STOP elements for sequence tagging: for sequence tagging, the START/STOP symbols must be separate instances in the instance list so that the probabilities for moving from START to the first and moving from the last to STOP can be calculated. Have to check if the CRF implementation of Mallet already does this correctly anyway.
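To make the distinction concrete, here is a small Python sketch in which START and STOP are separate elements of each label sequence, so that transitions out of START and into STOP can be counted. It uses a simple relative-frequency estimate purely for illustration; it says nothing about how Mallet's CRF handles this internally, and the symbol names and toy label sequences are assumptions.

```python
from collections import Counter

# Hypothetical boundary symbols; any labels not used by the real tag set would do.
START, STOP = "<START>", "<STOP>"


def pad(labels):
    """Add START and STOP as separate elements around a label sequence."""
    return [START, *labels, STOP]


def transition_counts(sequences):
    """Count label bigrams, including START->first and last->STOP."""
    counts = Counter()
    for labels in sequences:
        padded = pad(labels)
        counts.update(zip(padded, padded[1:]))
    return counts


def transition_prob(counts, src, dst):
    """Relative-frequency estimate of P(dst | src)."""
    total = sum(c for (s, _), c in counts.items() if s == src)
    return counts[(src, dst)] / total if total else 0.0


# Toy training data (invented for the example).
sequences = [["B", "I", "O"], ["O", "B", "I"], ["B", "O", "O"]]
counts = transition_counts(sequences)
p_start_b = transition_prob(counts, START, "B")  # sequences that begin with B
p_o_stop = transition_prob(counts, "O", STOP)    # O followed by end of sequence
```

A START/STOP feature on an instance cannot provide these estimates, because the boundary never appears as a state of its own in the sequence.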

johann-petrak commented 7 years ago

The current START/STOP feature names cannot be mapped back to any feature specification, for example when trying to export to ARFF. This is expected, since no specification exists for them. We need to return a dummy feature specification for "invented" features: the method FeatureExtraction.lookupAttributeForFeatureName needs to know about the features created by the LF itself.

johann-petrak commented 7 years ago

The ARFF problem has been fixed for now by handling the case separately where the reverse lookup of the specification returns nothing and the feature is a START/STOP feature. In that case, we just use the default numeric attribute, which works.
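The fallback can be sketched roughly as follows in Python. This is only an illustration of the idea, not the plugin's Java code: the `specs` mapping, the function name `arff_attribute_type` and the example feature names other than `null|L-2|STOP` are assumptions.

```python
def arff_attribute_type(feature_name, specs):
    """Return an ARFF attribute type for a feature name.

    `specs` is a hypothetical reverse-lookup table from feature name to the
    attribute type taken from its feature specification.
    """
    spec_type = specs.get(feature_name)
    if spec_type is not None:
        return spec_type
    # No specification found: check whether the LF itself invented the feature.
    marker = feature_name.rsplit("|", 1)[-1]
    if marker in ("START", "STOP"):
        return "numeric"  # default numeric attribute for START/STOP features
    raise KeyError(f"no feature specification for {feature_name!r}")


# Hypothetical specifications for two ordinary features.
specs = {"null|L0|string": "string", "null|L0|length": "numeric"}
stop_type = arff_attribute_type("null|L-2|STOP", specs)
string_type = arff_attribute_type("null|L0|string", specs)
```

Features that are neither specified nor recognised as START/STOP still raise an error, so a genuinely broken lookup is not silently hidden.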

johann-petrak commented 7 years ago

Closing this for now on the assumption that at least START is handled correctly internally in the Mallet CRF. Reopen or open another bug if we find out that this is not the case.

johann-petrak commented 7 years ago

Currently the STOP/START features are independent of the actual features from which they are generated. This means that the same feature can get created from several different attribute specifications (which also triggers a warning each time that the feature has already been set). The current schema of the feature name is null|L-2|STOP; we should probably change this to something like null|L-2|origFeatureName|||STOP, where ||| stands for some separator that is different from the one used for value separation.
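A possible shape for such a naming scheme, sketched in Python. The separator `@@` is a hypothetical choice standing in for |||, and the function names and the attribute names `string` and `category` are invented for the example.

```python
VALUE_SEP = "|"    # separator already used inside feature names
MARKER_SEP = "@@"  # hypothetical separator reserved for the START/STOP marker


def boundary_feature_name(prefix, orig_feature, marker):
    """Build a START/STOP feature name that includes the originating feature."""
    return f"{prefix}{VALUE_SEP}{orig_feature}{MARKER_SEP}{marker}"


def split_boundary_feature(name):
    """Split a name into (originating feature name, marker)."""
    base, sep, marker = name.rpartition(MARKER_SEP)
    if not sep:
        raise ValueError(f"not a START/STOP feature name: {name!r}")
    return base, marker


# Two attribute specifications at the same list position no longer collide.
name_a = boundary_feature_name("null|L-2", "string", "STOP")
name_b = boundary_feature_name("null|L-2", "category", "STOP")
```

Because the marker separator differs from the value separator, the originating feature name can be recovered unambiguously, which would also help with the reverse lookup needed for ARFF export.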