GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Simple filtering of nominal values and ngrams #105

johann-petrak closed 5 years ago

johann-petrak commented 5 years ago

See #104. For now, just implement filtering based on a missing featureName4Value value (null or 0.0): if the attribute is defined to use such a feature and the feature is not present or is 0.0, skip the feature for that attribute and do not generate any ngrams that include it. For ngrams, this would replace the current behavior of imputing 1.0 in such a situation.

Note that for ngrams with n>1, filtering is more complex: we have to drop every ngram that would include the filtered string, rather than treating the string as if it did not exist!

This kind of filtering would then be the only effective way to avoid generating ngrams from tokens that are not actually consecutive: if we naively just remove the Token annotation, we would create ngrams of the now-adjacent tokens, which should probably be avoided.
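
To make the difference concrete, here is a minimal Python sketch (not the plugin's actual implementation). It assumes a token sequence is given as a plain list in which a filtered token is represented by None; the function names ngrams_skipping_filtered and ngrams_naive_removal are hypothetical and chosen only for this illustration.

```python
from typing import List, Optional, Tuple


def ngrams_skipping_filtered(
    tokens: List[Optional[str]], n: int
) -> List[Tuple[str, ...]]:
    """Return n-grams of consecutive tokens, dropping any n-gram
    whose window contains a filtered token.

    Assumption of this sketch: a filtered token is represented by None.
    """
    result = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Drop the whole n-gram if any position in the window is filtered.
        if any(tok is None for tok in window):
            continue
        result.append(tuple(window))
    return result


def ngrams_naive_removal(
    tokens: List[Optional[str]], n: int
) -> List[Tuple[str, ...]]:
    """Remove filtered tokens first, then build n-grams.

    This joins tokens that were not adjacent in the original text.
    """
    kept = [tok for tok in tokens if tok is not None]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]


tokens = ["the", "quick", None, "fox", "jumps"]
print(ngrams_skipping_filtered(tokens, 2))
# [('the', 'quick'), ('fox', 'jumps')]
print(ngrams_naive_removal(tokens, 2))
# [('the', 'quick'), ('quick', 'fox'), ('fox', 'jumps')]
```

The naive variant produces ('quick', 'fox') even though those two tokens were separated by a filtered token in the original sequence; the window-based variant does not.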

johann-petrak commented 5 years ago

For now, we implement this such that only a null/missing value for the featureName4Value feature causes filtering; 0.0 is handled as a proper value.
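
The rule could be sketched as follows in Python (again only an illustration, not the plugin's code). It assumes an annotation's features are available as a dictionary and that the value feature can be converted to a float; the function name value_or_filtered and the feature name "weight" are hypothetical.

```python
from typing import Dict, Optional


def value_or_filtered(features: Dict[str, object], value_feature: str) -> Optional[float]:
    """Return the numeric value stored under value_feature,
    or None if the annotation should be filtered.

    Only a missing or null value triggers filtering; 0.0 is a proper value.
    """
    raw = features.get(value_feature)
    if raw is None:
        # Feature absent or explicitly null: filter this annotation.
        return None
    # Assumption of this sketch: the stored value is convertible to float.
    return float(raw)


print(value_or_filtered({"weight": 0.0}, "weight"))   # 0.0 (kept)
print(value_or_filtered({"weight": None}, "weight"))  # None (filtered)
print(value_or_filtered({}, "weight"))                # None (filtered)
```

Note that the check is an explicit comparison against None rather than a truthiness test, since a truthiness test would also filter 0.0.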

johann-petrak commented 5 years ago

This is good enough for now.