GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through PyTorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Add a proper TFIDF transformation #104

Open · johann-petrak opened this issue 5 years ago

johann-petrak commented 5 years ago

Currently we use the CorpusStats plugin to create TF-IDF scores per token, which we can then inject into the sparse vector instead of the plain term frequency via the featureName4Value option of an attribute. However, this does not work for n-grams with n>1: in that case we multiply the per-token scores, which is not really what we want (the proper value would be based on the tf and df of the n-gram itself).
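To make the n-gram problem concrete, here is a minimal sketch (plain Java, all names hypothetical, not plugin code) of standard per-token TF-IDF and the score multiplication we currently end up with for n>1:

```java
// Minimal sketch (hypothetical names): per-token TF-IDF as CorpusStats
// produces it, and the current multiplication for n-grams.
public class TfIdfSketch {
  // Standard TF-IDF for a single token:
  // tf * log(N / df), with N = number of documents, df = document frequency.
  static double tfidf(double tf, long df, long nDocs) {
    return tf * Math.log((double) nDocs / (double) df);
  }

  public static void main(String[] args) {
    long nDocs = 10_000;
    // Two tokens forming a bigram; each has its own tf and df.
    double s1 = tfidf(3.0, 500, nDocs);  // score for token 1
    double s2 = tfidf(2.0, 20, nDocs);   // score for token 2
    // Current behaviour for n>1: multiply the per-token scores.
    double bigramScore = s1 * s2;
    // What we would actually want is tfidf(tf, df, nDocs) computed from the
    // tf and df of the bigram itself, which CorpusStats does not provide.
    System.out.printf("s1=%.3f s2=%.3f product=%.3f%n", s1, s2, bigramScore);
  }
}
```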

Maybe we can add the CorpusStats code as a subroutine that gathers the statistics on the fly during feature extraction, and then update the generated instances based on those statistics. This would of course only work for Mallet representations, not for any representation that gets written out immediately.
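A minimal sketch of that two-pass idea, assuming Mallet's in-memory API (cc.mallet.types.InstanceList / FeatureVector); building a new FeatureVector rather than mutating the existing one side-steps any issues with locked instances:

```java
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

// Hedged sketch of the proposed two-pass approach: count document
// frequencies while the in-memory instances exist, then rewrite each
// FeatureVector with tf*idf values. This cannot work for representations
// that are written out immediately.
public class TfIdfTwoPass {
  // Pass 1: df per feature index, incremented once per instance containing it.
  static int[] countDf(InstanceList instances, int alphabetSize) {
    int[] df = new int[alphabetSize];
    for (Instance inst : instances) {
      FeatureVector fv = (FeatureVector) inst.getData();
      for (int loc = 0; loc < fv.numLocations(); loc++) {
        df[fv.indexAtLocation(loc)]++;
      }
    }
    return df;
  }

  // Pass 2: build a transformed copy of a vector (tf -> tf*idf).
  // df is at least 1 for every feature present in the counted list.
  static FeatureVector toTfIdf(FeatureVector fv, int[] df, int nDocs) {
    Alphabet dict = fv.getAlphabet();
    int n = fv.numLocations();
    int[] idx = new int[n];
    double[] vals = new double[n];
    for (int loc = 0; loc < n; loc++) {
      idx[loc] = fv.indexAtLocation(loc);
      vals[loc] = fv.valueAtLocation(loc)
          * Math.log((double) nDocs / df[idx[loc]]);
    }
    return new FeatureVector(dict, idx, vals);
  }
}
```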

johann-petrak commented 5 years ago

We would have to gather statistics on a per-attribute basis, probably controlled by some new flag in the attribute declaration. Currently, we set the sparse vector element of a nominal attribute to 1.0, and the sparse vector element of an n-gram to the number of times the n-gram occurs within the span.

We should probably make n-gram counting configurable (i.e. allow just using 1.0 there as well), as in the sketch below.
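A tiny sketch of what that switch could look like (names hypothetical):

```java
// Hedged sketch: a configurable mode for how n-gram vector elements are set.
public class NgramValue {
  enum Mode { BINARY, COUNT } // BINARY: 1.0 like nominal attrs; COUNT: occurrences in span

  static double value(int occurrencesInSpan, Mode mode) {
    return (mode == Mode.BINARY) ? 1.0 : (double) occurrencesInSpan;
  }
}
```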

Then we need a per-attribute stats object that can be updated concurrently, and the feature extraction code needs to know that we want to do this. Finally, we have to add a transformer stage to the pipeline that transforms the counts based on the stats, according to one of several configurable methods.
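A sketch of such a stats object using only java.util.concurrent, so extraction threads can update it without explicit locking (all names hypothetical):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hedged sketch of a per-attribute stats object that can be updated
// concurrently during feature extraction.
public class AttributeStats {
  private final AtomicLong nDocs = new AtomicLong();
  private final ConcurrentHashMap<String, LongAdder> df = new ConcurrentHashMap<>();

  // Called once per instance with the set of distinct values seen in it.
  public void addInstance(Set<String> valuesInInstance) {
    nDocs.incrementAndGet();
    for (String v : valuesInInstance) {
      df.computeIfAbsent(v, k -> new LongAdder()).increment();
    }
  }

  // One of several possible transformation methods the transformer stage
  // could apply after extraction has finished.
  public double tfidf(String value, double tf) {
    LongAdder a = df.get(value);
    long d = (a == null) ? 0 : a.sum();
    return (d == 0) ? 0.0 : tf * Math.log((double) nDocs.get() / d);
  }
}
```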

Finally, we should also allow filtering based on some kind of threshold, e.g. not including a feature if its df is too small or too high. For that we would have to check whether we can actually remove a feature from a sparse vector at that stage, or whether just setting it to 0.0 would work equally well (possibly creating a significant number of explicitly stored zeroes).
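For the removal option, a sketch assuming Mallet's FeatureVector: since the vector is backed by parallel index/value arrays, removing a feature means rebuilding the vector without the filtered locations, whereas setting 0.0 keeps the location as an explicitly stored zero:

```java
import java.util.Arrays;
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;

// Hedged sketch: drop features whose df falls outside [minDf, maxDf] by
// rebuilding the sparse vector without them.
public class DfFilter {
  static FeatureVector filterByDf(FeatureVector fv, int[] df, int minDf, int maxDf) {
    Alphabet dict = fv.getAlphabet();
    int n = fv.numLocations();
    int[] idx = new int[n];
    double[] vals = new double[n];
    int kept = 0;
    for (int loc = 0; loc < n; loc++) {
      int i = fv.indexAtLocation(loc);
      if (df[i] >= minDf && df[i] <= maxDf) {
        idx[kept] = i;
        vals[kept] = fv.valueAtLocation(loc);
        kept++;
      }
    }
    return new FeatureVector(dict, Arrays.copyOf(idx, kept),
                             Arrays.copyOf(vals, kept));
  }
}
```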

The problem with all of this is that, to make it work properly, a large number of possible approaches and options would have to be supported.

johann-petrak commented 5 years ago

Maybe we should first try to implement some simple filtering, based on just the DF stats of individual unigrams, even for the n-grams: just filter out all n-grams or values where the featureName4Value value is e.g. 0.0 or null. This would allow us, for now, to use a simple approach: run CorpusStats and then a Groovy script to set the feature for filtering.

Currently, we impute 1.0 if featureName4Value is used but the feature is not found; we could just change this to do the filtering instead.
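The change could be as small as the following hedged sketch (assuming the GATE Annotation API; the helper name and placement are hypothetical):

```java
import gate.Annotation;

// Hedged sketch: filter instead of imputing 1.0 when the feature named by
// featureName4Value is absent or zero on the annotation.
public class ValueFilter {
  static boolean keepFeature(Annotation ann, String featureName4Value) {
    if (featureName4Value == null || featureName4Value.isEmpty()) {
      return true;  // option not used: keep the feature
    }
    Object score = ann.getFeatures().get(featureName4Value);
    if (score == null) {
      return false; // previously: impute 1.0; proposed: drop the feature
    }
    // Assumes the feature holds a numeric score, as CorpusStats produces.
    return ((Number) score).doubleValue() != 0.0; // 0.0 also means "filter"
  }
}
```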