Alternative feature extraction approach based on positional unigrams

Motivation

n-gram features can be too sparse to be practical, and the number of distinct features grows very quickly with n. This ticket implements a very simple alternative strategy that may work better in some situations and gives users a wider choice of extraction methods.

Implementation

Positional features always represent one word/MWE with only one term. In addition to the word itself, however, its position within the context window is stored. The example below clarifies this:

Input text:

This ice cream is sweet.

Input MWE vocabulary:

ice cream

Features generated using trigrams:

this _@_ice
ice this_@_cream
cream ice_@_is
is cream_@_sweet
ice cream this_@_is

Features generated using positional features with a context window size of n=1 (the same window as for trigrams):

this ice_+1
ice this_-1
ice cream_+1
cream ice_-1
cream is_+1
is cream_-1
is sweet_+1
ice cream this_-1
ice cream is_+1

Some of the features generated using positional features with a context window size of n=2 (the same window as for 5-grams):

this ice_+1
this cream_+2
ice this_-1
ice cream_+1
ice is_+2
...
...
ice cream this_-1
ice cream is_+1
ice cream sweet_+2
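
The examples above can be reproduced with a minimal Python sketch. It is not the JoSimText implementation: the function name `positional_features`, its parameters, and the tokenization (lowercased, punctuation dropped, split on whitespace) are assumptions made for illustration. Unlike the n=1 list above, the sketch also emits a feature for the last token (`sweet is_-1`).

```python
def positional_features(tokens, mwes=(), n=1):
    """Return (term, feature) pairs, where a feature looks like 'word_+k' or 'word_-k'.

    Hypothetical helper for illustration only, not part of JoSimText.
    """
    # Candidate terms as (start, end) spans: every single token first ...
    spans = [(i, i + 1) for i in range(len(tokens))]
    # ... then every occurrence of an MWE from the vocabulary.
    for mwe in mwes:
        parts = mwe.split()
        if not parts:
            continue
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                spans.append((i, i + len(parts)))

    pairs = []
    for start, end in spans:
        term = " ".join(tokens[start:end])
        # Left context, counted from the first token of the term.
        for k in range(1, n + 1):
            if start - k >= 0:
                pairs.append((term, f"{tokens[start - k]}_-{k}"))
        # Right context, counted from the last token of the term.
        for k in range(1, n + 1):
            if end - 1 + k < len(tokens):
                pairs.append((term, f"{tokens[end - 1 + k]}_+{k}"))
    return pairs


# Assumed preprocessing: lowercase, no punctuation, whitespace tokenization.
tokens = "this ice cream is sweet".split()
for term, feature in positional_features(tokens, mwes=["ice cream"], n=1):
    print(term, feature)
```

Each feature is a single context word plus its offset, so the feature vocabulary grows with the number of words times the window size, rather than with the number of word combinations as for n-grams.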
