Motivation
n-gram features can be too sparse to be practical, and the number of such features grows very quickly. This ticket implements a simple alternative strategy which may work better in some situations and gives users more choice of extraction methods.
Implementation
Positional features always represent one word/MWE with only one term. However, in addition to the word itself, its position within the context window is stored. The example below clarifies this:
Input text:
This ice cream is sweet.
Input MWE vocabulary:
ice cream
Features generated using trigrams:
this _@_ice
ice this_@_cream
cream ice_@_is
is cream_@_sweet
ice cream this_@_is
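The ticket does not include code, so here is a minimal Python sketch of how the trigram features above might be generated; the function name `trigram_features`, the `(term, feature)` pair representation, and the MWE span matching are assumptions:

```python
from typing import List, Tuple

def trigram_features(tokens: List[str], mwes: List[List[str]]) -> List[Tuple[str, str]]:
    """Generate (term, feature) pairs where each feature is the
    left and right neighbor joined by the _@_ separator."""
    # Spans to extract features for: every single token, plus every
    # occurrence of a multi-word expression from the vocabulary.
    spans = [(i, i + 1, tok) for i, tok in enumerate(tokens)]
    for mwe in mwes:
        for i in range(len(tokens) - len(mwe) + 1):
            if tokens[i:i + len(mwe)] == mwe:
                spans.append((i, i + len(mwe), " ".join(mwe)))

    feats = []
    for start, end, term in spans:
        # Missing neighbors at the sentence boundary are left empty,
        # matching the "_@_ice" form in the example above.
        left = tokens[start - 1] if start > 0 else ""
        right = tokens[end] if end < len(tokens) else ""
        feats.append((term, f"{left}_@_{right}"))
    return feats

# e.g. ("ice", "this_@_cream") appears among the generated pairs
feats = trigram_features("this ice cream is sweet".split(), [["ice", "cream"]])
```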
Features generated using positional features with a context window of size n=1 (covering the same context as trigrams):
this ice_+1
ice this_-1
ice cream_+1
cream ice_-1
cream is_+1
is cream_-1
is sweet_+1
ice cream this_-1
ice cream is_+1
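The positional scheme above can be sketched in Python as well; again the function name `positional_features` and the `(term, feature)` pair representation are assumptions, not the ticket's actual implementation:

```python
from typing import List, Tuple

def positional_features(tokens: List[str], mwes: List[List[str]],
                        n: int = 1) -> List[Tuple[str, str]]:
    """Generate (term, feature) pairs where each feature encodes a
    context word together with its signed offset from the term."""
    # Spans to extract features for: every single token, plus every
    # occurrence of a multi-word expression from the vocabulary.
    spans = [(i, i + 1, tok) for i, tok in enumerate(tokens)]
    for mwe in mwes:
        for i in range(len(tokens) - len(mwe) + 1):
            if tokens[i:i + len(mwe)] == mwe:
                spans.append((i, i + len(mwe), " ".join(mwe)))

    feats = []
    for start, end, term in spans:
        # Offsets are measured from the span edges, so for an MWE the
        # -1 position is the word before the whole expression.
        for d in range(1, n + 1):
            left, right = start - d, end - 1 + d
            if left >= 0:
                feats.append((term, f"{tokens[left]}_-{d}"))
            if right < len(tokens):
                feats.append((term, f"{tokens[right]}_+{d}"))
    return feats

# e.g. ("ice", "this_-1") appears among the generated pairs
feats = positional_features("this ice cream is sweet".split(), [["ice", "cream"]], n=1)
```

Note that unlike the n-gram case, enlarging the window adds at most 2n features per term instead of multiplying the number of distinct feature strings, which is the sparsity advantage the motivation describes.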
Some features generated using positional features with a context window of size n=2 (covering the same context as 5-grams):
this ice_+1
this cream_+2
ice cream this_-1
ice cream is_+1
ice cream sweet_+2