From trigrams to n-grams

Motivation

This is part of a small series of improvements that aim to increase the number of feature extractors available natively in Spark. Currently, most such extractors are part of lefex or the "classical" jobimtext project and are thus only available on Hadoop.

At present, only trigrams can be computed. It would make sense to let users specify n in the n-gram model (the size of the context window).

Implementation

Add a command-line argument to https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala that specifies the size n of the left and right context. The default value n=1 corresponds to the trigrams that are used now. In general, a context of n words on each side yields (2n+1)-grams: for n=2 we obtain 5-grams, for n=3 we obtain 7-grams, etc.
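To make the proposal concrete, here is a minimal sketch in plain Scala (no Spark) of how a term and its context could be extracted for an arbitrary n. The object name `NgramContextSketch`, the function `termContexts`, the `@` placeholder for the target word, and the `_` separator are all assumptions made for illustration; they are not taken from the existing Text2TrigramTermContext code, whose output format may differ.

```scala
object NgramContextSketch {
  // Hypothetical placeholder marking where the target term was in the window.
  val Hole = "@"

  /** Emits (term, context) pairs for every token that has n neighbours on
    * both sides. Each window spans 2n+1 tokens, so n=1 gives trigrams.
    */
  def termContexts(tokens: Seq[String], n: Int = 1): Seq[(String, String)] = {
    require(n >= 1, "context size n must be at least 1")
    val width = 2 * n + 1
    tokens
      .sliding(width)
      .filter(_.length == width) // drop the short window of too-short inputs
      .map { window =>
        val term = window(n) // the middle token of the window
        val context =
          (window.take(n) ++ Seq(Hole) ++ window.drop(n + 1)).mkString("_")
        (term, context)
      }
      .toList
  }
}
```

With the tokens `the quick brown fox jumps`, calling `termContexts(tokens, 1)` yields three pairs, the first being `("quick", "the_@_brown")`, while `termContexts(tokens, 2)` yields the single pair `("brown", "the_quick_@_fox_jumps")`. In a Spark job the same function could be applied per sentence inside a `flatMap`, with n read from the new command-line argument.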
